Opened 5 years ago

Last modified 5 months ago

#10454 in-progress task

New scheduler: substantial performance drop

Reported by: stippi Owned by: pdziepak
Priority: high Milestone: R1
Component: System/Kernel Version: R1/Development
Keywords: Cc:
Blocked By: #10487 Blocking:
Has a Patch: no Platform: All

Description (last modified by stippi)

I've already made Pawel aware of it, but wanted it logged, that the new scheduler comes with a substantial drop in performance for certain work loads. I've timed the Haiku revision before the merge building HaikuDepot (after jam clean). These are my timings:

jam -q -j4 HaikuDepot

Haiku old scheduler:
real    0m58.340s
user    2m25.305s
sys     0m52.408s

real    0m57.608s
user    2m25.175s
sys     0m52.455s


Haiku, new scheduler:
real    1m40.426s
user    2m23.923s
sys     0m42.496s

real    1m40.283s
user    2m23.842s
sys     0m42.488s


Ubuntu 12.04:
real	0m24.603s
user	1m5.592s
sys	0m5.376s

real	0m24.620s
user	1m5.720s
sys	0m5.268s

The system time actually improved, but the overall time for the concurrent jobs is much worse than before. Looing at ActivityMonitor, the thread migration between cores seems to be the problem, because each core peaks only for short times, while with the old scheduler, the cores stay peaked.

Both Haiku images had been compiled with KDEBUG_LEVEL = 0 and the code was compiled on a BFS partition that has query support. In case of Linux, the code was on a ReiserFS partition on the same SSD. Compiler was GCC2 in all tests.

Attachments (1)

screenshot_clipping.png (46.6 KB) - added by stippi 5 years ago.
Activity Monitor graphs for hrev46825

Download all attachments as: .zip

Change History (28)

comment:1 Changed 5 years ago by stippi

Description: modified (diff)

comment:2 Changed 5 years ago by stippi

Description: modified (diff)

comment:3 Changed 5 years ago by stippi

Description: modified (diff)

comment:4 Changed 5 years ago by bonefish

Just to rule out this is a time tracking glitch, did you check the real time on wall clock for comparison?

Both user and system time are comparable or better. The detrimental effect of core hopping is mainly with respect to cache efficiency, which would increase user and/or system time. It wouldn't explain why only the real time is worse. Additional wait times and scheduling latencies would be an explanation. A stronger CPU affinity usually comes at the cost of the latter. The trick is to find a balance where the positive effects maximally outweigh the negative ones in almost all situations.

I don't know if scheduling_recorder and DebugAnalyzer do still work. They would be very handy for analyzing such a problem.

comment:5 Changed 5 years ago by pdziepak

Status: newin-progress

comment:6 Changed 5 years ago by pdziepak

Blocked By: 10487 added

comment:7 Changed 5 years ago by stippi

Tested hrev46810 for the effects of recent changes that seemed to have to do with this ticket. It got a little better and a little worse:

real    1m44.925s
user    2m24.451s
sys     0m37.439s

real    1m44.349s
user    2m24.526s
sys     0m37.569s

System time has improved yet again, while overall time for this task takes almost 5 seconds more since the initial merge. I am not even sure the observation that it has mainly to do with thread migration is even correct. That would mean that there should be one or two cores that are peaked during an interval, while others have nothing to do. But all my four cores have mostly nothing to do while the compile runs. Activity Monitor reports loads as low as 17 % during a half-second average for a core, and no core goes beyond 70% load.

Last edited 5 years ago by stippi (previous) (diff)

comment:8 Changed 5 years ago by stippi

I am almost convinced this has nothing to do with thread migration. First of all, it was news to me that the observed behavior is apparently hardware-specific, for other people it works just fine and the new scheduler is faster, some are in the same boat as me. Secondly, I am observing "short freezes", which I now think may be related: I mostly notice that while typing in Pe, sometimes the text caret does not react, and then several chars that I just typed in the last second or two appear at once... so somewhere, sometimes a delay happens that prevents threads from running when they should. Maybe the delay is always there but varies in length, with the average length having the observed effect on overall compile time, and the longest length giving a noticeable delay while typing.

My hardware is a Core i5 quad:

Intel(R) Core(TM) i5-3570 CPU @ 3.40GHz

family(6) model(58) stepping(9)

fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms

Anything else I should follow up on to help investigate?

comment:9 Changed 5 years ago by pulkomandy

a delay/lockup like this may be caused by other hardware in the machine (with interrupts, or waiting for disk access, or something else). Listdev output on affected machines may help pinning the problem.

comment:10 Changed 5 years ago by pdziepak

Well, thread migration definitely is not working properly when it has to deal with a large amount of short lived tasks. hrev46810 contains only work toward fixing that issue not a complete fix. Also, the fact that none of the cores reaches 100% load in a half-second average doesn't tell much what is really happening there. The scheduler operates on much shorter time intervals. I have several patches that help to mitigate this thread migration problem and I will push at least some of them later today.

"Short freezes" are rather more worrying issue. Are they system-wide or affect just one application? Or perhaps only mouse/keyboard input is affected? If you can enter KDL during one of such freezes it would be interesting to see if there are any running or ready threads, how they are assigned to cores and how cores are ranked (useful commands: "running", "ready", "run_queue", "cpu_heap").

comment:11 Changed 5 years ago by stippi

It is hard to tell whether the short freezes are related. I have definitely seen freezes in mouse events (1 - 3 secs) on this machine from time to time (before the scheduler merge). These are USB related, the syslog has entries similar to "USB port reset". It gets really bad when I connect my Wacom Intuos 2 tablet. This is definitely unrelated. The short delays while typing only happen very infrequently and I can say that I got aware of them after the scheduler merge, but they might have been there before and might be related to the mouse freezes. I can keep a tail -F syslog open the next time I do some Haiku work. That should clear up at least this issue.

My musings about the mysterious latency might not make much sense. I told you that when I run the WonderBrush prototype which uses as much render threads as there are CPU cores, I see those cores all pegged. They wouldn't be if threads are not always allowed to run when they can.

To think about the problem as something thread migration related still has some question marks for me. A new task should be scheduled immediately, even when it is short lived. It is either scheduled on a CPU with less load, or on a CPU which is already full, in which case it runs later than it could. If this theory is based on facts, then there should always be at least one CPU under full load. You are saying this may still be the case, only that Activity Monitor misleads by measuring load over such a long interval?

comment:12 Changed 5 years ago by pdziepak

There was a delay before thread migration and effect of that migration becoming visible in the load statistics. That caused scheduler in some circumstances to migrate virtually all active threads to a single core (the best fit at that specific moment) then, after updating load statistics, the scheduler would realize that there is an overloaded core and migrate most of the active threads to a single core that currently is the best fit. Basically, this would cause each core to be overloaded for a short period of time while the others are idle. Average load would look like none of the cores reaches 100% load.

The situation will eventually stabilize as load statistics updates were not synchronized in any way with thread migration and that's why the problem doesn't significantly affect long lived threads (e.g. that WonderBrush prototype).

Anyway, hrev46824 should make things at least a bit better.

comment:13 Changed 5 years ago by stippi

It definitely improved things quite a bit (second and third runs):

real    1m5.878s
user    2m25.134s
sys     0m41.105s

real    1m5.652s
user    2m25.156s
sys     0m40.971s

The effect is that load is spread across all cores much more consistently. However, there are some curious effects still to be observed. When the compile test begins, after jam collected targets, load jumps to between 80% and 90% on all cores, but slowly decreases over the duration of the test. It could be dependent on what exactly is compiling, as there is a short dip of load at a consistent time during the compile. I am attaching a screenshot of the Activity Monitor graphs.

Changed 5 years ago by stippi

Attachment: screenshot_clipping.png added

Activity Monitor graphs for hrev46825

comment:14 Changed 5 years ago by jua

As of hrev46944 scheduling_recorder works again. As Ingo pointed out, it might be really useful to find out what is happening here.

comment:15 Changed 5 years ago by pdziepak

I've noticed recently that the scheduler tends to keep one core unoccupied despite having correct core load statistics. That behaviour is not shown on the graph Stephan posted, but still, it is apparently some oversight in thread migration logic.

There is also another potential cause for these performance problems which, actually, has already worried me for some time, but I never looked into it more closely. When a core is chosen for a thread (in low latency mode) the scheduler first considers cores that are currently idle. The purpose of that is to allow the thread to run immediately. However, core with 90% is idle for 10% of the time and in worst-case scenario may be chosen by the scheduler even if there were much less occupied cores. The fix for that shouldn't be very complicated (testing the impact of that fix probably would take more time, though), all that needs to be done is not allow cores with a load above certain threshold in idle core list.

I am going to address these issues in a few days.

comment:16 Changed 5 years ago by pdziepak

hrev46947 and hrev46999 address some oversights that might have quite noticeable impact on the performance (namely, the problem with leaving one core unoccupied). I did quick UnixBench run to verify that there is no regression anywhere but I have not done any proper compile performance tests yet. There is still the other potential problem I mentioned in a previous post.

comment:17 Changed 5 years ago by stippi

These are the timings for hrev46947:

real    1m6.405s
user    2m25.816s
sys     0m41.047s

real    1m6.131s
user    2m25.741s
sys     0m40.925s

And these are the timings for hrev47005:

real    1m4.812s
user    2m26.094s
sys     0m41.554s

real    1m4.639s
user    2m26.000s
sys     0m41.384s

... so a slight improvement. :-)

comment:18 Changed 4 years ago by fishpond

The scheduling keeps my core I7 quad system unusable for any sort of media processing or games with sound. Around 20 seconds after booting up, menu navigation becomes unresponsive. Really looks as if especially small tasks get delayed for multiple seconds sometimes, including audio buffer transfers, keyboard and mouse click responses etc.. Building Haiku in parallel or other batch tasks with single-threaded parallel processes fly though.

Pretty sure I spotted one bug in scheduler.cpp - scheduler_set_thread_priority, line 184ff. where the test on the new priority being different to the current on occurs after the new on is set, so that the priority update never really happens.

Will look into other points but that should probably be fixed, simply by checking against oldPriority instead of thread->priority in line 184.

comment:19 in reply to:  18 Changed 4 years ago by korli

Replying to fishpond:

Pretty sure I spotted one bug in scheduler.cpp - scheduler_set_thread_priority, line 184ff. where the test on the new priority being different to the current on occurs after the new on is set, so that the priority update never really happens.

Reviewing the last commit that does affect this code (hrev46809), you're right, it seems the check is now wrong.

comment:20 Changed 3 years ago by fishpond

Problem is fixed for me - I suspect the fix to #12286 being responsible. I'd expect the fix to cover quite a range of existing tickets.

Totally changes usability of Haiku to the positive for me - thanks a lot.

comment:21 Changed 3 years ago by pulkomandy

#12286 fixed a regression introduced very recently (this month) in Haiku. So it's definitely not related to this. Good to know your issue is fixed, still.

Leaving this open, unless stippi can run the test again on the same hardware and confirm the performance is back to what it was before.

comment:22 Changed 3 years ago by fishpond

Hi Pulkomandy - sorry, more boots showed that neither #12237 ([launch_daemon] boot hangs at rocket icon) nor this one have really gone away. The strange thing was that on two boots that were successfully passing the rocket icon, the system was not showing stutter anymore. As far as I can see, the launch_daemon operates with interdependent control and worker threads, which would make it especially vulnerable to a thread not starting. So, I tend to suspect the two problems to be interrelated. I'll try some more things and will keep results posted here.

comment:23 Changed 5 months ago by waddlesplash

Fixed the check in hrev52094. Slight performance increase for me.

comment:24 Changed 5 months ago by stippi

I think the computer I am typing this on is still the same one. I need to build my own Haiku image in order for it to be comparable (KDEDUG=0). Then I need to check out the same revision of HaikuDepot used for the test and run a compile. I hope such old Haiku code bases still compile on the current Haiku. If I can't test it, we may as well close the ticket. Don't know when I will have time to try. Please leave open for now.

comment:25 Changed 5 months ago by waddlesplash

Well, even if you can't build said ancient revision, a similar test of building HaikuDepot on Haiku vs Linux may be worthwhile.

comment:26 Changed 5 months ago by stippi

Yes, but the meaningful comparission is between Haiku with the old scheduler and the current Haiku version, no? One could only extrapolate the old and new difference between Linux and Haiku, taking into account the difference between Linux then and now... :-) It wouldn't be a test quite as robust as simply comparing the two Haiku versions on the same hardware doing the exact same task.

comment:27 Changed 5 months ago by stippi

... even the gcc versions would need to be the same, actually.

Note: See TracTickets for help on using tickets.