Opened 5 years ago
Last modified 3 years ago
#15509 new bug
Haiku thread switching speed is about 4 times slower than Windows 7
Reported by: | X512 | Owned by: | nobody |
---|---|---|---|
Priority: | normal | Milestone: | Unscheduled |
Component: | System/Kernel | Version: | R1/Development |
Keywords: | | Cc: | |
Blocked By: | | Blocking: | |
Platform: | All | | |
Description (last modified by )
This is Haiku hrev53608 32 bit gcc2hybrid, real hardware.
I ran some benchmarks on Haiku and Windows. The benchmark uses two threads: a main thread and a worker thread. In each step the main thread sends a request to the worker thread and waits for its reply; the worker thread does some simple processing and sends the reply. The benchmark result is the number of steps per second. According to the profiler command, about 90% of the execution time is spent inside switch_sem_etc and release_sem_etc of kernel_x86. These functions should probably be optimized.
Single-core mode is also about 3 times faster than multi-core, but Windows behaves similarly.
I think fast thread switching is important for Haiku because Haiku uses threads heavily, for example a pair of threads (one in the application and one in app_server) for each window.
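For readers without the attachment, the request/reply loop described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the attached ThreadTests.cpp: it uses POSIX semaphores from <semaphore.h> instead of Haiku's native semaphore API, and the names PingPong, WorkerLoop and RunPingPong are made up for illustration.

```cpp
#include <semaphore.h>

#include <cerrno>
#include <chrono>
#include <thread>

// Hypothetical sketch of the benchmark shape, not the attached code.
struct PingPong {
    sem_t request;      // posted by main, waited on by the worker
    sem_t reply;        // posted by the worker, waited on by main
    long value = 0;     // stands in for the "simple processing"
    bool quit = false;  // set by main before the final wake-up
};

// sem_wait() can be interrupted by a signal; retry in that case.
static void Wait(sem_t* sem)
{
    while (sem_wait(sem) == -1 && errno == EINTR) {
    }
}

// Worker thread: wait for a request, do trivial work, send the reply.
static void WorkerLoop(PingPong* pp)
{
    for (;;) {
        Wait(&pp->request);
        if (pp->quit)
            break;
        pp->value++;
        sem_post(&pp->reply);
    }
}

// Runs `steps` round trips, stores how many the worker processed in
// *processed, and returns the measured steps per second.
double RunPingPong(long steps, long* processed)
{
    PingPong pp;
    sem_init(&pp.request, 0, 0);
    sem_init(&pp.reply, 0, 0);
    std::thread worker(WorkerLoop, &pp);

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < steps; i++) {
        sem_post(&pp.request);  // send request
        Wait(&pp.reply);        // block until the worker answers
    }
    std::chrono::duration<double> elapsed
        = std::chrono::steady_clock::now() - start;

    // Wake the worker one last time so it can see the quit flag.
    pp.quit = true;
    sem_post(&pp.request);
    worker.join();

    sem_destroy(&pp.request);
    sem_destroy(&pp.reply);
    *processed = pp.value;
    return steps / elapsed.count();
}
```

Calling RunPingPong(100000, &processed) from main() and printing the return value gives a steps-per-second figure. Every step forces two thread wake-ups, so the number mostly reflects how fast the kernel blocks and unblocks threads rather than the work done in the loop.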
Attachments (11)
Change History (33)
by , 5 years ago
Attachment: | ThreadTests.cpp added |
---|
by , 5 years ago
Attachment: | ThreadTests (1 core).prof added |
---|
Running "profiler -a ThreadTests" with only one core enabled.
by , 5 years ago
Attachment: | ThreadTests (2 cores).prof added |
---|
Running "profiler -a ThreadTests" with all cores enabled. Unrelated threads removed.
comment:1 by , 5 years ago
Description: | modified (diff) |
---|
comment:2 by , 5 years ago
32 bit gcc2hybrid, real hardware.
GCC 2 is a very old compiler that does not have good optimizations. Windows is of course compiled with a much newer compiler that can optimize significantly more, so even before actual code behavior is taken into account, this is a significant difference. Can you please retest on 64-bit Haiku?
by , 5 years ago
Attachment: | ThreadTests (64 bit, 1 core).prof added |
---|
Running "profiler -a ThreadTests" with only one core enabled. Unrelated threads removed.
by , 5 years ago
Attachment: | ThreadTests (64 bit, 4 cores).prof added |
---|
Running "profiler -a ThreadTests" with all cores enabled. Unrelated threads removed.
comment:3 by , 5 years ago
I tested on another machine with a 64-bit CPU and Windows 10 installed. Now the difference is significantly smaller: Windows 10 is about 1.1 to 1.25 times faster than Haiku. Multi-core is about 2.5 times slower than single-core. It would also be good to test 32-bit gcc4+ Haiku.
comment:4 by , 5 years ago
I am also interested in why multi-core is significantly slower than single-core. Is it some kind of hardware overhead from interaction between cores?
follow-up: 7 comment:5 by , 5 years ago
Semaphores are not the right tool to use on Haiku for this, because they are system-global (any other process can acquire and release them). Try your benchmark with pthread mutexes or a BLocker and you will get different results.
And yes, there are optimizations made when both threads are run on the same core. No cache invalidation problems, and there are many shortcuts one can take with atomic accesses and the like, both at the hardware and at the software level.
The performance problems of semaphores were already known back in BeOS, leading to constructs like the benaphore, which you could probably make use of as well: https://www.haiku-os.org/legacy-docs/benewsletter/Issue1-26.html#Engineering1-26
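For reference, the benaphore idea is an atomic counter in front of a semaphore, so the semaphore (and its syscall) is only touched when there is contention. The following is a minimal sketch under stated assumptions: it uses std::atomic and a POSIX sem_t instead of BeOS/Haiku atomic_add() and acquire_sem(), and the Benaphore class and CountWithBenaphore helper are illustrative names, not code from the newsletter or from BLocker.

```cpp
#include <semaphore.h>

#include <atomic>
#include <cerrno>
#include <thread>
#include <vector>

// Illustrative benaphore: atomic counter plus a semaphore used only
// when a second thread actually has to wait.
class Benaphore {
public:
    Benaphore() { sem_init(&fSem, 0, 0); }
    ~Benaphore() { sem_destroy(&fSem); }

    void Lock()
    {
        // fetch_add returns the previous count. If it was > 0, another
        // thread holds the lock and we have to block on the semaphore.
        if (fCount.fetch_add(1) > 0) {
            while (sem_wait(&fSem) == -1 && errno == EINTR) {
            }
        }
    }

    void Unlock()
    {
        // If the previous count was > 1, at least one thread is waiting
        // (or about to wait), so hand the lock over via the semaphore.
        if (fCount.fetch_sub(1) > 1)
            sem_post(&fSem);
    }

private:
    std::atomic<int> fCount{0};
    sem_t fSem;
};

// Small check: several threads increment a plain counter under the lock.
long CountWithBenaphore(int threads, long perThread)
{
    Benaphore lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; t++) {
        pool.emplace_back([&lock, &counter, perThread] {
            for (long i = 0; i < perThread; i++) {
                lock.Lock();
                counter++;
                lock.Unlock();
            }
        });
    }
    for (std::thread& thread : pool)
        thread.join();
    return counter;
}
```

In the uncontended case Lock() and Unlock() are each a single atomic operation and never enter the kernel. In a strict ping-pong between two threads almost every operation still has to block, so this construct alone would not be expected to change the benchmark much.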
follow-up: 8 comment:6 by , 5 years ago
Since it hasn't been mentioned here: did you build Haiku without debugging turned on? You would have to turn it off actively, as the default is a kernel debug build.
follow-up: 9 comment:7 by , 5 years ago
Replying to pulkomandy:
Semaphores are not the right tool to use on Haiku for this, because they are system-global (any other process can acquire and release them). Try your benchmark with pthread mutexes or a BLocker and you will get different results.
The pthread_mutex version does not work: pthread_mutex does not seem to support double locking or release from another thread. The <semaphore.h> version is slower than the native semaphore.
comment:8 by , 5 years ago
Replying to axeld:
Since it hasn't been mentioned here: did you build Haiku without debugging turned on? You would have to turn it off actively, as the default is a kernel debug build.
This is the default nightly-raw build. What options should I use for a release build? I can't find debug/release options in https://www.haiku-os.org/guides/building/.
comment:9 by , 5 years ago
Replying to X512:
The pthread_mutex version does not work: pthread_mutex does not seem to support double locking or release from another thread.
BLocker currently allows this, so I would suggest using that. I think it implements the "benaphore" trick I linked above.
follow-up: 12 comment:10 by , 5 years ago
Except, the "Benaphore" trick is only an optimization if the lock/semaphore will be often acquired and released by the same thread. In this case of two (or more) threads, the Benaphore is just a slight extra overhead and will not be an optimization at all...
comment:11 by , 5 years ago
This is the default nightly-raw build. What options should I use for a release build?
You need to change the KDEBUG_LEVEL to 0, in https://xref.plausible.coop/source/xref/haiku/build/config_headers/kernel_debug_config.h#8
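As a sketch of that change, assuming KDEBUG_LEVEL is a plain preprocessor define in that header (the rest of the file is not reproduced here and its other settings are left untouched):

```c
/* build/config_headers/kernel_debug_config.h (fragment only) */
/* Assumption: 0 disables the kernel debug code that is enabled by
   default in the nightly builds. */
#define KDEBUG_LEVEL 0
```

The kernel has to be rebuilt after the change for the benchmark to see any difference.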
comment:12 by , 5 years ago
Replying to waddlesplash:
Except, the "Benaphore" trick is only an optimization if the lock/semaphore will be often acquired and released by the same thread. In this case of two (or more) threads, the Benaphore is just a slight extra overhead and will not be an optimization at all...
It is an optimization for low-contention cases, which is any mutex that spends more time being released than locked. This is pretty much the case when you use a mutex for protecting a critical section, and not for waiting on another thread (for which a condition variable would make more sense). It also optimizes the no-lock path by saving a syscall whenever the mutex is "obviously" not locked, and is slower only when there is contention anyway (basically in the case where if you had attempted to lock a few CPU cycles earlier, you would have been waiting anyway).
So yes, maybe it doesn't help much in mutex benchmarks, but in all other cases, it's quite beneficial.
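To illustrate the point about condition variables being the better fit for waiting on another thread, here is a hedged sketch of the same request/reply handoff built on a mutex plus a condition variable. It uses the C++ standard library rather than pthreads or the Haiku kits directly, and Mailbox, Worker and RunHandoff are hypothetical names, not part of the attached benchmark.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Shared state; every field is protected by `lock`.
struct Mailbox {
    std::mutex lock;
    std::condition_variable changed;
    bool hasRequest = false;
    bool hasReply = false;
    bool quit = false;
    long value = 0;
};

// Worker: sleep until a request (or quit) arrives, process it, reply.
static void Worker(Mailbox* box)
{
    std::unique_lock<std::mutex> guard(box->lock);
    for (;;) {
        box->changed.wait(guard,
            [box] { return box->hasRequest || box->quit; });
        if (box->quit)
            break;
        box->hasRequest = false;
        box->value++;  // the "simple processing"
        box->hasReply = true;
        box->changed.notify_all();
    }
}

// Runs `steps` round trips and returns how many the worker processed.
long RunHandoff(long steps)
{
    Mailbox box;
    std::thread worker(Worker, &box);
    {
        std::unique_lock<std::mutex> guard(box.lock);
        for (long i = 0; i < steps; i++) {
            box.hasRequest = true;
            box.changed.notify_all();
            // wait() releases the mutex while sleeping, so the mutex is
            // only ever unlocked by the thread that locked it.
            box.changed.wait(guard, [&box] { return box.hasReply; });
            box.hasReply = false;
        }
        box.quit = true;
        box.changed.notify_all();
    }
    worker.join();
    return box.value;
}
```

Here the mutex only protects the shared flags and is always released by its owner, which avoids the "release from another thread" problem reported with the pthread_mutex version. Whether this is faster than native semaphores on Haiku is not something this sketch shows; it would have to be measured.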
comment:14 by , 5 years ago
I can't build Haiku with a gcc4+ compiler for the 32-bit x86 architecture.
jam -q @nightly-raw output:
DownloadLocatedFile1 download/gcc_syslibs_devel-5.4.0_2016_06_04-4-x86.hpkg
--2019-12-05 03:39:08-- https://eu.hpkg.haiku-os.org/haikuports/master/build-packages/6efb4aa89c1ea031aaa55b6d9925f30d61ab1c94c1450660cbd3cb9aa78d57d1/packages/gcc_syslibs_devel-5.4.0_2016_06_04-4-x86.hpkg
Resolving eu.hpkg.haiku-os.org... 104.248.198.131
Connecting to eu.hpkg.haiku-os.org|104.248.198.131|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2019-12-05 03:39:10 ERROR 404: Not Found.
comment:16 by , 5 years ago
Replying to waddlesplash:
Yes, we don't support this anymore. Just use x86_64.
In that case the performance issue should be fixed in the gcc2 build, because 32-bit x86 is the Haiku R1 target.
follow-up: 19 comment:18 by , 5 years ago
If you are comparing 64-bit Windows 7 and 32-bit Haiku, that's not a fair comparison; 64-bit will of course be faster due to the compiler's use of SSE2. And GCC2 is, as noted, a worse compiler, so again, I'm not sure there's much we can fix for 32-bit here.
comment:19 by , 5 years ago
Replying to waddlesplash:
If you are comparing 64-bit Windows 7 and 32-bit Haiku, that's not a fair comparison; 64-bit will of course be faster due to the compiler's use of SSE2.
This is a 32-bit-only PC. Windows 7 is 32-bit.
And GCC2 is, as noted, a worse compiler, so again, I'm not sure there's much we can fix for 32-bit here.
There is a chance that the code related to thread switching can be optimized for GCC2.
comment:20 by , 5 years ago
Tests made on a 64-bit PC also show that Haiku is slower than Windows, so there is room for optimization even with modern GCC.
comment:21 by , 5 years ago
We have no intention of trying to optimize the GCC2 builds.
No, the 64-bit nightlies are, as already mentioned, debug builds. Running a non-KDEBUG build will probably close the remaining gap.
comment:22 by , 3 years ago
Retested on hrev55486 (steps per second):
x86 4 cores: 42831.82729
x86_64 4 cores: 48412.98648
x86 1 core: 296494.5128
x86_64 1 core: 420672.8393
Benchmark code for Haiku.
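The hrev55486 figures from comment 22 can be turned into single-core versus multi-core ratios with a few lines of arithmetic. The sketch below assumes the figures are steps per second, as defined in the ticket description; the Result struct and function names are only illustrative.

```cpp
#include <cstdio>

// Figures copied from comment 22 (hrev55486).
struct Result {
    const char* arch;
    double fourCores;
    double oneCore;
};

static const Result kResults[] = {
    {"x86", 42831.82729, 296494.5128},
    {"x86_64", 48412.98648, 420672.8393},
};

// How many times faster the single-core run is than the 4-core run.
double SingleToMultiRatio(const Result& result)
{
    return result.oneCore / result.fourCores;
}

// Prints one ratio line per architecture.
void PrintRatios()
{
    for (const Result& result : kResults) {
        std::printf("%s: single core is %.2fx the 4-core rate\n",
            result.arch, SingleToMultiRatio(result));
    }
}
```

On these numbers the single-core run is roughly 6.9 times faster on x86 and roughly 8.7 times faster on x86_64, a noticeably larger gap than the factors of about 3 and 2.5 reported earlier in the ticket on hrev53608.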