Opened 4 years ago

Last modified 3 years ago

#15509 new bug

Haiku thread switching speed is about 4 times slower than Windows 7

Reported by: X512 Owned by: nobody
Priority: normal Milestone: Unscheduled
Component: System/Kernel Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description (last modified by X512)

This is Haiku hrev53608 32 bit gcc2hybrid, real hardware.

I ran some benchmarks on Haiku and Windows. The benchmark uses two threads: a main thread and a worker thread. In each step, the main thread sends a request to the worker thread and waits for its reply. The worker thread does some simple processing and sends the reply. The benchmark result is the number of steps per second. According to the profiler command, about 90% of execution time is spent inside switch_sem_etc and release_sem_etc in kernel_x86. These functions should probably be optimized.

Also, single-core mode is about 3 times faster than multi-core, but Windows behaves similarly.

I think fast thread switching is required for Haiku because Haiku uses threads heavily, for example a pair of threads (one in the application, one in app_server) for each window.
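A minimal sketch of the request/reply loop described above, written in portable C++. This is not the attached ThreadTests.cpp: the `Semaphore` class and `RunPingPong` function are hypothetical names, and the semaphore is built on `std::mutex` and `std::condition_variable` as a stand-in for the native Haiku or Windows primitives, so absolute numbers will differ from the ticket's.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical portable semaphore; stands in for the native semaphores
// used by the real benchmarks.
class Semaphore {
public:
    void Release() {
        {
            std::lock_guard<std::mutex> lock(fMutex);
            ++fCount;
        }
        fCondition.notify_one();
    }

    void Acquire() {
        std::unique_lock<std::mutex> lock(fMutex);
        fCondition.wait(lock, [this] { return fCount > 0; });
        --fCount;
    }

private:
    std::mutex fMutex;
    std::condition_variable fCondition;
    int fCount = 0;
};

// Runs `steps` request/reply round trips between the calling thread and a
// worker thread. Returns steps per second; stores the number of requests
// the worker handled in *processed.
double RunPingPong(long steps, long* processed) {
    assert(steps > 0);
    Semaphore request, reply;
    long counter = 0;

    std::thread worker([&] {
        for (long i = 0; i < steps; i++) {
            request.Acquire();  // wait for the main thread's request
            counter++;          // the "simple processing"
            reply.Release();    // send the reply
        }
    });

    auto start = std::chrono::steady_clock::now();
    for (long i = 0; i < steps; i++) {
        request.Release();      // send a request to the worker
        reply.Acquire();        // block until the worker replies
    }
    std::chrono::duration<double> elapsed
        = std::chrono::steady_clock::now() - start;

    worker.join();
    *processed = counter;
    return steps / elapsed.count();
}
```

Calling `RunPingPong(100000, &n)` returns the steps-per-second figure; every step forces two thread switches, which is why the result is dominated by the cost of blocking and waking threads.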

Attachments (11)

ThreadTests.cpp (1.2 KB ) - added by X512 4 years ago.
Benchmark code for Haiku.
Main.c (1.4 KB ) - added by X512 4 years ago.
Benchmark code for Windows.
ThreadTests (1 core).prof (1.7 KB ) - added by X512 4 years ago.
Running "profiler -a ThreadTests" with only one core enabled.
ThreadTests (2 cores).prof (2.7 KB ) - added by X512 4 years ago.
Running "profiler -a ThreadTests" with all cores enabled. Unrelated threads removed.
Benchmark.png (3.4 KB ) - added by X512 4 years ago.
Benchmark bar graph.
Benchmark.txt (460 bytes ) - added by X512 4 years ago.
Benchmark results.
Benchmark 64 bit.png (3.1 KB ) - added by X512 4 years ago.
Benchmark bar graph.
ThreadTests (64 bit, 1 core).prof (1.3 KB ) - added by X512 4 years ago.
Running "profiler -a ThreadTests" with only one core enabled. Unrelated threads removed.
ThreadTests (64 bit, 4 cores).prof (1.5 KB ) - added by X512 4 years ago.
Running "profiler -a ThreadTests" with all cores enabled. Unrelated threads removed.
image.png (5.5 KB ) - added by X512 3 years ago.
PC, hrev55486
TeamTests.cpp (1.3 KB ) - added by X512 4 months ago.
Context switch benchmark between 2 teams.


Change History (33)

by X512, 4 years ago

Attachment: ThreadTests.cpp added

Benchmark code for Haiku.

by X512, 4 years ago

Attachment: Main.c added

Benchmark code for Windows.

by X512, 4 years ago

Attachment: ThreadTests (1 core).prof added

Running "profiler -a ThreadTests" with only one core enabled.

by X512, 4 years ago

Attachment: ThreadTests (2 cores).prof added

Running "profiler -a ThreadTests" with all cores enabled. Unrelated threads removed.

by X512, 4 years ago

Attachment: Benchmark.png added

Benchmark bar graph.

comment:1 by X512, 4 years ago

Description: modified (diff)

comment:2 by waddlesplash, 4 years ago

32 bit gcc2hybrid, real hardware.

GCC 2 is a very old compiler that does not have good optimizations. Windows is of course compiled with a much newer compiler that can optimize significantly more, so even before actual code behavior is taken into account, this is a significant difference. Can you please retest on 64-bit Haiku?

by X512, 4 years ago

Attachment: Benchmark.txt added

Benchmark results.

by X512, 4 years ago

Attachment: Benchmark 64 bit.png added

Benchmark bar graph.

by X512, 4 years ago

Attachment: ThreadTests (64 bit, 1 core).prof added

Running "profiler -a ThreadTests" with only one core enabled. Unrelated threads removed.

by X512, 4 years ago

Attachment: ThreadTests (64 bit, 4 cores).prof added

Running "profiler -a ThreadTests" with all cores enabled. Unrelated threads removed.

comment:3 by X512, 4 years ago

I tested on another machine with a 64-bit CPU and Windows 10 installed. The difference is now significantly smaller: Windows 10 is about 1.1 to 1.25 times faster than Haiku. Multi-core is about 2.5 times slower than single-core. It would also be good to test 32-bit gcc4+ Haiku.

comment:4 by X512, 4 years ago

I am also interested in why multi-core is significantly slower than single-core. Is it some kind of hardware overhead from interaction between cores?

Last edited 4 years ago by X512

comment:5 by pulkomandy, 4 years ago

Semaphores are not the right tool to use on Haiku for this, because they are system-global (any other process can acquire and release them). Try your benchmark with pthread mutexes or a BLocker and you will get different results.

And yes, there are optimizations made when both threads are run on the same core. No cache invalidation problems, and there are many shortcuts one can take with atomic accesses and the like, both at the hardware and at the software level.

The performance problems of semaphores were already known back in BeOS, leading to constructs like the benaphore, which you could probably make use of as well: https://www.haiku-os.org/legacy-docs/benewsletter/Issue1-26.html#Engineering1-26
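A rough sketch of the benaphore idea mentioned above: an atomic counter in front of a semaphore, so that an uncontended lock or unlock never touches the semaphore. It is written in portable C++; `SlowSemaphore`, `Benaphore`, `SlowPaths` and `CountWithBenaphore` are hypothetical names, and the condition-variable semaphore only stands in for a kernel semaphore. It is not the BLocker implementation.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in for an expensive kernel semaphore (assumption for illustration).
class SlowSemaphore {
public:
    void Release() {
        {
            std::lock_guard<std::mutex> lock(fMutex);
            ++fCount;
        }
        fCondition.notify_one();
    }

    void Acquire() {
        std::unique_lock<std::mutex> lock(fMutex);
        fCondition.wait(lock, [this] { return fCount > 0; });
        --fCount;
    }

private:
    std::mutex fMutex;
    std::condition_variable fCondition;
    int fCount = 0;
};

class Benaphore {
public:
    void Lock() {
        // fetch_add returns the previous value; nonzero means the lock is
        // already held, so only then do we fall back to the semaphore.
        if (fCount.fetch_add(1) > 0) {
            fSlowPaths++;
            fSem.Acquire();
        }
    }

    void Unlock() {
        // A previous value above 1 means at least one thread is waiting.
        if (fCount.fetch_sub(1) > 1)
            fSem.Release();
    }

    // Number of Lock() calls that had to block on the semaphore.
    int SlowPaths() const { return fSlowPaths.load(); }

private:
    std::atomic<int> fCount{0};
    std::atomic<int> fSlowPaths{0};
    SlowSemaphore fSem;
};

// Increments a shared counter from several threads under the benaphore.
long CountWithBenaphore(int threads, int iterations, Benaphore& lock) {
    assert(threads > 0);
    long total = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; t++) {
        pool.emplace_back([&] {
            for (int i = 0; i < iterations; i++) {
                lock.Lock();
                total++;
                lock.Unlock();
            }
        });
    }
    for (auto& thread : pool)
        thread.join();
    return total;
}
```

A single thread locking and unlocking never reaches the semaphore, so `SlowPaths()` stays at zero; under contention the slow path is taken and the semaphore cost returns.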

comment:6 by axeld, 4 years ago

Since it hasn't been mentioned here: did you build Haiku without debugging turned on? You would have to turn it off actively, as the default is a kernel debug build.

in reply to:  5 ; comment:7 by X512, 4 years ago

Replying to pulkomandy:

Semaphores are not the right tool to use on Haiku for this, because they are system-global (any other process can acquire and release them). Try your benchmark with pthread mutexes or a BLocker and you will get different results.

The pthread_mutex version does not work: pthread_mutex does not seem to support double locking or release from another thread. The <semaphore.h> version is slower than the native semaphore.
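The restriction described above can be made visible with an error-checking mutex. The sketch below assumes POSIX semantics for `PTHREAD_MUTEX_ERRORCHECK`, where relocking from the owner reports `EDEADLK` and unlocking from a non-owner reports `EPERM`; whether Haiku's implementation reports exactly these codes was not verified here. `RelockFromSameThread` and `UnlockFromOtherThread` are hypothetical helper names.

```cpp
#include <cassert>
#include <cerrno>
#include <pthread.h>
#include <thread>

// Creates an error-checking mutex so that misuse returns an error code
// instead of being undefined behaviour.
static void InitErrorCheckMutex(pthread_mutex_t* mutex) {
    pthread_mutexattr_t attr;
    int status = pthread_mutexattr_init(&attr);
    assert(status == 0);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    status = pthread_mutex_init(mutex, &attr);
    assert(status == 0);
    pthread_mutexattr_destroy(&attr);
    (void)status;
}

// Locks the mutex twice from the same thread; returns the second result.
int RelockFromSameThread() {
    pthread_mutex_t mutex;
    InitErrorCheckMutex(&mutex);
    pthread_mutex_lock(&mutex);
    int result = pthread_mutex_lock(&mutex);   // second lock by the owner
    pthread_mutex_unlock(&mutex);
    pthread_mutex_destroy(&mutex);
    return result;
}

// Locks in the calling thread, then tries to unlock from another thread;
// returns the result of that foreign unlock.
int UnlockFromOtherThread() {
    pthread_mutex_t mutex;
    InitErrorCheckMutex(&mutex);
    pthread_mutex_lock(&mutex);                // owned by the calling thread
    int result = 0;
    std::thread other([&] { result = pthread_mutex_unlock(&mutex); });
    other.join();
    pthread_mutex_unlock(&mutex);              // the owner releases it
    pthread_mutex_destroy(&mutex);
    return result;
}
```

With the default mutex type both operations are undefined, which is why a mutex cannot replace a semaphore in a request/reply handoff where one thread "locks" and the other "unlocks".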

in reply to:  6 comment:8 by X512, 4 years ago

Replying to axeld:

Since it hasn't been mentioned here: did you build Haiku without debugging turned on? You would have to turn it off actively, as the default is a kernel debug build.

This is the default nightly-raw build. Which options should I use for a release build? I can't find debug/release options in https://www.haiku-os.org/guides/building/.

in reply to:  7 comment:9 by pulkomandy, 4 years ago

Replying to X512:

The pthread_mutex version does not work: pthread_mutex does not seem to support double locking or release from another thread.

BLocker currently allows this, so I would suggest using that. I think it implements the "benaphore" trick I linked above.

comment:10 by waddlesplash, 4 years ago

Except, the "Benaphore" trick is only an optimization if the lock/semaphore will often be acquired and released by the same thread. In this case of two (or more) threads, the Benaphore is just a slight extra overhead and will not be an optimization at all...

comment:11 by waddlesplash, 4 years ago

This is the default nightly-raw build. Which options should I use for a release build?

You need to change the KDEBUG_LEVEL to 0, in https://xref.plausible.coop/source/xref/haiku/build/config_headers/kernel_debug_config.h#8

in reply to:  10 comment:12 by pulkomandy, 4 years ago

Replying to waddlesplash:

Except, the "Benaphore" trick is only an optimization if the lock/semaphore will often be acquired and released by the same thread. In this case of two (or more) threads, the Benaphore is just a slight extra overhead and will not be an optimization at all...

It is an optimization for low-contention cases, which is any mutex that spends more time being released than locked. This is pretty much the case when you use a mutex for protecting a critical section, and not for waiting on another thread (for which a condition variable would make more sense). It also optimizes the no-lock path by saving a syscall whenever the mutex is "obviously" not locked, and is slower only when there is contention anyway (basically in the case where if you had attempted to lock a few CPU cycles earlier, you would have been waiting anyway).

So yes, maybe it doesn't help much in mutex benchmarks, but in all other cases, it's quite beneficial.
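For the "waiting on another thread" case, the condition-variable approach mentioned above could look roughly like this. It is a sketch in portable C++, not code from the ticket: `Mailbox` and `SumOfReplies` are hypothetical names, and the one-slot mailbox relies on the strict request/reply alternation of the benchmark.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// One-slot mailbox: the mutex protects the slot, the condition variable
// is what the receiving thread actually waits on.
class Mailbox {
public:
    void Post(int value) {
        {
            std::lock_guard<std::mutex> lock(fMutex);
            assert(!fFull);  // strict ping-pong: the slot must be empty
            fValue = value;
            fFull = true;
        }
        fCondition.notify_one();
    }

    int Take() {
        std::unique_lock<std::mutex> lock(fMutex);
        fCondition.wait(lock, [this] { return fFull; });
        fFull = false;
        return fValue;
    }

private:
    std::mutex fMutex;
    std::condition_variable fCondition;
    int fValue = 0;
    bool fFull = false;
};

// The main thread sends 1..steps; the worker doubles each request and
// replies. Returns the sum of all replies.
long SumOfReplies(int steps) {
    Mailbox requests, replies;
    std::thread worker([&] {
        for (int i = 0; i < steps; i++)
            replies.Post(requests.Take() * 2);
    });

    long sum = 0;
    for (int i = 1; i <= steps; i++) {
        requests.Post(i);
        sum += replies.Take();
    }
    worker.join();
    return sum;
}
```

Here the mutex is only held for the few instructions that touch the slot, which is the low-contention pattern described above, while the blocking happens on the condition variable.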

comment:13 by waddlesplash, 4 years ago

Of course, I was just remarking about this one case.

comment:14 by X512, 4 years ago

I can't build Haiku with a gcc4+ compiler for the 32-bit x86 architecture.

jam -q @nightly-raw output:

DownloadLocatedFile1 download/gcc_syslibs_devel-5.4.0_2016_06_04-4-x86.hpkg 
--2019-12-05 03:39:08--  https://eu.hpkg.haiku-os.org/haikuports/master/build-packages/6efb4aa89c1ea031aaa55b6d9925f30d61ab1c94c1450660cbd3cb9aa78d57d1/packages/gcc_syslibs_devel-5.4.0_2016_06_04-4-x86.hpkg
Resolving eu.hpkg.haiku-os.org... 104.248.198.131
Connecting to eu.hpkg.haiku-os.org|104.248.198.131|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2019-12-05 03:39:10 ERROR 404: Not Found.

comment:15 by waddlesplash, 4 years ago

Yes, we don't support this anymore. Just use x86_64.

in reply to:  15 comment:16 by X512, 4 years ago

Replying to waddlesplash:

Yes, we don't support this anymore. Just use x86_64.

In that case the performance issue should be fixed in the gcc2 build, because 32-bit x86 is a Haiku R1 target.

comment:17 by pulkomandy, 4 years ago

Both architectures will be supported for R1.

comment:18 by waddlesplash, 4 years ago

If you are comparing 64-bit Windows 7 and 32-bit Haiku, that's not a fair comparison: 64-bit will of course be faster due to the compiler's use of SSE2. And GCC2 is, as noted, a worse compiler, so again, I'm not sure there's much we can fix for 32-bit here.

in reply to:  18 comment:19 by X512, 4 years ago

Replying to waddlesplash:

If you are comparing 64-bit Windows 7 and 32-bit Haiku, that's not a fair comparison: 64-bit will of course be faster due to the compiler's use of SSE2.

This is a 32-bit-only PC. Windows 7 is 32-bit.

And GCC2 is, as noted, a worse compiler, so again, I'm not sure there's much we can fix for 32-bit here.

There is a chance that the code related to thread switching can be optimized for GCC2.

comment:20 by X512, 4 years ago

Tests on a 64-bit PC also show that Haiku is slower than Windows, so there is room for optimization even with a modern GCC.

comment:21 by waddlesplash, 4 years ago

We have no intention of trying to optimize the GCC2 builds.

No, the 64-bit nightlies are, as already mentioned, debug builds. Running a non-KDEBUG build will probably close the remaining gap.

by X512, 3 years ago

Attachment: image.png added

PC, hrev55486

comment:22 by X512, 3 years ago

Retested on hrev55486 (steps per second):

x86 4 cores: 42831.82729
x86_64 4 cores: 48412.98648
x86 1 core: 296494.5128
x86_64 1 core: 420672.8393
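To compare these figures with the ratios quoted earlier in the ticket, the arithmetic can be sketched as below. The constants are the four values reported above; `Ratio` and `PrintRatios` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdio>

// Steps per second reported for hrev55486 in the comment above.
const double kX86FourCores = 42831.82729;
const double kX8664FourCores = 48412.98648;
const double kX86OneCore = 296494.5128;
const double kX8664OneCore = 420672.8393;

// How many times larger `numerator` is than `denominator`.
double Ratio(double numerator, double denominator) {
    assert(denominator > 0);
    return numerator / denominator;
}

// Prints the single-core vs multi-core and 64-bit vs 32-bit ratios.
void PrintRatios() {
    std::printf("x86:    1 core vs 4 cores: %.2fx\n",
        Ratio(kX86OneCore, kX86FourCores));
    std::printf("x86_64: 1 core vs 4 cores: %.2fx\n",
        Ratio(kX8664OneCore, kX8664FourCores));
    std::printf("4 cores: x86_64 vs x86:    %.2fx\n",
        Ratio(kX8664FourCores, kX86FourCores));
    std::printf("1 core:  x86_64 vs x86:    %.2fx\n",
        Ratio(kX8664OneCore, kX86OneCore));
}
```

On these numbers single-core comes out at roughly 6.9 times (x86) and 8.7 times (x86_64) the four-core throughput, and x86_64 at roughly 1.1 to 1.4 times the x86 throughput.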

by X512, 4 months ago

Attachment: TeamTests.cpp added

Context switch benchmark between 2 teams.
