Opened 10 years ago

Closed 10 years ago

#11482 closed bug (fixed)

pthreads: possible race condition leading to deadlock

Reported by: jessicah
Owned by: axeld
Priority: normal
Milestone: R1
Component: System/libroot.so
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking:
Platform: All

Description

I'm working on upstreaming Haiku support for Boost, and am running into a reproducible deadlock for the Boost.Interprocess module.

My current work can be found at https://github.com/jessicah/boost

I think git clone --recursive https://github.com/jessicah/boost.git should do the right thing. Otherwise, you'll also need to grab the build, config, predef, thread, filesystem, and interprocess submodules from my GitHub.

Steps to reproduce:

./bootstrap.sh
./b2 --without-mpi --enable-parallel-mark inlining=on threading=multi variant=debug link=static,shared runtime-link=shared --without-python -j<N>
cd libs/interprocess
../../b2 --without-mpi --enable-parallel-mark inlining=on threading=multi variant=debug link=static,shared runtime-link=shared --without-python -j<N> -a -q test

Eventually, several tests end up deadlocked: condition_test, condition_any_test, named_condition_test, and named_condition_any_test.

If I attach Debugger to any of these tests, I can break the deadlock by debugging all currently running threads, then resuming the test thread (this has the pthread_join call in the stack trace), then resuming the other threads. If I instead resume the other threads first, the deadlock remains.

The named tests sometimes require repeating the process, but will eventually resume.

Attachments (2)

syslog.txt (7.3 KB) - added by jessicah 10 years ago.
kernel debug output during deadlock
0001-user-mutex-dequeue-waiters-when-waking-them-up.patch (3.2 KB) - added by hamish 10 years ago.

Change History (8)

by jessicah, 10 years ago

Attachment: syslog.txt added

kernel debug output during deadlock

comment:1 by jessicah, 10 years ago

Mm, might not even be in pthread_join, but the condition variables themselves. Hitting debug/run in Debugger for each pthread_func thread releases them.

Attached some KDL output in case this might help. I can run further tests if required.

Also, shouldn't LD_PRELOAD=/system/lib/x86/libroot_debug.so <command to run> give me debug symbols in Debugger for functions like pthread_join, etc? Or do I need to rebuild Haiku with extra debugging options enabled?

comment:2 by pulkomandy, 10 years ago

Mh, github with submodules makes it hard to track the changes :(

libroot_debug does not come with debugging information. It's a version of libroot with extra debug checks (guarded memory allocator, etc).

To compile libroot with debug information, you need to add this to build/jam/UserBuildConfig:

 SetConfigVar DEBUG : HAIKU_TOP src system libroot : 1 : global ;

Then recompile it. The library is output in generated/objects/haiku/x86_gcc2/debug_1/ (and is also made part of the built haiku image).

comment:3 by jessicah, 10 years ago

The only thing I can see is that all three threads are blocked, calling thread_block. Since none of them use thread_block_with_timeout, I don't really see how any of the threads could possibly hope to progress, which would explain the deadlock. That's about all the sense I can make out of the situation.

The same tests under Linux don't exhibit this behaviour, FYI. The tests there all succeed without issue.

comment:4 by hamish, 10 years ago

patch: 0 → 1

comment:5 by hamish, 10 years ago

This seems to be a missed wake-up problem. In the current user_mutex implementation, the waking thread wakes the waiting one, but leaves it on the queue. The woken-up thread eventually runs and dequeues itself. For condition variables, this means multiple signals can just keep on waking up the same thread, until that thread eventually runs and dequeues itself.
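To make the consequence for condition variables concrete, here is a minimal sketch of the usage pattern that depends on each signal reaching a different waiter. It is not the Boost test code; run_waiters, waiter, tokens and consumed are names made up for this illustration. One token is published per signal, so if two signals land on the same still-queued thread, one token is never picked up and pthread_join blocks forever.

```c
#include <assert.h>
#include <pthread.h>

#define MAX_WAITERS 16

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int tokens = 0;    /* items published by the signalling thread */
static int consumed = 0;  /* items picked up by the waiters */

/* Each waiter sleeps until a token is available, then takes exactly one. */
static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (tokens == 0)
        pthread_cond_wait(&cond, &lock);
    tokens--;
    consumed++;
    pthread_mutex_unlock(&lock);
    return NULL;
}

/*
 * Hypothetical helper: start `count` waiters, publish one token per
 * waiter with one pthread_cond_signal each, then join them all.
 * Returns the number of tokens consumed.
 */
int run_waiters(int count)
{
    pthread_t threads[MAX_WAITERS];
    int started = 0;

    assert(count >= 0 && count <= MAX_WAITERS);

    pthread_mutex_lock(&lock);
    tokens = 0;
    consumed = 0;
    pthread_mutex_unlock(&lock);

    for (int i = 0; i < count; i++) {
        if (pthread_create(&threads[started], NULL, waiter, NULL) != 0)
            break;
        started++;
    }

    /* One signal per token: each signal is expected to wake a new waiter. */
    for (int i = 0; i < started; i++) {
        pthread_mutex_lock(&lock);
        tokens++;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }

    /* With a missed wake-up, at least one of these joins never returns. */
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    return consumed;
}
```

On a correct implementation run_waiters(n) returns n. The predicate loop means it does not matter whether a waiter reaches pthread_cond_wait before or after its signal is sent.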

Attached is a patch to address the issue (dequeue at the waker side instead; a waiter only dequeues itself if it was interrupted or timed out). It fixes the Boost condvar tests, and the POSIX test suite tests still pass. I would appreciate it if someone else could take a look and tell me if I'm doing something wrong here though :)
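The difference between the two dequeue strategies can be shown with a small single-threaded model. This is only a sketch and not the user_mutex code from the patch: struct waiter, signal_leave_queued, signal_and_dequeue and count_woken are hypothetical names, and the model assumes the worst case where no woken thread gets to run between signals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WAITERS 8

/* Simplified wait-queue entry (hypothetical, for illustration only). */
struct waiter {
    bool queued;  /* still on the wait queue */
    bool woken;   /* has received a wake-up */
};

/* Old behaviour: wake the first queued waiter but leave it on the queue. */
static void signal_leave_queued(struct waiter *w, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (w[i].queued) {
            w[i].woken = true;
            return;
        }
    }
}

/* Patched behaviour: the waker removes the waiter it wakes. */
static void signal_and_dequeue(struct waiter *w, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (w[i].queued) {
            w[i].queued = false;
            w[i].woken = true;
            return;
        }
    }
}

/*
 * Send `signals` wake-ups to `waiters` queued threads and return how many
 * distinct waiters were woken. No waiter runs in between, so under the old
 * scheme nobody gets a chance to dequeue itself.
 */
size_t count_woken(size_t waiters, size_t signals, bool waker_dequeues)
{
    struct waiter w[MAX_WAITERS];
    size_t woken = 0;

    assert(waiters <= MAX_WAITERS);

    for (size_t i = 0; i < waiters; i++) {
        w[i].queued = true;
        w[i].woken = false;
    }
    for (size_t s = 0; s < signals; s++) {
        if (waker_dequeues)
            signal_and_dequeue(w, waiters);
        else
            signal_leave_queued(w, waiters);
    }
    for (size_t i = 0; i < waiters; i++) {
        if (w[i].woken)
            woken++;
    }
    return woken;
}
```

In this model, three signals sent to three waiters wake only one thread when the waiter is left on the queue, and all three when the waker dequeues it.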

comment:6 by jessicah, 10 years ago

Resolution: fixed
Status: new → closed

Applied by hamishm in hrev49149.
