Opened 3 years ago

Closed 11 months ago

Last modified 11 months ago

#13081 closed bug (fixed)

Can only run any program a limited number of times before hanging

Reported by: AGMS Owned by: mmlr
Priority: normal Milestone: Unscheduled
Component: System Version: R1/Development
Keywords: relocate_image, system() Cc: agmsmith@…
Blocked By: Blocking:
Has a Patch: no Platform: x86

Description

If you repeatedly run a program, whether via system() or load_image(), you only get a few hundred runs before the program hangs. Or rather, it seems the child program hangs in relocate_image while it is being loaded, before it's main thread starts.

See the attached stress test program example (redirect standard output to a file to make it more readable). It repeatedly runs /bin/date, with a pause between iterations. Expect to get less than 1000 runs.

This may explain why Haiku gets stuck if left running long enough.

Tested in hrev50678, also affects earlier versions.

Attachments (4)

SystemForkTest.cpp (1.1 KB ) - added by AGMS 3 years ago.
Stress test program for demonstrating the bug.
RelocateImageBug.jpeg (105.1 KB ) - added by AGMS 3 years ago.
Debugger running on the child /bin/date, showing where it's stuck during load.
20181226 Debugger 13081 Screen Shot.png (89.6 KB ) - added by AGMS 11 months ago.
User land debugger showing the syscall of the stuck team.
20181226 KDL 13081 Screen Shot.jpeg (213.9 KB ) - added by AGMS 11 months ago.
Backtrace in KDL of the stuck process.

Download all attachments as: .zip

Change History (21)

by AGMS, 3 years ago

Attachment: SystemForkTest.cpp added

Stress test program for demonstrating the bug.

by AGMS, 3 years ago

Attachment: RelocateImageBug.jpeg added

Debugger running on the child /bin/date, showing where it's stuck during load.

comment:1 by AGMS, 3 years ago

It happens less often on real hardware. I was using VirtualBox on a Linux system with background CPU usage at full use by another program (running at a "nice" low priority).

I found that running about 10 parallel tests (10 Terminal windows open, stdout redirected to /dev/null) does make it happen faster on real hardware. Same failure point in relocate_image.

By the way, is program loading single threaded? I can open several windows to run the test on a 4 core machine, but it maxes out at 1 core equivalent of CPU usage.

It would also be nice if load_image() returned after you kill a partially loaded but not yet running dead child program. Right now the parent remains stuck even after the child is killed and disappears from the ProcessController thread listing.

comment:2 by AGMS, 3 years ago

I'm not sure it's related, but there are also millions of page faults happening. But no hard drive activity.

comment:3 by korli, 3 years ago

I failed to reproduce on hrev50773 x86_64, qemu 2CPUs 10 parallel tests (system() and load_image()), 1000 runs.

in reply to:  3 comment:4 by AGMS, 3 years ago

Replying to korli:

I failed to reproduce on hrev50773 x86_64, qemu 2CPUs 10 parallel tests (system() and load_image()), 1000 runs.

You're right, I couldn't get it to happen in x86_64 either. Though the forking/execute program speed is slow, over 10 minutes, with 10 windows open, got up to 8000ish iterations, so that's 80000 forks in 10 minutes or 8000 forks per minute or 133 forks per second.

With 32 bit Haiku, it's a bit faster (10000 interations in 10 minutes, or 166 forks per second), until it stops working (first one cut out at just over 5000 iterations on a bare hardware test, hrev50765, Intel 4820K i7 4 core CPU at 3.7Ghz).

And with a recent Linux on the same hardware, it zips along at 9000 forks per second.

comment:5 by korli, 3 years ago

Platform: Allx86

You write "It happens less often on real hardware", I assume it's easier to reproduce in a virtual machine, or not?

in reply to:  5 comment:6 by AGMS, 3 years ago

It's easier to recreate in a VM than on real hardware, probably because it's slower and that reveals a race condition in a shorter time. Even easier if the host computer is running other programs in the background which slows it down more.

comment:7 by AGMS, 16 months ago

Just confirmed that the bug is still there in hrev52139, using gcc2 on real hardware.

comment:8 by AGMS, 11 months ago

In hrev52689 GCC2 it happens differently this time. Rather than dying in relocate_image, it stops in load_image. Seems to be running, but there's no output happening. It also seemed to happen less often (just one process died, not more). But then I was running with 1 CPU core in VirtualBox (I'll test on real hardware with 4 cores soon). I did a KDL backtrace on it, and a debugger trace on the syscall (see attached screen shots). Oddly, it was able to start running again by using the debugger Run button. Maybe this one is some sort of scheduler bug?

by AGMS, 11 months ago

User land debugger showing the syscall of the stuck team.

by AGMS, 11 months ago

Backtrace in KDL of the stuck process.

comment:9 by AGMS, 11 months ago

I can confirm that it happens on real hardware too, with the same stall in scheduling (hitting Run in the debugger gets it going again). Had it happen to 2 out of 12 parallel tests. 4 core CPU.

comment:10 by waddlesplash, 11 months ago

That's suspending here: http://xref.plausible.coop/source/xref/haiku/src/system/kernel/team.cpp#1842

It's strange that this occurs on GCC2 but not GCC7. It would be interesting to see a disassembly of this function; it's possible GCC2 is incorrectly optimizing it in some way.

comment:11 by AGMS, 11 months ago

Maybe need to declare team_loading_info.done as volatile?

Last edited 11 months ago by AGMS (previous) (diff)

comment:12 by mmlr, 11 months ago

This looks like a normal race condition between the evaluation of the done flag and the thread suspend. There is nothing to prevent the done flag from flipping after the evaluation but just before the thread suspends itself.

That theory could be verified by checking that the done flag is in fact true already while the waiting thread is suspended.

The proper solution would be to use a condition variable which deals with such cases.

comment:13 by waddlesplash, 11 months ago

Ah, I see; the thread which sets the flag to done is supposed to wake this one up, and if it finishes before this thread suspends itself, then we have a race condition. A condition variable sounds like the correct solution indeed.

comment:14 by mmlr, 11 months ago

Owner: changed from nobody to mmlr
Status: newin-progress

comment:15 by mmlr, 11 months ago

I've reproduced and validated that theory. Proposed fix pushed to gerrit here: https://review.haiku-os.org/#/c/haiku/+/803

comment:16 by waddlesplash, 11 months ago

Resolution: fixed
Status: in-progressclosed

Fix merged in hrev52691.

comment:17 by AGMS, 11 months ago

Tested in hrev52692 and it now works (gcc2, 4 core real hardware, parallel running 6 tests of load_image and 6 of system, after an hour of testing none halted). Not too surprisingly, load_image() is twice as fast as system() in running child programs.

Note: See TracTickets for help on using tickets.