#13081 closed bug (fixed)
Can only run any program a limited number of times before hanging
Reported by: | AGMS | Owned by: | mmlr |
---|---|---|---|
Priority: | normal | Milestone: | R1/beta2 |
Component: | System | Version: | R1/Development |
Keywords: | relocate_image, system() | Cc: | agmsmith@… |
Blocked By: | Blocking: | ||
Platform: | x86 |
Description
If you repeatedly run a program, whether via system() or load_image(), you only get a few hundred runs before the program hangs. Or rather, it seems the child program hangs in relocate_image while it is being loaded, before it's main thread starts.
See the attached stress test program example (redirect standard output to a file to make it more readable). It repeatedly runs /bin/date, with a pause between iterations. Expect to get less than 1000 runs.
This may explain why Haiku gets stuck if left running long enough.
Tested in hrev50678, also affects earlier versions.
Attachments (4)
Change History (22)
by , 8 years ago
Attachment: | SystemForkTest.cpp added |
---|
by , 8 years ago
Attachment: | RelocateImageBug.jpeg added |
---|
Debugger running on the child /bin/date, showing where it's stuck during load.
comment:1 by , 8 years ago
It happens less often on real hardware. I was using VirtualBox on a Linux system with background CPU usage at full use by another program (running at a "nice" low priority).
I found that running about 10 parallel tests (10 Terminal windows open, stdout redirected to /dev/null) does make it happen faster on real hardware. Same failure point in relocate_image.
By the way, is program loading single threaded? I can open several windows to run the test on a 4 core machine, but it maxes out at 1 core equivalent of CPU usage.
It would also be nice if load_image() returned after you kill a partially loaded but not yet running dead child program. Right now the parent remains stuck even after the child is killed and disappears from the ProcessController thread listing.
comment:2 by , 8 years ago
I'm not sure it's related, but there are also millions of page faults happening. But no hard drive activity.
follow-up: 4 comment:3 by , 8 years ago
I failed to reproduce on hrev50773 x86_64, qemu 2CPUs 10 parallel tests (system() and load_image()), 1000 runs.
comment:4 by , 8 years ago
Replying to korli:
I failed to reproduce on hrev50773 x86_64, qemu 2CPUs 10 parallel tests (system() and load_image()), 1000 runs.
You're right, I couldn't get it to happen in x86_64 either. Though the forking/execute program speed is slow, over 10 minutes, with 10 windows open, got up to 8000ish iterations, so that's 80000 forks in 10 minutes or 8000 forks per minute or 133 forks per second.
With 32 bit Haiku, it's a bit faster (10000 interations in 10 minutes, or 166 forks per second), until it stops working (first one cut out at just over 5000 iterations on a bare hardware test, hrev50765, Intel 4820K i7 4 core CPU at 3.7Ghz).
And with a recent Linux on the same hardware, it zips along at 9000 forks per second.
follow-up: 6 comment:5 by , 8 years ago
Platform: | All → x86 |
---|
You write "It happens less often on real hardware", I assume it's easier to reproduce in a virtual machine, or not?
comment:6 by , 8 years ago
It's easier to recreate in a VM than on real hardware, probably because it's slower and that reveals a race condition in a shorter time. Even easier if the host computer is running other programs in the background which slows it down more.
comment:7 by , 6 years ago
Just confirmed that the bug is still there in hrev52139, using gcc2 on real hardware.
comment:8 by , 6 years ago
In hrev52689 GCC2 it happens differently this time. Rather than dying in relocate_image, it stops in load_image. Seems to be running, but there's no output happening. It also seemed to happen less often (just one process died, not more). But then I was running with 1 CPU core in VirtualBox (I'll test on real hardware with 4 cores soon). I did a KDL backtrace on it, and a debugger trace on the syscall (see attached screen shots). Oddly, it was able to start running again by using the debugger Run button. Maybe this one is some sort of scheduler bug?
by , 6 years ago
Attachment: | 20181226 Debugger 13081 Screen Shot.png added |
---|
User land debugger showing the syscall of the stuck team.
by , 6 years ago
Attachment: | 20181226 KDL 13081 Screen Shot.jpeg added |
---|
Backtrace in KDL of the stuck process.
comment:9 by , 6 years ago
I can confirm that it happens on real hardware too, with the same stall in scheduling (hitting Run in the debugger gets it going again). Had it happen to 2 out of 12 parallel tests. 4 core CPU.
comment:10 by , 6 years ago
That's suspending here: http://xref.plausible.coop/source/xref/haiku/src/system/kernel/team.cpp#1842
It's strange that this occurs on GCC2 but not GCC7. It would be interesting to see a disassembly of this function; it's possible GCC2 is incorrectly optimizing it in some way.
comment:12 by , 6 years ago
This looks like a normal race condition between the evaluation of the done flag and the thread suspend. There is nothing to prevent the done flag from flipping after the evaluation but just before the thread suspends itself.
That theory could be verified by checking that the done flag is in fact true already while the waiting thread is suspended.
The proper solution would be to use a condition variable which deals with such cases.
comment:13 by , 6 years ago
Ah, I see; the thread which sets the flag to done is supposed to wake this one up, and if it finishes before this thread suspends itself, then we have a race condition. A condition variable sounds like the correct solution indeed.
comment:14 by , 6 years ago
Owner: | changed from | to
---|---|
Status: | new → in-progress |
comment:15 by , 6 years ago
I've reproduced and validated that theory. Proposed fix pushed to gerrit here: https://review.haiku-os.org/#/c/haiku/+/803
comment:16 by , 6 years ago
Resolution: | → fixed |
---|---|
Status: | in-progress → closed |
Fix merged in hrev52691.
comment:17 by , 6 years ago
Tested in hrev52692 and it now works (gcc2, 4 core real hardware, parallel running 6 tests of load_image and 6 of system, after an hour of testing none halted). Not too surprisingly, load_image() is twice as fast as system() in running child programs.
comment:18 by , 5 years ago
Milestone: | Unscheduled → R1/beta2 |
---|
Assign tickets with status=closed and resolution=fixed within the R1/beta2 development window to the R1/beta2 Milestone
Stress test program for demonstrating the bug.