Opened 2 years ago
Closed 2 years ago
#17847 closed bug (fixed)
Partial system freeze following scheduler changes
Reported by: | kim1963 | Owned by: | nobody |
---|---|---|---|
Priority: | blocker | Milestone: | R1/beta4 |
Component: | System/Kernel | Version: | R1/Development |
Keywords: | | Cc: | |
Blocked By: | | Blocking: | |
Platform: | x86-64 | | |
Description
The OS is unstable. Details: https://discuss.haiku-os.org/t/hrev56276-and-later-unstable-work/12364
hrev56272 works fine!
Attachments (6)
Change History (41)
comment:1 by , 2 years ago
follow-up: 3 comment:2 by , 2 years ago
Please drop to KDL, get a backtrace of a hung thread, and post that here.
comment:3 by , 2 years ago
Replying to waddlesplash:
Please drop to KDL, get a backtrace of a hung thread, and post that here.
This is impossible to do. The bug occurs in a random window of any running application. The hung window cannot be killed with the regular tools of the operating system.
The only way to confirm that the bug is in hrev56274 is the one I proposed above:
"It is advisable to revert the changes in hrev56274 and re-check whether the error is still present in the operating system."
follow-up: 5 comment:4 by , 2 years ago
The hung window cannot be killed with the regular tools of the operating system.
That's why Waddlesplash said to use the kernel debugger, which should still be usable even in that situation.
comment:5 by , 2 years ago
Replying to pulkomandy:
The hung window cannot be killed with the regular tools of the operating system.
That's why Waddlesplash said to use the kernel debugger, which should still be usable even in that situation.
I can't help the developers eliminate this critical bug. I am not qualified enough to use KDL and get a backtrace.
comment:6 by , 2 years ago
There is a non-zero probability that the bug in app_server is caused by the changes in hrev56273, hrev56275, or hrev56276.
comment:7 by , 2 years ago
Just now the problem hit again on hrev56301. After 30-40 minutes the bug struck and a window hung. Other programs launched from LaunchBox do not start completely: they show up in the Deskbar, but without a window. The standard shutdown from Deskbar started, but did not work ... it hung on ... . Only the power button allowed me to turn off the computer.
comment:8 by , 2 years ago
I am trying to convey to the developers the full tragedy of the situation from the point of view of a user of the operating system.
follow-up: 11 comment:9 by , 2 years ago
Without the KDL backtrace we cannot do anything. It's not a question of "tragedy" or I don't know what, unfortunately this does not happen on all machines so we have to find someone who can reproduce *and* investigate the problem. Otherwise we can't do anything.
comment:10 by , 2 years ago
Milestone: | Unscheduled → R1/beta4 |
---|
comment:11 by , 2 years ago
Replying to pulkomandy:
Without the KDL backtrace we cannot do anything. It's not a question of "tragedy" or I don't know what, unfortunately this does not happen on all machines so we have to find someone who can reproduce *and* investigate the problem. Otherwise we can't do anything.
Another user has confirmed that hrev56276 and later are unstable. The instability reaches the point where the computer keyboard stops working and the user is forced to power off the computer. It is impossible to fulfill your requirements for KDL and a backtrace.
comment:13 by , 2 years ago
The keyboard shortcut Alt+SysRq+D should still work to enter KDL even if keyboard input does not work normally. You may need a PS/2 keyboard however, USB ones may or may not work in this state, it varies. After you have entered KDL, type bt thread-id, substituting the ID of a hung thread, to get the backtrace. If multiple users can confirm this behavior, hopefully one of them has a PS/2 keyboard and a machine with an input jack for it to be able to retrieve such a backtrace.
If none of that works, you can revert to a prior state via the bootloader until this is fixed.
by , 2 years ago
Attachment: | AE671865-7085-466B-91E8-B41ACAA5272B.jpeg added |
---|
by , 2 years ago
comment:16 by , 2 years ago
And for another freeze (all clients unresponsive in the app_server, but I can still move windows... I can't shut down the system with KDL anymore)
A few more threads backtraced
comment:17 by , 2 years ago
Summary: | Regression in hrev56273 ... hrev56276 → Partial system freeze, regression in hrev56273 ... hrev56276 |
---|
comment:18 by , 2 years ago
I am on hrev56315, using x86_64. I built fish shell from their master branch and when I run it, certain operations cause it to lock up. When I check KDL both threads are stuck on reschedule. I assume it is related to this bug and hrev56274.
To reproduce, build fish shell (it needs the packages cmake, ncurses6_devel, and libpcre2_devel), run it from the build directory, and then try:
source ~/config/settings/fish/config.fish
This 100% reliably locks up fish for me.
This also was causing lock-ups but it worked the last few times I tried it:
set -g fish_prompt_pwd_dir_length 0
comment:19 by , 2 years ago
I downgraded to hrev56272 and now the above command does not lock up fish.
comment:21 by , 2 years ago
The fish problem seems unrelated and is actually reproducible on hrev56272. That likely is an issue in the FIFO system and deserves a separate ticket.
follow-up: 24 comment:22 by , 2 years ago
Patch which may help with the problem: https://review.haiku-os.org/c/haiku/+/5520
As I am still unable to reproduce it, it is totally untested. I have more ideas of things to experiment with if that does not change anything.
comment:23 by , 2 years ago
comment:24 by , 2 years ago
Replying to waddlesplash:
I am still unable to reproduce it
I finally got the same kind of crashes this morning, or at least I think they are the same, because I couldn't reproduce them with the change in hrev56274 reverted. I triggered them by disabling CPUs in the process manager (leaving 4 enabled out of 12; disabling SMP in the boot menu did not trigger the crashes), running git status in a loop in the background (it uses quite a few threads), and playing with the font preferences in Vision as in #17850 (which I'm guessing is this same underlying bug).
Will try your new patch this evening.
comment:25 by , 2 years ago
Disabling CPUs is known to freeze the system randomly, independently of recent changes; see #15100. So unless it fails to reproduce on an older build, it is almost certainly an unrelated issue.
The Vision font changes may be related. I guess I can try to test with that myself later.
comment:26 by , 2 years ago
Older than that other bug or just a few weeks ago? I couldn't reproduce it reverting the invoke_scheduler flag change. I don't know if that's because I didn't try hard enough, of course, but it triggered quite fast with master.
Anyway, while trying to crash it without disabling CPUs by adding more threads blocking on filesystem operations, it triggered by:
- boot
- open terminal, cd into haiku sources (or some other big enough git repo) and
while true; do git status; done &
- open another terminal, fool around changing directories
This is less reliable, but still crashed more than half the times I tried.
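The loop in the steps above runs until it is killed. When comparing several builds, a bounded variant may be handier. What follows is only a minimal sketch, assuming a POSIX shell, a `date` that supports `+%s`, and a git new enough to have `-C`; the function name `stress_git_status` and its arguments are made up for illustration.

```shell
#!/bin/sh
# Sketch: bounded version of the background `git status` loop.
# stress_git_status is a hypothetical helper name, not a Haiku or git tool.

# Usage: stress_git_status REPO_DIR SECONDS
# Runs `git status` in REPO_DIR repeatedly until SECONDS have elapsed,
# then prints how many iterations were run.
stress_git_status() {
    repo=$1
    duration=$2
    end=$(( $(date +%s) + duration ))
    count=0
    while [ "$(date +%s)" -lt "$end" ]; do
        # Output is discarded; only the filesystem and thread load matters.
        git -C "$repo" status >/dev/null 2>&1
        count=$(( count + 1 ))
    done
    echo "$count"
}
```

Something like `stress_git_status /path/to/haiku 600 &` (the path is a placeholder) would then keep the load going for ten minutes in the background while a second Terminal is used to change directories.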
comment:27 by , 2 years ago
OK. Can you try removing the conditions altogether, i.e. setting invoke_scheduler unconditionally at that point in the function, and see if that makes any difference?
comment:28 by , 2 years ago
I tried the Vision font selector and a bunch of other things and I still cannot reproduce the problem here.
I'm mostly running in VMware, but I tested on bare metal as well, and either way I could not incur any hangs.
comment:29 by , 2 years ago
Tested hrev56325 x64. A thread gets suspended on my physical system; see Core 3 in the attachment I added. It resumes if I click between active and inactive windows. I can also jump into KDL, run commands, and exit, after which the thread resumes. Sometimes it happens again within 30 minutes, sometimes not at all, or only after several hours of a sustained haikuporter-based build. I tested this by building both blender and qtwebengine with haikuporter (i.e. a long stress test). With smaller haikuporter builds like blender, the threading issue may happen in 1 out of 3-4 builds. With qtwebengine, I usually hit it within 20-30 minutes after compiling starts.
If you use ProcessController, it may trigger a system freeze. There is no high memory/CPU/thread usage while the thread is pending/stalled.
comment:30 by , 2 years ago
OK, I think I finally managed to trigger this. The key is that you have to put the system in "Power saving" instead of "Low latency"/"High performance" mode. Then indeed I get the random lockups with 0% CPU usage and windows quit redrawing, but I can still move the mouse.
comment:31 by , 2 years ago
Summary: | Partial system freeze, regression in hrev56273 ... hrev56276 → Partial system freeze following scheduler changes |
---|
This is strange. The runqueues on my system in a frozen state are not empty, but the actually running threads on each core are the idle ones. Meanwhile the "idle_cores" command claims no packages are idle. So how did we wind up in this state, and how do we get out of it...?
comment:32 by , 2 years ago
Perhaps check with the condvar refactor reverted to see if there is some subtle bug introduced there?
follow-up: 34 comment:33 by , 2 years ago
I pushed a fix in hrev56332. At least I can't reproduce the problem anymore. Let me know if that fixes it for everyone else.
comment:34 by , 2 years ago
Replying to waddlesplash:
I pushed a fix in hrev56332. At least I can't reproduce the problem anymore. Let me know if that fixes it for everyone else.
hrev56332 works fine!
comment:35 by , 2 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Version: | R1/beta3 → R1/Development |