Opened 2 months ago

Closed 2 months ago

#17847 closed bug (fixed)

Partial system freeze following scheduler changes

Reported by: kim1963
Owned by: nobody
Priority: blocker
Milestone: R1/beta4
Component: System/Kernel
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking:
Platform: x86-64

Description

Attachments (6)

AE671865-7085-466B-91E8-B41ACAA5272B.jpeg (3.9 MB) - added by nephele 2 months ago.
image.jpg (4.3 MB) - added by nephele 2 months ago.
5C93188F-93E6-41D6-B396-C3F2F8F4B12C.jpeg (3.6 MB) - added by nephele 2 months ago.
System freeze (1/2)
DD691359-1423-45B2-91BE-B0C54BF54A7E.jpeg (4.3 MB) - added by nephele 2 months ago.
System freeze (2/2)
fish_kdl.txt (18.1 KB) - added by leavengood 2 months ago.
KDL session for locked up team
Core3_thread_pend_state-cocobean.jpg (113.0 KB) - added by cocobean 2 months ago.
KDL run_queue info on CPUs

Change History (41)

comment:1 by kim1963, 2 months ago

I suggest reverting the changes in hrev56274 and re-checking whether the error still occurs in the operating system.

comment:2 by waddlesplash, 2 months ago

Please drop to KDL, get a backtrace of a hung thread, and post that here.

in reply to:  2 comment:3 by kim1963, 2 months ago

Replying to waddlesplash:

Please drop to KDL, get a backtrace of a hung thread, and post that here.

This is impossible to do. The bug occurs in a random window of any running application. The hung window cannot be killed with the regular tools of the operating system.

The only method for confirming the bug in hrev56274 is the one I proposed above.

"I suggest reverting the changes in hrev56274 and re-checking whether the error still occurs in the operating system."

Last edited 2 months ago by kim1963

comment:4 by pulkomandy, 2 months ago

The hung window cannot be killed with the regular tools of the operating system.

That's why Waddlesplash said to use the kernel debugger, which should still be usable even in that situation.

in reply to:  4 comment:5 by kim1963, 2 months ago

Replying to pulkomandy:

The hung window cannot be killed with the regular tools of the operating system.

That's why Waddlesplash said to use the kernel debugger, which should still be usable even in that situation.

I can't help the developers eliminate this critical bug. I am not qualified enough to use KDL and get a backtrace.

comment:6 by kim1963, 2 months ago

There is a non-zero probability that the bug in app_server is caused by the changes in hrev56273, hrev56275, or hrev56276.

comment:7 by kim1963, 2 months ago

Just now the problem occurred again on hrev56301. After 30-40 minutes the bug hit: a window hung, and other programs launched from LaunchBox do not start fully. Each program shows up in the Deskbar, but without a window. The standard shutdown from the Deskbar started, but did not work ... it hung ... . Only the power button let me turn off the computer.

comment:8 by kim1963, 2 months ago

I am trying to convey to the developers the full tragedy of the situation from the point of view of a user of the operating system.

comment:9 by pulkomandy, 2 months ago

Without the KDL backtrace we cannot do anything. It's not a question of "tragedy" or I don't know what. Unfortunately, this does not happen on all machines, so we have to find someone who can reproduce *and* investigate the problem. Otherwise we can't do anything.

comment:10 by pulkomandy, 2 months ago

Milestone: Unscheduled → R1/beta4

in reply to:  9 comment:11 by kim1963, 2 months ago

Replying to pulkomandy:

Without the KDL backtrace we cannot do anything. It's not a question of "tragedy" or I don't know what. Unfortunately, this does not happen on all machines, so we have to find someone who can reproduce *and* investigate the problem. Otherwise we can't do anything.

Another user has confirmed that the operating system is unstable from hrev56276 onward. The instability reaches the point where the computer keyboard stops working and the user is forced to power off the computer. It is impossible to fulfill your request for KDL and a backtrace.

comment:12 by kim1963, 2 months ago

HELP!!!!!!

comment:13 by waddlesplash, 2 months ago

The keyboard shortcut Alt+SysRq+D should still work to enter KDL even if keyboard input does not work normally. You may need a PS/2 keyboard, however; USB ones may or may not work in this state. After you have entered KDL, type bt thread-id, substituting the ID of a hung thread, to get the backtrace. If multiple users can confirm this behavior, hopefully one of them has a PS/2 keyboard and a machine with a port for it, and can retrieve such a backtrace.

If none of that works, you can revert to a prior state via the bootloader until this is fixed.

comment:14 by nephele, 2 months ago

This was with app_server not starting at all

by nephele, 2 months ago

Attachment: image.jpg added

comment:15 by nephele, 2 months ago

This was Quaternion starting and then becoming unkillable.

by nephele, 2 months ago

System freeze (1/2)

by nephele, 2 months ago

System freeze (2/2)

comment:16 by nephele, 2 months ago

And for another freeze (all clients unresponsive in the app_server, but I can move windows... can't shut down the system with KDL anymore)

A few more threads backtraced

comment:17 by pulkomandy, 2 months ago

Summary: Regression in hrev56273 ... hrev56276 → Partial system freeze, regression in hrev56273 ... hrev56276

comment:18 by leavengood, 2 months ago

I am on hrev56315, using x86_64. I built fish shell from their master branch and when I run it, certain operations cause it to lock up. When I check KDL both threads are stuck on reschedule. I assume it is related to this bug and hrev56274.

To reproduce, build fish shell (it needs the packages cmake, ncurses6_devel, and libpcre2_devel), run it from the build directory, and then try:

source ~/config/settings/fish/config.fish

This 100% reliably locks up fish for me.

This also was causing lock-ups but it worked the last few times I tried it:

set -g fish_prompt_pwd_dir_length 0

comment:19 by leavengood, 2 months ago

I downgraded to hrev56272 and now the above command does not lock up fish.

comment:20 by waddlesplash, 2 months ago

Does there need to be a config.fish file?

by leavengood, 2 months ago

Attachment: fish_kdl.txt added

KDL session for locked up team

comment:21 by waddlesplash, 2 months ago

The fish problem seems unrelated and is actually reproducible on hrev56272. That likely is an issue in the FIFO system and deserves a separate ticket.

comment:22 by waddlesplash, 2 months ago

Patch which may help with the problem: https://review.haiku-os.org/c/haiku/+/5520

As I am still unable to reproduce it, it is totally untested. I have more ideas of things to experiment with if that does not change anything.

in reply to:  22 comment:24 by madmax, 2 months ago

Replying to waddlesplash:

I am still unable to reproduce it

I finally got the same kind of crashes this morning, or I think they are the same because I couldn't reproduce them with the change in hrev56274 reverted. I did so by disabling CPUs in the process manager (leaving 4 on, down from 12; disabling SMP in the boot menu did not trigger the crashes), running git status in a loop in the background (it uses quite a few threads), and playing with the font preferences in Vision as in #17850 (which I'm guessing is this same underlying bug).

Will try your new patch this evening.

comment:25 by waddlesplash, 2 months ago

Disabling CPUs is known to freeze the system randomly independently of recent changes, see #15100. So unless you cannot manage to reproduce that on an older build it is almost certainly an unrelated issue.

The Vision font changes may be related. I guess I can try to test with that myself later.

comment:26 by madmax, 2 months ago

Older than that other bug or just a few weeks ago? I couldn't reproduce it reverting the invoke_scheduler flag change. I don't know if that's because I didn't try hard enough, of course, but it triggered quite fast with master.

Anyway, while trying to crash it without disabling CPUs by adding more threads blocking on filesystem operations, it triggered by:

  1. boot
  2. open terminal, cd into haiku sources (or some other big enough git repo) and while true; do git status; done &
  3. open another terminal, fool around changing directories

This is less reliable, but still crashed more than half the times I tried.
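
The background load from step 2 can be wrapped in a small helper that stops on its own. This is only a sketch: the function name stress_git_status, its two arguments, and the bounded iteration count are assumptions made for illustration. The steps above used an endless loop, and nothing here changes what is being exercised (repeated git status calls in a large repository).

```shell
#!/bin/sh
# Hypothetical helper (not from the ticket): run `git status` repeatedly in a
# repository to generate filesystem and thread load, then print how many
# iterations ran. The original steps used an endless `while true` loop.
stress_git_status() {
    repo="$1"          # path to a large git checkout (assumption)
    iterations="$2"    # number of `git status` runs (assumption)
    count=0
    while [ "$count" -lt "$iterations" ]; do
        # Discard output; a failing git call must not stop the loop.
        git -C "$repo" status >/dev/null 2>&1 || true
        count=$((count + 1))
    done
    echo "$count"
}
```

For example, stress_git_status ~/haiku 1000 & (with ~/haiku standing in for wherever the sources live) would start the load in the background, after which a second Terminal can be used to change directories as in step 3.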

comment:27 by waddlesplash, 2 months ago

OK. Can you try removing the conditions altogether, i.e. setting invoke_scheduler unconditionally at that point in the function, and see if that makes any difference?

comment:28 by waddlesplash, 2 months ago

I tried the Vision font selector and a bunch of other things and I still cannot reproduce the problem here.

I'm mostly running in VMware, but I tested on bare metal as well, and either way I could not trigger any hangs.

by cocobean, 2 months ago

KDL run_queue info on CPUs

comment:29 by cocobean, 2 months ago

Tested hrev56325 x64. A thread gets suspended on my physical system; see Core 3 in the attachment I added. It resumes if I click between active and inactive windows. I can also jump into KDL, run commands, and exit, after which the thread resumes. Sometimes it happens again within 30 minutes, sometimes not at all, and sometimes only after several hours of a sustained haikuporter-based build. I tested this by building both blender and qtwebengine with haikuporter (i.e. a long stress test). With smaller hp builds like blender, the threading issue may happen in 1 out of 3-4 builds. With qtwebengine, I usually get it within 20-30 minutes after compiling starts.

If you use ProcessController, it may trigger a system freeze. There is no high memory, CPU, or thread usage while the thread is pending/stalled.

Last edited 2 months ago by cocobean

comment:30 by waddlesplash, 2 months ago

OK, I think I finally managed to trigger this. The key is that you have to put the system in "Power saving" instead of "Low latency"/"High performance" mode. Then indeed I get the random lockups with 0% CPU usage and windows quit redrawing, but I can still move the mouse.

Last edited 2 months ago by waddlesplash

comment:31 by waddlesplash, 2 months ago

Summary: Partial system freeze, regression in hrev56273 ... hrev56276 → Partial system freeze following scheduler changes

This is strange. The runqueues on my system in a frozen state are not empty, but the actually running threads on each core are the idle ones. Meanwhile the "idle_cores" command claims no packages are idle. So how did we wind up in this state, and how do we get out of it...?

comment:32 by tqh, 2 months ago

Perhaps check with the condvar refactor reverted to see if there is some subtle bug introduced there?

comment:33 by waddlesplash, 2 months ago

I pushed a fix in hrev56332. At least I can't reproduce the problem anymore. Let me know if that fixes it for everyone else.

in reply to:  33 comment:34 by kim1963, 2 months ago

Replying to waddlesplash:

I pushed a fix in hrev56332. At least I can't reproduce the problem anymore. Let me know if that fixes it for everyone else.

hrev56332 works fine!

comment:35 by korli, 2 months ago

Resolution: fixed
Status: new → closed
Version: R1/beta3 → R1/Development