Opened 2 years ago

Closed 2 years ago

Last modified 2 years ago

#17455 closed bug (fixed)

KDL in VirtualBox in ConditionVariableEntry::_RemoveFromVariable

Reported by: jmairboeck Owned by: nobody
Priority: normal Milestone: R1/beta4
Component: System/Kernel Version: R1/Development
Keywords: Cc:
Blocked By: #17444, #17686 Blocking:
Platform: All

Description

I just updated to hrev55706 and got this KDL when closing AboutSystem.

KERN: PANIC: variable pointer was not unset for a long time!
KERN: Welcome to Kernel Debugging Land...
KERN: Thread 621 "w:617:Über dieses System" running on CPU 1
KERN: stack trace for thread 621 "w:617:Über dieses System"
KERN:     kernel stack: 0xffffffff8070c000 to 0xffffffff80711000
KERN:       user stack: 0x00007f917f1f1000 to 0x00007f917f231000
KERN: frame                       caller             <image>:function + offset
KERN:  0 ffffffff80710ba8 (+  24) ffffffff8014444c   <kernel_x86_64> arch_debug_call_with_fault_handler + 0x16
KERN:  1 ffffffff80710bc0 (+  80) ffffffff800adb08   <kernel_x86_64> debug_call_with_fault_handler + 0x78
KERN:  2 ffffffff80710c10 (+  96) ffffffff800af123   <kernel_x86_64> kernel_debugger_loop(char const*, char const*, __va_list_tag*, int) + 0xf3
KERN:  3 ffffffff80710c70 (+  80) ffffffff800af4be   <kernel_x86_64> kernel_debugger_internal(char const*, char const*, __va_list_tag*, int) + 0x6e
KERN:  4 ffffffff80710cc0 (+ 240) ffffffff800af817   <kernel_x86_64> panic + 0xb7
KERN:  5 ffffffff80710db0 (+  80) ffffffff800562f7   <kernel_x86_64> ConditionVariableEntry::_RemoveFromVariable() + 0x107
KERN:  6 ffffffff80710e00 (+  64) ffffffff8005652e   <kernel_x86_64> ConditionVariableEntry::Wait(unsigned int, long) + 0xae
KERN:  7 ffffffff80710e40 (+ 144) ffffffff8006d81c   <kernel_x86_64> _get_port_message_info_etc + 0x11c
KERN:  8 ffffffff80710ed0 (+  80) ffffffff8006e92b   <kernel_x86_64> _user_port_buffer_size_etc + 0x4b
KERN:  9 ffffffff80710f20 (+  16) ffffffff80145f3f   <kernel_x86_64> x86_64_syscall_entry + 0xfb
KERN: user iframe at 0xffffffff80710f30 (end = 0xffffffff80710ff8)
KERN:  rax 0xdb                  rbx 0x7fffffffffffffff    rcx 0x4a3d1eb59c
KERN:  rdx 0x7fffffffffffffff    rsi 0x0                   rdi 0x123
KERN:  rbp 0x7f917f2304a0         r8 0x112351a3e7a0         r9 0x0
KERN:  r10 0x0                   r11 0x246                 r12 0x0
KERN:  r13 0x1123526eba90        r14 0x1123526eba90        r15 0x7fffffffffffffff
KERN:  rip 0x4a3d1eb59c          rsp 0x7f917f230478     rflags 0x246
KERN:  vector: 0x63, error code: 0x0
KERN: 10 ffffffff80710f30 (+140265020061040) 0000004a3d1eb59c   <libroot.so> _kern_port_buffer_size_etc + 0x0c
KERN: 11 00007f917f2304a0 (+  64) 000001fcdde2bed6   <libbe.so> BPrivate::LinkReceiver::ReadFromPort(long) + 0x26
KERN: 12 00007f917f2304e0 (+  32) 000001fcdde2be89   <libbe.so> BPrivate::LinkReceiver::GetNextMessage(int&, long) + 0x69
KERN: 13 00007f917f230500 (+ 112) 00000203136d948e   <_APP_> ServerWindow::_MessageLooper() + 0x12e
KERN: 14 00007f917f230570 (+  16) 00000203136bccea   <_APP_> MessageLooper::_message_thread(void*) + 0x0a
KERN: 15 00007f917f230580 (+  32) 0000004a3d1ea3a9   <libroot.so> _thread_do_exit_work (nearest) + 0x89
KERN: 16 00007f917f2305a0 (+   0) 00007fdc25053260   <commpage> commpage_thread_exit + 0x00
KERN: kdebug> co

The KDL was continuable and it seems to run fine now.

Is this related to #17444?

Attachments (3)

syslog (70.2 KB ) - added by jmairboeck 2 years ago.
syslog.2 (384.7 KB ) - added by jmairboeck 2 years ago.
syslog.3 (118.0 KB ) - added by jmairboeck 2 years ago.

Download all attachments as: .zip

Change History (25)

by jmairboeck, 2 years ago

Attachment: syslog added

comment:1 by waddlesplash, 2 years ago

Yes.

It seems thread_unblock is much more expensive than I'd hoped, and that it is still running at the point the other thread wakes up and tries to do things, leading to this assertion failure. I could increase the timeout, or perhaps there is something else I can do still...

comment:2 by jmairboeck, 2 years ago

My laptop's CPU doesn't run at full frequency (it fluctuates quite a bit) because of a somewhat broken fan, and I suppose the virtualization doesn't make it better. Using Linux as host OS.

comment:3 by waddlesplash, 2 years ago

To be honest, it is kind of concerning that thread_unblock is apparently significantly more expensive than 10000 atomic_pointer_gets, even after waking up the other thread.

comment:4 by waddlesplash, 2 years ago

Hopefully hrev55709 resolves this.

comment:5 by jmairboeck, 2 years ago

I finished building with the patch you sent yesterday and just got one such KDL again, but less frequent than before, it seems. I will now update to the official build and test again. Hopefully the increased timeout fixes this completely.

comment:6 by jmairboeck, 2 years ago

Unfortunately, it didn't seem to fix this completely. I just got one directly after booting hrev55709. :(

Last edited 2 years ago by jmairboeck (previous) (diff)

comment:7 by jmairboeck, 2 years ago

On the other hand, it now runs perfectly fine, after the initial ones directly after booting, even playing audio, and having Gerrit opened in Web+. Most of the KDLs came from the "Audio mixer" and "Websocket" threads before.

comment:8 by jmairboeck, 2 years ago

And as I pressed "submit" on the previous comment, I immediately got 3 more, all from the "Audio mixer", which shouldn't do anything right now ...

comment:9 by waddlesplash, 2 years ago

Can you attach a new syslog, please?

by jmairboeck, 2 years ago

Attachment: syslog.2 added

comment:10 by waddlesplash, 2 years ago

Blocked By: 17444 added

comment:11 by waddlesplash, 2 years ago

As you are in VirtualBox, I wonder if #14979 is the main "cause" here.

comment:12 by jmairboeck, 2 years ago

Using only a single virtual CPU seems to really avoid the issue reliably. I have not gotten any KDL yet in the last few hours. So these two issues really seem to be related, if I understand that correctly.

comment:13 by waddlesplash, 2 years ago

The KDL cannot occur at all when there is only one CPU, so, that would make sense that you don't see it. :)

comment:14 by waddlesplash, 2 years ago

Can you try enabling X2APIC for your guest VM, using VBoxManage modifyvm "name" --x2apic on (and then re-enabling more CPU cores) and see what happens? (Please check your syslog after doing this to make sure X2APIC is actually enabled.)

comment:15 by jmairboeck, 2 years ago

Still no luck with that option enabled. I see it occuring in the syslog a few times, but still got a KDL just now.

by jmairboeck, 2 years ago

Attachment: syslog.3 added

comment:16 by waddlesplash, 2 years ago

I have pushed a patch that may help here at https://review.haiku-os.org/c/haiku/+/4819. There are test builds available, could you install the appropriate one (should be https://haiku.movingborders.es/testbuild/I6e00286508c069705e07c9a0b59af2cf5e15e427/1/hrev55735/x86_64/haiku.hpkg) and see if that fixes things?

comment:17 by korli, 2 years ago

Please test with hrev55737 when it's available.

comment:18 by jmairboeck, 2 years ago

It seems that it is still not fully fixed (hrev55742).

comment:19 by waddlesplash, 2 years ago

How frequently does it occur, now? More or less than your last comments about it?

comment:20 by jmairboeck, 2 years ago

As you have reduced the timeout again also, it is definitely less frequent than before with the old timeout. But I think this depends quite heavily on the host CPU frequency, how often this occurs.

comment:21 by waddlesplash, 2 years ago

Milestone: UnscheduledR1/beta4
Resolution: fixed
Status: newclosed

Panic downgraded into a syslog print in hrev55804.

comment:22 by jmairboeck, 2 years ago

Blocked By: 17686 added
Note: See TracTickets for help on using tickets.