Opened 12 years ago

Closed 12 years ago

#8068 closed bug (fixed)

A crash bring up the app_server in gdb when changing the screen resolution

Reported by: oco Owned by: axeld
Priority: normal Milestone: R1
Component: Servers/app_server Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description (last modified by mmlr)

Reproducible on a fresh install after few change of the screen resolution (less than five) on my laptop (Vesa mode)

I use release hrev42926

Overview of the backtrace :

free
MallocBuffer::MallocBuffer
AccelerantHWInterface::SetMode
Screen::SetMode
Desktop::SetScreenMode
ServerApp::_DispatchMessage
ServerApp::_MessageLooper
...

See attached photo for detailled backtrace.

Attachments (1)

IMG_3808.JPG (2.4 MB ) - added by oco 12 years ago.

Change History (8)

by oco, 12 years ago

Attachment: IMG_3808.JPG added

comment:1 by diver, 12 years ago

Version: R1/alpha3R1/Development

comment:2 by mmlr, 12 years ago

Description: modified (diff)

I've seen this on another machine and it really is easily reproducible. I've added enough debug output to rule out that it comes from the place the stack trace would suggest. The buffer handling is fine, the free call is merely a victim of what's going on. My debug efforts showed that everything works as expected up to and including the call of fAccSetDisplayMode in AccelerantHWInterface::SetMode(). After that it seems like any access to libroot functions will fault (including the printf I added for debugging). I've tried to narrow it further down, but I can only suspect a side effect of the vm86 code to be the problem here. Since the code doesn't run through to update of the KDL framebuffer with the added debug output I wasn't able to gather more info just yet. I'll try to investigate further, but I don't really have a good overview of what should or should not happen within the vm86 code.

comment:3 by mmlr, 12 years ago

I've built a small script that gets the app_server reliably crashing after a couple of seconds when running normally:

while true
do
screenmode -q 1280 1024
screenmode -q 1024 768
done

Which will cause continuous mode switches until the crash happens.

Using that I've confirmed that this only happens when both CPUs (it's a dual core machine) are active. Disabling one core prevents the crash from happening. So I'd suspect that the vm86 stuff needs to run on the same CPU that it was set up on and causes side effects leading to the crash otherwise.

in reply to:  3 ; comment:4 by bonefish, 12 years ago

Replying to mmlr:

So I'd suspect that the vm86 stuff needs to run on the same CPU that it was set up on and causes side effects leading to the crash otherwise.

That's probably easy to verify. There's a kernel API to pin a thread to its CPU (thread_pin_to_current_cpu()/thread_unpin_from_current_cpu()).

in reply to:  4 comment:5 by mmlr, 12 years ago

Replying to bonefish:

Replying to mmlr:

So I'd suspect that the vm86 stuff needs to run on the same CPU that it was set up on and causes side effects leading to the crash otherwise.

That's probably easy to verify. There's a kernel API to pin a thread to its CPU (thread_pin_to_current_cpu()/thread_unpin_from_current_cpu()).

Yes, I realize, and it does work as a workaround as well. I'd like to figure out what actually happens though. The virtual 8086 mode should be task specific and switching between threads should just store/restore the state as with any other thread (the CPU isn't specifically set up, just an appropriate iframe is constructed). It obviously works on the same CPU when being rescheduled (forcing that by inserting some pauses into the code path) without any side effects, so I don't yet really see why it'd fail running on a different CPU.

comment:6 by bonefish, 12 years ago

I'm unfamiliar with that code. I would probably examine the crash situation a bit more. If, as you say, all libroot functions seem to crash, something may have happened to the respective area or just, lower level, to the page table(s). I usually add a panic() in vm_page_fault() in such cases, so it is easier to examine and it is very unlikely that something else happened in between that might have changed the situation.

Good luck hunting, anyway! :-)

comment:7 by mmlr, 12 years ago

Resolution: fixed
Status: newclosed

Fixed in hrev43321.

The libroot calls actually crashed in _errnop() where the clobbered TLS was used. Since TLS is re-setup on reschedule (more specifically on context switch) it also explains why I didn't see the crash when introducing enough pauses to always trigger a reschedule before returning from the vm86 code (even with SMP). Obviously staying on the same CPU worked as well as the restored TLS value happened to be the right one.

Note: See TracTickets for help on using tickets.