Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#3399 closed bug (fixed)

The system reboots while running jam -q on the haiku trunk.

Reported by: bbjimmy Owned by: axeld
Priority: normal Milestone: R1
Component: System/Kernel Version: R1/pre-alpha1
Keywords: Cc: rossi@…, imker@…
Blocked By: Blocking:
Has a Patch: no Platform: All

Description

current version: hrev29100

Very intermittent problem. This happens about once every three times I compile haiku.

I am running an AMD Athlon XP 1.67 GHz with 448 MB RAM and 565 MB swap.

The only apps running at the time are Terminal and ActivityMonitor. The jam seems to be working properly, then suddenly the system reboots.

This is the only situation in which my system reboots by itself.

Otherwise Haiku is very stable and I am using it for my main OS.

Any clues on how to troubleshoot this?

Attachments (1)

syslog (427.9 KB) - added by rossi 10 years ago.


Change History (15)

comment:1 Changed 10 years ago by rossi

Cc: rossi@… added

The same happens here (Core Solo U1400 / 1 GB); however, it actually happens far more often for me: building the full tree reboots the machine 4-5 times. I haven't looked into it yet.

Changed 10 years ago by rossi

Attachment: syslog added

comment:2 Changed 10 years ago by rossi

Attached the syslog of the most recent spontaneous reboot during "jam -q haiku-image".

Btw, after the crash the filesystem seems to be in a weird state. Even after running checkfs, I can't zip the syslog after first cp'ing it into my home directory: the system panics, complaining that the vnode already exists.

comment:3 Changed 10 years ago by siarzhuk

Cc: imker@… added

comment:4 Changed 10 years ago by siarzhuk

Just for information:

I observe the same symptoms regularly, and not only while building Haiku. Sometimes it occurs while building other programs like vim or mc, sometimes while running the configure scripts of those programs. At least one build in three ends with this "unrequested" reboot.

Note that in about a week of working under Haiku GCC4 I have not observed this behavior at all. It looks like this is a GCC2-only issue, at least on my hardware. ;-)

My system is as follows:

  • Intel Pentium M processor 740 (1.73 GHz, 2 MB L2 cache, 533 MHz FSB)
  • Intel 915PM chipset and Intel PRO/Wireless 2200
  • ATI Mobility Radeon X700 XL PCI-Express with 128 MB DDR RAM
  • SAMSUNG 100 GB hard disk, 5,400 RPM, 8 MB cache
  • 1024 MB DDR II RAM (both memory banks populated)

Detailed description of my system is attached to ticket #1236.

comment:5 Changed 10 years ago by diver

Component: General → System/Kernel

comment:6 Changed 10 years ago by bonefish

Please give hrev32073 a try. It doesn't fix the underlying issue, but it might prevent the triple fault (which is what such a reboot is) by handling the double fault better. Ideally you'll now be dropped into a functional KDL.

comment:7 Changed 10 years ago by mmlr

I have yet to try with your recent changes, but I have narrowed it down on my side to a workaround that fully removes the double/triple faults for me. If I change "vm_translation_map_arch_info::Delete()" in arch_vm_translation_map.cpp to always use the deferred_delete instead of the direct delete, the faults do not occur anymore. Is it possible that the translation map that is being freed there can still be in use? In that case overwriting it with deadbeef would explain everything going toast.
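The workaround mmlr describes can be illustrated with a small userland sketch (hypothetical names; Haiku's real deferred_delete queues an object for destruction from a safe context instead of freeing it inline, which is simulated here with a simple queue):

```cpp
#include <cassert>
#include <functional>
#include <queue>

// Stand-in for the kernel's deferred-delete mechanism: deletions are
// queued and executed later, once no CPU can still reference the object.
static std::queue<std::function<void()>> gDeferredQueue;

struct vm_translation_map_arch_info {
    explicit vm_translation_map_arch_info(bool* flag) : destroyed_flag(flag) {}
    ~vm_translation_map_arch_info() { *destroyed_flag = true; }

    // The workaround: always defer the delete instead of freeing the
    // object inline, so a CPU still using the map never sees freed memory.
    void Delete() {
        gDeferredQueue.push([this] { delete this; });
    }

    bool* destroyed_flag;
};

// Runs the queued deletions from a "safe" context.
static void flush_deferred_deletes() {
    while (!gDeferredQueue.empty()) {
        gDeferredQueue.front()();
        gDeferredQueue.pop();
    }
}
```

The point of the workaround is visible in the timing: the object outlives the Delete() call, so a stale user would still find valid (if logically dead) memory rather than a deadbeef-overwritten block.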

comment:8 in reply to:  7 Changed 10 years ago by bonefish

Replying to mmlr:

I have yet to try with your recent changes, but I have narrowed it down on my side to a workaround that fully removes the double/triple faults for me. If I change "vm_translation_map_arch_info::Delete()" in arch_vm_translation_map.cpp to always use the deferred_delete instead of the direct delete, the faults do not occur anymore. Is it possible that the translation map that is being freed there can still be in use?

I don't see how that could happen. The vm_translation_map_arch_info objects are ref-counted. And the ref-counting scheme is extremely simple (feel encouraged to review):

  • The translation map creating the arch info objects owns the initial reference and frees it in destroy_tmap().
  • A CPU using an arch info has a reference to it. The CPUs' initial references to the kernel translation map arch info are acquired in arch_cpu_init_post_vm(). When the arch info changes (in arch_thread_context_switch()) the reference of the old arch info is released, and one acquired for the new arch info.
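The scheme bonefish outlines can be sketched as follows (a simplified simulation with illustrative names; the real code lives in arch_vm_translation_map.cpp and arch_thread_context_switch()):

```cpp
#include <atomic>
#include <cassert>

struct ArchInfo {
    explicit ArchInfo(bool* flag) : deleted_flag(flag) {}

    void AcquireReference() { ref_count.fetch_add(1); }
    void ReleaseReference() {
        if (ref_count.fetch_sub(1) == 1) {
            *deleted_flag = true;    // last reference gone: object is freed
            delete this;
        }
    }

    std::atomic<int> ref_count{1};   // initial reference owned by the map
    bool* deleted_flag;
};

// Mirrors the context-switch rule: a CPU acquires a reference to the new
// arch info before releasing its reference to the old one.
static ArchInfo* cpu_switch_arch_info(ArchInfo* oldInfo, ArchInfo* newInfo) {
    newInfo->AcquireReference();
    oldInfo->ReleaseReference();
    return newInfo;
}
```

With this ordering an arch info can only be destroyed once both the owning map (via destroy_tmap()) and every CPU have dropped their references.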

comment:9 Changed 10 years ago by mmlr

I'm sorry to report that even with an updated kernel including all your changes it still triple-faults :-(. Tracing output from just before the reboot clearly shows that the structures the scheduler uses are corrupted (as was to be expected). I will now try to review the translation map issue. Of course it's possible that something leading up to that point messes things up.

comment:10 in reply to:  9 Changed 10 years ago by bonefish

Replying to mmlr:

I'm sorry to report that even with an updated kernel including all your changes it still triple-faults :-(.

You could add a while (true); at the beginning of x86_double_fault_exception() (in arch_int.cpp) to verify that the double fault handler is at least being reached.

Unfortunately there has to be some trade-off between safely catching the double fault and still being able to get useful info in the kernel debugger (or being able to enter the kernel debugger at all). If the basic VM, CPU, ICI, or kernel debugger structures have been corrupted, the odds are that a double fault will end in a triple fault or an infinite exception loop. With some more work we could push the limit a bit further. Given how annoying triple faults are to debug, that might even be worth it.

Tracing output from just before the reboot clearly shows that the structures the scheduler uses are corrupted (as was to be expected). I will now try to review the translation map issue. Of course it's possible that something leading up to there messes up.

Yeah, e.g. a corrupted/deleted thread or team structure could theoretically cause any kind of damage, though in such a case things usually just crash earlier, and without double-faulting.

comment:11 Changed 10 years ago by mmlr

Resolution: fixed
Status: new → closed

Fixed in hrev32118. I hope I described it well enough in the commit message. In any case, it would be possible to fix this in different ways. For example, the currently unused arch_vm_aspace_swap() function could explicitly set the kernel page directory. Or one could simply read out cr3 on deletion and reset it to the kernel page directory when the page directory about to be deleted is detected (that's how I debugged this issue in the end). Feel free to suggest/implement other solutions as you see fit.
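The cr3 approach mmlr mentions can be sketched roughly like this (simulated in plain C++, since reading cr3 requires kernel mode; the names and addresses are purely illustrative):

```cpp
#include <cassert>
#include <cstdint>

// Simulated stand-ins for the CPU's cr3 register and the kernel page
// directory; in the real kernel these would be privileged register accesses.
static uint32_t gSimulatedCr3 = 0;               // active page directory
static const uint32_t kKernelPageDir = 0x1000;   // kernel's page directory

static uint32_t read_cr3() { return gSimulatedCr3; }
static void write_cr3(uint32_t pageDir) { gSimulatedCr3 = pageDir; }

// Before freeing a page directory, make sure the current CPU is not
// still running on it; if it is, fall back to the kernel page directory
// so the CPU never executes on freed translation structures.
static void delete_page_directory(uint32_t pageDir) {
    if (read_cr3() == pageDir)
        write_cr3(kKernelPageDir);
    // ... actually free the page directory here ...
}
```

This captures why the spontaneous reboot happened: a CPU whose cr3 still pointed at a freed (and deadbeef-overwritten) page directory would fault on its very next memory access, escalating to a double and then triple fault.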

comment:12 Changed 10 years ago by axeld

One could also consider moving the interrupt disabling/enabling into the assembly code - at least the other architectures don't need to call this with interrupts disabled, and it would be slightly faster as well (depending on the compiler, that is).

comment:13 in reply to:  12 Changed 10 years ago by mmlr

Replying to axeld:

One could also consider moving the interrupt disabling/enabling into the assembly code - at least the other architectures don't need to call this with interrupts disabled, and it would be slightly faster as well (depending on the compiler, that is).

Well, both calls take place in architecture-dependent code, so other archs aren't affected. I thought about disabling interrupts from the assembly code, but seeing that it is more than just a single line, and that disable_interrupts()/restore_interrupts() are implemented as inline functions containing inline assembly, I didn't think it was really necessary to duplicate it.

comment:14 in reply to:  13 Changed 10 years ago by axeld

Replying to mmlr:

Well, both calls take place in architecture-dependent code, so other archs aren't affected.

Okay, I missed that, great then! :-)
