Opened 12 months ago

Closed 2 weeks ago

#14530 closed bug (fixed)

KDL when running profiler

Reported by: luroh
Owned by: nobody
Priority: normal
Milestone: Unscheduled
Component: System/Kernel
Version: R1/Development
Keywords:
Cc: korli, mmlr
Blocked By:
Blocking: #15108
Has a Patch: no
Platform: x86-64

Description

64-bit r1beta1 hrev52295+91

Fairly easy to reproduce: it happens ~20% of the time when running 'profile HaikuDepot > output.txt'.

PANIC: Unexpected exception "General Protection Exception" occurred in kernel mode! Error code: 0x0

Welcome to Kernel Debugging Land...
Thread 503 "BUrlProtocol.HTTP" running on CPU 0
stack trace for thread 503 "BUrlProtocol.HTTP"
    kernel stack: 0xffffffff88901000 to 0xffffffff88906000
      user stack: 0x00007fdb34a0d000 to 0x00007fdb34a4d000
frame                       caller             <image>:function + offset
 0 ffffffff88905438 (+  24) ffffffff8013fccc   <kernel_x86_64> arch_debug_call_with_fault_handler + 0x16
 1 ffffffff88905450 (+  80) ffffffff800a7978   <kernel_x86_64> debug_call_with_fault_handler + 0x68
 2 ffffffff889054a0 (+  96) ffffffff800a9321   <kernel_x86_64> kernel_debugger_loop(char const*, char const*, __va_list_tag*, intkdebug> 

Change History (9)

comment:1 by diver, 12 months ago

Component: - General → System/Kernel

Same KDL here.

comment:2 by diver, 3 months ago

Blocking: 15108 added

comment:3 by luroh, 3 months ago

Platform: All → x86-64

comment:4 by waddlesplash, 2 weeks ago

It also happens when running Web+, also in a BUrlProtocol thread. But I actually get another few lines of the backtrace:

PANIC: Unexpected exception "General Protection Exception" occurred in kernel mode! Error code: 0x0

Welcome to Kernel Debugging Land...
Thread 521 "BUrlProtocol.HTTP" running on CPU 0
stack trace for thread 521 "BUrlProtocol.HTTP"
    kernel stack: 0xffffffff81427000 to 0xffffffff8142c000
      user stack: 0x00007f51b4c79000 to 0x00007f51b4cb9000
frame                       caller             <image>:function + offset
 0 ffffffff8142b468 (+  24) ffffffff8014df7c   <kernel_x86_64> arch_debug_call_with_fault_handler + 0x16
 1 ffffffff8142b480 (+  80) ffffffff800ad928   <kernel_x86_64> debug_call_with_fault_handler + 0x88
 2 ffffffff8142b4d0 (+  96) ffffffff800af2b1   <kernel_x86_64> kernel_debugger_loop(char const*, char const*, __va_list_tag*, int) + 0xf1
 3 ffffffff8142b530 (+  80) ffffffff800af5ae   <kernel_x86_64> kernel_debugger_internal(char const*, char const*, __va_list_tag*, int) + 0x6e
 4 ffffffff8142b580 (+ 240) ffffffff800af917   <kernel_x86_64> panic + 0xb7
 5 ffffffff8142b670 (+ 224) ffffffff801586c8   <kernel_x86_64> x86_unexpected_exception + 0x168
 6 ffffffff8142b750 (+ 536) ffffffff8014f822   <kernel_x86_64> int_bottom + 0x56
kernel iframe at 0xffffffff8142b968 (end = 0xffffffff8142ba30)
 rax 0xffffffff8142bb20    rbx 0xe3e53d9b9ff0025d    rcx 0x0
 rdx 0x10                  rsi 0xe3e53d9b9ff0025d    rdi 0xffffffff8142bb20
 rbp 0xffffffff8142ba50     r8 0xffffffff87250b20     r9 0x8f76a49097a6ffc3
 r10 0x784b8d21b5b4a9e2    r11 0x84ae96afe2bba       r12 0xffffffff8142bb88
 r13 0xffffffff8142bb80    r14 0xe3e53d9b9ff0025d    r15 0x0
 rip 0xffffffff8016d800    rsp 0xffffffff8142ba30 rflags 0x13016
 vector: 0xd, error code: 0x0
 7 ffffffff8142b968 (+ 232) ffffffff8016d800   <kernel_x86_64> memcpy + 0x50
 8 ffffffff8142ba50 (+ 112) ffffffff8012c30c   kdebug> 

And then the return address that it somehow failed to look up a symbol for is:

kdebug> ls 0xffffffff8012c30c
0xffffffff8012c30c = _ZN12_GLOBAL__N_111user_accessIZNS_20arch_cpu_user_memcpyEPvPKvmEUlvE_EEbT_ + 0xac (kernel_x86_64)

It's odd that the stack trace couldn't get that; is the symbol name too long or something?

The address it's trying to copy from (0xe3e53d9b9ff0025d, in rsi) is clearly junk, and since it isn't in canonical form, the access raises a general protection exception rather than a page fault, so the fault handler can't catch it.
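
To make "canonical form" concrete, here is a minimal userland sketch, assuming 48-bit virtual addresses (4-level paging). The function name is made up for illustration and is not a kernel API; the addresses in the checks are taken from the register dump above.

```cpp
#include <cstdint>

// Assumption: 48 implemented virtual address bits. An address is canonical
// when bits 63..47 are all zero or all one (bit 47 sign-extended upwards).
// is_canonical_48 is a hypothetical helper, not part of the Haiku kernel.
constexpr bool is_canonical_48(uint64_t address)
{
	// Shifting right by 47 leaves exactly the 17 bits that must agree.
	return (address >> 47) == 0 || (address >> 47) == 0x1ffff;
}

// rsi from the kernel iframe: the junk source pointer.
static_assert(!is_canonical_48(0xe3e53d9b9ff0025dULL), "junk pointer");
// rdi from the kernel iframe: a kernel stack address.
static_assert(is_canonical_48(0xffffffff8142bb20ULL), "kernel address");
```

The upper bits of 0xe3e53d9b9ff0025d are neither all zero nor all one, so the CPU rejects the access before any page-table lookup happens.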

Is the profiler somehow not restoring state properly on x86_64? But then why does BUrlProtocol.HTTP seem to be the only thing that can trigger this?

comment:5 by waddlesplash, 2 weeks ago

Aha! The problem is that the demanglers cause a page fault when trying to demangle it. With "sc -d" to disable demangling:

stack trace for thread 521 "BUrlProtocol.HTTP"
    kernel stack: 0xffffffff81427000 to 0xffffffff8142c000
      user stack: 0x00007f51b4c79000 to 0x00007f51b4cb9000
frame                       caller             <image>:function + offset
 0 ffffffff8142b1a8 (+  32) ffffffff800b0859   <kernel_x86_64> _ZL25invoke_command_trampolinePv + 0x19
 1 ffffffff8142b1c8 (+  24) ffffffff8014df7c   <kernel_x86_64> arch_debug_call_with_fault_handler + 0x16
 2 ffffffff8142b1e0 (+  80) ffffffff800ad928   <kernel_x86_64> debug_call_with_fault_handler + 0x88
 3 ffffffff8142b230 (+  96) ffffffff800b0adf   <kernel_x86_64> invoke_debugger_command + 0xef
 4 ffffffff8142b290 (+  64) ffffffff800b0c59   <kernel_x86_64> _ZL19invoke_pipe_segmentP21debugger_command_pipeiPc + 0xf9
 5 ffffffff8142b2d0 (+  80) ffffffff800b0d6c   <kernel_x86_64> invoke_debugger_command_pipe + 0xac
 6 ffffffff8142b320 (+  96) ffffffff800b59f8   <kernel_x86_64> _ZN16ExpressionParser17_ParseCommandPipeERi + 0x118
 7 ffffffff8142b380 (+  96) ffffffff800bc6b3   <kernel_x86_64> _ZN16ExpressionParser15EvaluateCommandEPKcRi + 0xd83
 8 ffffffff8142b3e0 (+ 240) ffffffff800bec5c   <kernel_x86_64> evaluate_debug_command + 0x11c
 9 ffffffff8142b4d0 (+  96) ffffffff800af370   <kernel_x86_64> _ZL20kernel_debugger_loopPKcS0_P13__va_list_tagi + 0x1b0
10 ffffffff8142b530 (+  80) ffffffff800af5ae   <kernel_x86_64> _ZL24kernel_debugger_internalPKcS0_P13__va_list_tagi + 0x6e
11 ffffffff8142b580 (+ 240) ffffffff800af917   <kernel_x86_64> panic + 0xb7
12 ffffffff8142b670 (+ 224) ffffffff801586c8   <kernel_x86_64> x86_unexpected_exception + 0x168
13 ffffffff8142b750 (+ 536) ffffffff8014f822   <kernel_x86_64> int_bottom + 0x56
kernel iframe at 0xffffffff8142b968 (end = 0xffffffff8142ba30)
 rax 0xffffffff8142bb20    rbx 0xe3e53d9b9ff0025d    rcx 0x0
 rdx 0x10                  rsi 0xe3e53d9b9ff0025d    rdi 0xffffffff8142bb20
 rbp 0xffffffff8142ba50     r8 0xffffffff87250b20     r9 0x8f76a49097a6ffc3
 r10 0x784b8d21b5b4a9e2    r11 0x84ae96afe2bba       r12 0xffffffff8142bb88
 r13 0xffffffff8142bb80    r14 0xe3e53d9b9ff0025d    r15 0x0
 rip 0xffffffff8016d800    rsp 0xffffffff8142ba30 rflags 0x13016
 vector: 0xd, error code: 0x0
14 ffffffff8142b968 (+ 232) ffffffff8016d800   <kernel_x86_64> memcpy + 0x50
15 ffffffff8142ba50 (+ 112) ffffffff8012c30c   <kernel_x86_64> _ZN12_GLOBAL__N_111user_accessIZNS_20arch_cpu_user_memcpyEPvPKvmEUlvE_EEbT_ + 0xac
16 ffffffff8142bac0 (+  80) ffffffff8013480c   <kernel_x86_64> user_memcpy + 0x2c
17 ffffffff8142bb10 (+  64) ffffffff80155f0b   <kernel_x86_64> _ZL26get_next_frame_no_debuggermPmS_bPN7BKernel6ThreadE + 0x3b
18 ffffffff8142bb50 (+ 112) ffffffff80157022   <kernel_x86_64> arch_debug_get_stack_trace + 0x92
19 ffffffff8142bbc0 (+  96) ffffffff800c458b   <kernel_x86_64> _ZN14SystemProfiler9_DoSampleEv + 0x5b
20 ffffffff8142bc20 (+  32) ffffffff800c46a6   <kernel_x86_64> _ZN14SystemProfiler15_ProfilingEventEP5timer + 0x16
21 ffffffff8142bc40 (+  96) ffffffff8008bab4   <kernel_x86_64> timer_interrupt + 0xd4
22 ffffffff8142bca0 (+  96) ffffffff8005fbf9   <kernel_x86_64> int_io_interrupt_handler + 0xb9
23 ffffffff8142bd00 (+  32) ffffffff801580d9   <kernel_x86_64> x86_hardware_interrupt + 0xd9
24 ffffffff8142bd20 (+ 536) ffffffff8014f8fd   <kernel_x86_64> int_bottom_user + 0xb2
user iframe at 0xffffffff8142bf38 (end = 0xffffffff8142c000)
 rax 0xda25b9c6d8dfc115    rbx 0x3b853972a5ae287     rcx 0x7
 rdx 0x84ae96afe2bba       rsi 0x6468b782aa1f92a8    rdi 0x7f51b4cb6330
 rbp 0x1cc4f60              r8 0xd30562cc1de268f8     r9 0x8f76a49097a6ffc3
 r10 0x784b8d21b5b4a9e2    r11 0x84ae96afe2bba       r12 0x322d75af7942e4c7
 r13 0x3356323bbbc60762    r14 0x4eae0517e1a116b8    r15 0x80b9c88ad2c78c4
 rip 0xf2b6db7ea7          rsp 0x7f51b4cb61f8     rflags 0x13202
 vector: 0xfb, error code: 0x0
25 ffffffff8142bf38 (+2156498984) 000000f2b6db7ea7   <libcrypto.so.1.0.0> bn_power5 (nearest) + 0x7a7
26 0000000001cc4f60 (+   0) e4109b192ab62957   
e3e53d9b9ff0025d -- read fault
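
For reading the mangled names in a trace like the one above outside of KDL, a small userland sketch using the GCC/Clang C++ ABI demangler may help. The wrapper name `demangle` is made up for this example; `abi::__cxa_demangle` is the real library call, and this is unrelated to the in-kernel demangler that faulted.

```cpp
#include <cxxabi.h>

#include <cstdlib>
#include <cstring>
#include <string>

// Hypothetical wrapper: returns the demangled form of an Itanium-ABI symbol,
// or the input unchanged if it is not mangled or cannot be demangled.
std::string demangle(const char* symbol)
{
	// Plain C symbols such as "panic" or "memcpy" carry no "_Z" prefix.
	if (std::strncmp(symbol, "_Z", 2) != 0)
		return symbol;

	int status = 0;
	// With a null output buffer the result is malloc()ed and must be freed.
	char* result = abi::__cxa_demangle(symbol, nullptr, nullptr, &status);
	if (status != 0 || result == nullptr)
		return symbol;

	std::string text(result);
	std::free(result);
	return text;
}
```

Calling `demangle("_ZN14SystemProfiler9_DoSampleEv")` yields `SystemProfiler::_DoSample()`, while unmangled names such as `panic` pass through untouched.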

comment:6 by waddlesplash, 2 weeks ago

Cc: korli mmlr added

So, it appears libcrypto has hand-written assembly that does all kinds of fun stuff to the registers and makes them invalid, which is why we are trying to read a garbage pointer.

CC'ing korli and mmlr. The address is clearly in non-canonical form; I guess we should check for this in user_memcpy and just bail immediately if it is?

comment:7 by waddlesplash, 2 weeks ago

Alternatively, we could modify this code to call the fault handler even for GPFs: https://xref.plausible.coop/source/xref/haiku/src/system/kernel/arch/x86/64/descriptors.cpp#349 I don't know the implications of that, however, or whether we should take that route.

comment:8 by waddlesplash, 2 weeks ago

Actually, IS_USER_ADDRESS already checks for canonical form because it looks for things < USER_TOP. Then what we really should do is verify here that the address specified is really a user one.
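
A rough sketch of that idea, not the actual change: reject any frame pointer outside the user range before attempting the copy. The constants are assumptions modeled on the x86_64 userland layout, and `is_user_address` and `can_read_user_frame` are hypothetical stand-ins rather than kernel functions.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed bounds of the userland address range; verify against the real
// kernel headers before relying on these values.
constexpr uint64_t kUserBase = 0x100000;
constexpr uint64_t kUserTop = 0x00007fffffffffffULL;

// Frame-pointer-based stack frame: saved rbp followed by the return address.
struct StackFrame {
	uint64_t previous;
	uint64_t returnAddress;
};

// Stand-in for IS_USER_ADDRESS. Everything above kUserTop is rejected, which
// covers kernel addresses and every non-canonical address.
constexpr bool is_user_address(uint64_t address)
{
	return address >= kUserBase && address <= kUserTop;
}

// Hypothetical guard: true only if the whole frame lies in the user range.
// A stack walker would call this before user_memcpy() and stop on false.
constexpr bool can_read_user_frame(uint64_t framePointer)
{
	return is_user_address(framePointer)
		&& framePointer <= kUserTop - (sizeof(StackFrame) - 1);
}

// rbp from the user iframe passes; the junk value read from that frame fails.
static_assert(can_read_user_frame(0x1cc4f60), "user frame pointer");
static_assert(!can_read_user_frame(0xe3e53d9b9ff0025dULL), "junk pointer");
```

With a guard like this, the walk in the trace above would end after frame 26 instead of handing 0xe3e53d9b9ff0025d to the copy routine.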

comment:9 by waddlesplash, 2 weeks ago

Resolution: fixed
Status: new → closed

Done in hrev53459.
