Opened 3 years ago

Closed 2 months ago

Last modified 2 months ago

#13927 closed bug (fixed)

gpgme KDL's when test are enabled

Reported by: Begasus Owned by: axeld
Priority: normal Milestone: R1/beta2
Component: Network & Internet/TCP Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

When trying to update to the latest version for gpgme I ran into KDL's (reproducible), I initially filed an issue it on haikuports and it seems legit at that time, updating the system with full-sync removed the KDL from the build, but then the problem was still there when after the build the filesystem couldn't be unmounted, I could kill the instance with PC and the got back out of the unmounting problem. After that I try'd to build with the option "--disable-gpg-test" and then the build was ok, but I couldn't run the test with "hp --test *" Then I retry'd the build with the test enabled, knowing the problem with unmount. But when trying to run the test again inside haikuporter it send me back into KDL, more info and links at: https://github.com/haikuports/haikuports/issues/2040

Attachments (1)

gpg_kdl.txt (149.8 KB ) - added by Begasus 3 years ago.
serial output from VirtualBox on the last attempt …

Download all attachments as: .zip

Change History (23)

by Begasus, 3 years ago

Attachment: gpg_kdl.txt added

serial output from VirtualBox on the last attempt ...

comment:1 by Begasus, 3 years ago

patch: 01

comment:2 by diver, 3 years ago

syslog doesn't seem to contain KDL output.

comment:3 by pulkomandy, 3 years ago

PANIC: vm_page_fault: unhandled page fault in kernel space at 0xdeadbefb, ip 0x81251e16
 
Welcome to Kernel Debugging Land...
Thread 249 "loop consumer" running on CPU 0
stack trace for thread 249 "loop consumer"
    kernel stack: 0x81174000 to 0x81178000
frame               caller     <image>:function + offset
 0 81177a34 (+  32) 8014f416   <kernel_x86> arch_debug_stack_trace + 0x12
 1 81177a54 (+  16) 800a9fbf   <kernel_x86> stack_trace_trampoline(NULL) + 0x0b
 2 81177a64 (+  12) 801411ce   <kernel_x86> arch_debug_call_with_fault_handler + 0x1b
 3 81177a70 (+  48) 800abae6   <kernel_x86> debug_call_with_fault_handler + 0x5a
 4 81177aa0 (+  64) 800aa1db   <kernel_x86> kernel_debugger_loop([34m0x80193777[0m [36m"PANIC: "[0m, [34m0x801aa440[0m [36m"vm_page_fault: unhandled page fault in kernel space at 0x%lx, ip 0x%lx
"[0m, [34m0x81177b4c[0m [36m"\FB\BE\AD\DE
%\81\93[\80@V\9F\82`
G\DD,|\81D\AA\8D\82\B2\80D\AA\8D\82"[0m, int32: [34m0[0m) + 0x217
 5 81177ae0 (+  48) 800aa557   <kernel_x86> kernel_debugger_internal([34m0x80193777[0m [36m"PANIC: "[0m, [34m0x801aa440[0m [36m"vm_page_fault: unhandled page fault in kernel space at 0x%lx, ip 0x%lx
"[0m, [34m0x81177b4c[0m [36m"\FB\BE\AD\DE
%\81\93[\80@V\9F\82`
G\DD,|\81D\AA\8D\82\B2\80D\AA\8D\82"[0m, int32: [34m0[0m) + 0x53
 6 81177b10 (+  48) 800abe72   <kernel_x86> panic + 0x3a
 7 81177b40 (+ 144) 80125c9d   <kernel_x86> vm_page_fault + 0x145
 8 81177bd0 (+  80) 80150c7f   <kernel_x86> x86_page_fault_exception + 0x17f
 9 81177c20 (+  12) 80143b8c   <kernel_x86> int_bottom + 0x3c
kernel iframe at 0x81177c2c (end = 0x81177c7c)
 eax 0xdeadbeef    ebx 0x81256178     ecx 0xdd458354  edx 0x4
 esi 0xdd471c60    edi 0xdd471c6c     ebp 0x81177cb4  esp 0x81177c60
 eip 0x81251e16 eflags 0x210286  
 vector: 0xe, error code: 0x0
10 81177c2c (+ 136) 81251e16   </boot/system/add-ons/kernel/network/protocols/tcp> ConnectionHashDefinition<[32m0xd259674c[0m>::Compare(ConnectionHashDefinition: [34m0x81177cec[0m) + 0x22
11 81177cb4 (+  64) 812522da   </boot/system/add-ons/kernel/network/protocols/tcp> EndpointManager<[32m0xd2596720[0m>::_LookupConnection(sockaddr*: [34m0xdd458354[0m) + 0x66
12 81177cf4 (+ 336) 81252d23   </boot/system/add-ons/kernel/network/protocols/tcp> EndpointManager<[32m0xd2596720[0m>::FindConnection(sockaddr*: [34m0xdd458354[0m) + 0xc7
13 81177e44 (+ 160) 8124c9e7   </boot/system/add-ons/kernel/network/protocols/tcp> tcp_receive_data(net_buffer*: [34m0xdd458298[0m) + 0x31b
14 81177ee4 (+ 128) 8115b66a   </boot/system/add-ons/kernel/network/protocols/ipv4> ipv4_receive_data(net_buffer*: [34m0xdd458298[0m) + 0x36a
15 81177f64 (+  80) 8113c9ce   </boot/system/add-ons/kernel/network/stack> device_consumer_thread([34m0xd203d000[0m) + 0x6e
16 81177fb4 (+  48) 80089207   <kernel_x86> common_thread_entry([34m0x81177ff0[0m) + 0x3b
kdebug> co
read access attempted on read-protected area 0x57 at 0xdeadb000
vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0xdeadbefb, ip 0x81251e16, write 0, user 0, thread 0xf9
PANIC: vm_page_fault: unhandled page fault in kernel space at 0xdeadbefb, ip 0x81251e16
 
Welcome to Kernel Debugging Land...
Thread 249 "loop consumer" running on CPU 0
stack trace for thread 249 "loop consumer"
    kernel stack: 0x81174000 to 0x81178000
frame               caller     <image>:function + offset
 0 81177a34 (+  32) 8014f416   <kernel_x86> arch_debug_stack_trace + 0x12
 1 81177a54 (+  16) 800a9fbf   <kernel_x86> stack_trace_trampoline(NULL) + 0x0b
 2 81177a64 (+  12) 801411ce   <kernel_x86> arch_debug_call_with_fault_handler + 0x1b
 3 81177a70 (+  48) 800abae6   <kernel_x86> debug_call_with_fault_handler + 0x5a
 4 81177aa0 (+  64) 800aa1db   <kernel_x86> kernel_debugger_loop([34m0x80193777[0m [36m"PANIC: "[0m, [34m0x801aa440[0m [36m"vm_page_fault: unhandled page fault in kernel space at 0x%lx, ip 0x%lx
"[0m, [34m0x81177b4c[0m [36m"\FB\BE\AD\DE
%\81\93[\80@V\9F\82`
G\DD,|\81D\AA\8D\82\B2\80D\AA\8D\82"[0m, int32: [34m0[0m) + 0x217
 5 81177ae0 (+  48) 800aa557   <kernel_x86> kernel_debugger_internal([34m0x80193777[0m [36m"PANIC: "[0m, [34m0x801aa440[0m [36m"vm_page_fault: unhandled page fault in kernel space at 0x%lx, ip 0x%lx
"[0m, [34m0x81177b4c[0m [36m"\FB\BE\AD\DE
%\81\93[\80@V\9F\82`
G\DD,|\81D\AA\8D\82\B2\80D\AA\8D\82"[0m, int32: [34m0[0m) + 0x53
 6 81177b10 (+  48) 800abe72   <kernel_x86> panic + 0x3a
 7 81177b40 (+ 144) 80125c9d   <kernel_x86> vm_page_fault + 0x145
 8 81177bd0 (+  80) 80150c7f   <kernel_x86> x86_page_fault_exception + 0x17f
 9 81177c20 (+  12) 80143b8c   <kernel_x86> int_bottom + 0x3c
kernel iframe at 0x81177c2c (end = 0x81177c7c)
 eax 0xdeadbeef    ebx 0x81256178     ecx 0xdd458354  edx 0x4
 esi 0xdd471c60    edi 0xdd471c6c     ebp 0x81177cb4  esp 0x81177c60
 eip 0x81251e16 eflags 0x210286  
 vector: 0xe, error code: 0x0
10 81177c2c (+ 136) 81251e16   </boot/system/add-ons/kernel/network/protocols/tcp> ConnectionHashDefinition<[32m0xd259674c[0m>::Compare(ConnectionHashDefinition: [34m0x81177cec[0m) + 0x22
11 81177cb4 (+  64) 812522da   </boot/system/add-ons/kernel/network/protocols/tcp> EndpointManager<[32m0xd2596720[0m>::_LookupConnection(sockaddr*: [34m0xdd458354[0m) + 0x66
12 81177cf4 (+ 336) 81252d23   </boot/system/add-ons/kernel/network/protocols/tcp> EndpointManager<[32m0xd2596720[0m>::FindConnection(sockaddr*: [34m0xdd458354[0m) + 0xc7
13 81177e44 (+ 160) 8124c9e7   </boot/system/add-ons/kernel/network/protocols/tcp> tcp_receive_data(net_buffer*: [34m0xdd458298[0m) + 0x31b
14 81177ee4 (+ 128) 8115b66a   </boot/system/add-ons/kernel/network/protocols/ipv4> ipv4_receive_data(net_buffer*: [34m0xdd458298[0m) + 0x36a
15 81177f64 (+  80) 8113c9ce   </boot/system/add-ons/kernel/network/stack> device_consumer_thread([34m0xd203d000[0m) + 0x6e
16 81177fb4 (+  48) 80089207   <kernel_x86> common_thread_entry([34m0x81177ff0[0m) + 0x3b
kdebug> sc
stack trace for thread 249 "loop consumer"
    kernel stack: 0x81174000 to 0x81178000
frame               caller     <image>:function + offset
 0 81177824 (+  48) 800acc30   <kernel_x86> invoke_command_trampoline([34m0x811778bc[0m) + 0x1c
 1 81177854 (+  12) 801411ce   <kernel_x86> arch_debug_call_with_fault_handler + 0x1b
 2 81177860 (+  48) 800abae6   <kernel_x86> debug_call_with_fault_handler + 0x5a
 3 81177890 (+  64) 800ace89   <kernel_x86> invoke_debugger_command + 0xb9
 4 811778d0 (+  64) 800accb5   <kernel_x86> invoke_pipe_segment(debugger_command_pipe*: [34m0xcce38030[0m, int32: [34m0[0m, [34m0x0[0m [31m"<NULL>"[0m) + 0x79
 5 81177910 (+  64) 800acff4   <kernel_x86> invoke_debugger_command_pipe + 0x9c
 6 81177950 (+  48) 800aef34   <kernel_x86> ExpressionParser<[32m0x81177a00[0m>::_ParseCommandPipe([34m0x811779fc[0m) + 0x234
 7 81177980 (+  64) 800ae154   <kernel_x86> ExpressionParser<[32m0x81177a00[0m>::EvaluateCommand([34m0x801c52c0[0m [36m"sc"[0m, [34m0x811779fc[0m) + 0x2d0
 8 811779c0 (+ 224) 800b0580   <kernel_x86> evaluate_debug_command + 0x80
 9 81177aa0 (+  64) 800aa2ea   <kernel_x86> kernel_debugger_loop([34m0x80193777[0m [36m"PANIC: "[0m, [34m0x801aa440[0m [36m"vm_page_fault: unhandled page fault in kernel space at 0x%lx, ip 0x%lx
"[0m, [34m0x81177b4c[0m [36m"\FB\BE\AD\DE
%\81\93[\80@V\9F\82`
G\DD,|\81D\AA\8D\82\B2\80D\AA\8D\82"[0m, int32: [34m0[0m) + 0x326
10 81177ae0 (+  48) 800aa557   <kernel_x86> kernel_debugger_internal([34m0x80193777[0m [36m"PANIC: "[0m, [34m0x801aa440[0m [36m"vm_page_fault: unhandled page fault in kernel space at 0x%lx, ip 0x%lx
"[0m, [34m0x81177b4c[0m [36m"\FB\BE\AD\DE
%\81\93[\80@V\9F\82`
G\DD,|\81D\AA\8D\82\B2\80D\AA\8D\82"[0m, int32: [34m0[0m) + 0x53
11 81177b10 (+  48) 800abe72   <kernel_x86> panic + 0x3a
12 81177b40 (+ 144) 80125c9d   <kernel_x86> vm_page_fault + 0x145
13 81177bd0 (+  80) 80150c7f   <kernel_x86> x86_page_fault_exception + 0x17f
14 81177c20 (+  12) 80143b8c   <kernel_x86> int_bottom + 0x3c
kernel iframe at 0x81177c2c (end = 0x81177c7c)
 eax 0xdeadbeef    ebx 0x81256178     ecx 0xdd458354  edx 0x4
 esi 0xdd471c60    edi 0xdd471c6c     ebp 0x81177cb4  esp 0x81177c60
 eip 0x81251e16 eflags 0x210286  
 vector: 0xe, error code: 0x0
15 81177c2c (+ 136) 81251e16   </boot/system/add-ons/kernel/network/protocols/tcp> ConnectionHashDefinition<[32m0xd259674c[0m>::Compare(ConnectionHashDefinition: [34m0x81177cec[0m) + 0x22
16 81177cb4 (+  64) 812522da   </boot/system/add-ons/kernel/network/protocols/tcp> EndpointManager<[32m0xd2596720[0m>::_LookupConnection(sockaddr*: [34m0xdd458354[0m) + 0x66
17 81177cf4 (+ 336) 81252d23   </boot/system/add-ons/kernel/network/protocols/tcp> EndpointManager<[32m0xd2596720[0m>::FindConnection(sockaddr*: [34m0xdd458354[0m) + 0xc7
18 81177e44 (+ 160) 8124c9e7   </boot/system/add-ons/kernel/network/protocols/tcp> tcp_receive_data(net_buffer*: [34m0xdd458298[0m) + 0x31b
19 81177ee4 (+ 128) 8115b66a   </boot/system/add-ons/kernel/network/protocols/ipv4> ipv4_receive_data(net_buffer*: [34m0xdd458298[0m) + 0x36a
20 81177f64 (+  80) 8113c9ce   </boot/system/add-ons/kernel/network/stack> device_consumer_thread([34m0xd203d000[0m) + 0x6e
21 81177fb4 (+  48) 80089207   <kernel_x86> common_thread_entry([34m0x81177ff0[0m) + 0x3b
kdebug> co
read access attempted on read-protected area 0x57 at 0xdeadb000
vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0xdeadbefb, ip 0x81251e16, write 0, user 0, thread 0xf9
PANIC: vm_page_fault: unhandled page fault in kernel space at 0xdeadbefb, ip 0x81251e16
 
Welcome to Kernel Debugging Land...
Thread 249 "loop consumer" running on CPU 1
stack trace for thread 249 "loop consumer"
    kernel stack: 0x81174000 to 0x81178000
frame               caller     <image>:function + offset
 0 81177a34 (+  32) 8014f416   <kernel_x86> arch_debug_stack_trace + 0x12
 1 81177a54 (+  16) 800a9fbf   <kernel_x86> stack_trace_trampoline(NULL) + 0x0b
 2 81177a64 (+  12) 801411ce   <kernel_x86> arch_debug_call_with_fault_handler + 0x1b
 3 81177a70 (+  48) 800abae6   <kernel_x86> debug_call_with_fault_handler + 0x5a
 4 81177aa0 (+  64) 800aa1db   <kernel_x86> kernel_debugger_loop([34m0x80193777[0m [36m"PANIC: "[0m, [34m0x801aa440[0m [36m"vm_page_fault: unhandled page fault in kernel space at 0x%lx, ip 0x%lx
"[0m, [34m0x81177b4c[0m [36m"\FB\BE\AD\DE
%\81\93[\80@V\9F\82`
G\DD,|\81D\AA\8D\82\B2\80D\AA\8D\82"[0m, int32: [34m1[0m) + 0x217
 5 81177ae0 (+  48) 800aa557   <kernel_x86> kernel_debugger_internal([34m0x80193777[0m [36m"PANIC: "[0m, [34m0x801aa440[0m [36m"vm_page_fault: unhandled page fault in kernel space at 0x%lx, ip 0x%lx
"[0m, [34m0x81177b4c[0m [36m"\FB\BE\AD\DE
%\81\93[\80@V\9F\82`
G\DD,|\81D\AA\8D\82\B2\80D\AA\8D\82"[0m, int32: [34m1[0m) + 0x53
 6 81177b10 (+  48) 800abe72   <kernel_x86> panic + 0x3a
 7 81177b40 (+ 144) 80125c9d   <kernel_x86> vm_page_fault + 0x145
 8 81177bd0 (+  80) 80150c7f   <kernel_x86> x86_page_fault_exception + 0x17f
 9 81177c20 (+  12) 80143b8c   <kernel_x86> int_bottom + 0x3c
kernel iframe at 0x81177c2c (end = 0x81177c7c)
 eax 0xdeadbeef    ebx 0x81256178     ecx 0xdd458354  edx 0x4
 esi 0xdd471c60    edi 0xdd471c6c     ebp 0x81177cb4  esp 0x81177c60
 eip 0x81251e16 eflags 0x210286  
 vector: 0xe, error code: 0x0
10 81177c2c (+ 136) 81251e16   </boot/system/add-ons/kernel/network/protocols/tcp> ConnectionHashDefinition<[32m0xd259674c[0m>::Compare(ConnectionHashDefinition: [34m0x81177cec[0m) + 0x22
11 81177cb4 (+  64) 812522da   </boot/system/add-ons/kernel/network/protocols/tcp> EndpointManager<[32m0xd2596720[0m>::_LookupConnection(sockaddr*: [34m0xdd458354[0m) + 0x66
12 81177cf4 (+ 336) 81252d23   </boot/system/add-ons/kernel/network/protocols/tcp> EndpointManager<[32m0xd2596720[0m>::FindConnection(sockaddr*: [34m0xdd458354[0m) + 0xc7
13 81177e44 (+ 160) 8124c9e7   </boot/system/add-ons/kernel/network/protocols/tcp> tcp_receive_data(net_buffer*: [34m0xdd458298[0m) + 0x31b
14 81177ee4 (+ 128) 8115b66a   </boot/system/add-ons/kernel/network/protocols/ipv4> ipv4_receive_data(net_buffer*: [34m0xdd458298[0m) + 0x36a
15 81177f64 (+  80) 8113c9ce   </boot/system/add-ons/kernel/network/stack> device_consumer_thread([34m0xd203d000[0m) + 0x6e
16 81177fb4 (+  48) 80089207   <kernel_x86> common_thread_entry([34m0x81177ff0[0m) + 0x3b
kdebug> shutdown
kdebug> co
read access attempted on read-protected area 0x57 at 0xdeadb000
vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0xdeadbefb, ip 0x81251e16, write 0, user 0, thread 0xf9
PANIC: vm_page_fault: unhandled page fault in kernel space at 0xdeadbefb, ip 0x81251e16
 
Welcome to Kernel Debugging Land...
Thread 249 "loop consumer" running on CPU 1
stack trace for thread 249 "loop consumer"
    kernel stack: 0x81174000 to 0x81178000
frame               caller     <image>:function + offset
 0 81177a34 (+  32) 8014f416   <kernel_x86> arch_debug_stack_trace + 0x12
 1 81177a54 (+  16) 800a9fbf   <kernel_x86> stack_trace_trampoline(NULL) + 0x0b
 2 81177a64 (+  12) 801411ce   <kernel_x86> arch_debug_call_with_fault_handler + 0x1b
 3 81177a70 (+  48) 800abae6   <kernel_x86> debug_call_with_fault_handler + 0x5a
 4 81177aa0 (+  64) 800aa1db   <kernel_x86> kernel_debugger_loop([34m0x80193777[0m [36m"PANIC: "[0m, [34m0x801aa440[0m [36m"vm_page_fault: unhandled page fault in kernel space at 0x%lx, ip 0x%lx
"[0m, [34m0x81177b4c[0m [36m"\FB\BE\AD\DE
%\81\93[\80@V\9F\82`
G\DD,|\81D\AA\8D\82\B2\80D\AA\8D\82"[0m, int32: [34m1[0m) + 0x217
 5 81177ae0 (+  48) 800aa557   <kernel_x86> kernel_debugger_internal([34m0x80193777[0m [36m"PANIC: "[0m, [34m0x801aa440[0m [36m"vm_page_fault: unhandled page fault in kernel space at 0x%lx, ip 0x%lx
"[0m, [34m0x81177b4c[0m [36m"\FB\BE\AD\DE
%\81\93[\80@V\9F\82`
G\DD,|\81D\AA\8D\82\B2\80D\AA\8D\82"[0m, int32: [34m1[0m) + 0x53
 6 81177b10 (+  48) 800abe72   <kernel_x86> panic + 0x3a
 7 81177b40 (+ 144) 80125c9d   <kernel_x86> vm_page_fault + 0x145
 8 81177bd0 (+  80) 80150c7f   <kernel_x86> x86_page_fault_exception + 0x17f
 9 81177c20 (+  12) 80143b8c   <kernel_x86> int_bottom + 0x3c
kernel iframe at 0x81177c2c (end = 0x81177c7c)
 eax 0xdeadbeef    ebx 0x81256178     ecx 0xdd458354  edx 0x4
 esi 0xdd471c60    edi 0xdd471c6c     ebp 0x81177cb4  esp 0x81177c60
 eip 0x81251e16 eflags 0x210286  
 vector: 0xe, error code: 0x0
10 81177c2c (+ 136) 81251e16   </boot/system/add-ons/kernel/network/protocols/tcp> ConnectionHashDefinition<[32m0xd259674c[0m>::Compare(ConnectionHashDefinition: [34m0x81177cec[0m) + 0x22
11 81177cb4 (+  64) 812522da   </boot/system/add-ons/kernel/network/protocols/tcp> EndpointManager<[32m0xd2596720[0m>::_LookupConnection(sockaddr*: [34m0xdd458354[0m) + 0x66
12 81177cf4 (+ 336) 81252d23   </boot/system/add-ons/kernel/network/protocols/tcp> EndpointManager<[32m0xd2596720[0m>::FindConnection(sockaddr*: [34m0xdd458354[0m) + 0xc7
13 81177e44 (+ 160) 8124c9e7   </boot/system/add-ons/kernel/network/protocols/tcp> tcp_receive_data(net_buffer*: [34m0xdd458298[0m) + 0x31b
14 81177ee4 (+ 128) 8115b66a   </boot/system/add-ons/kernel/network/protocols/ipv4> ipv4_receive_data(net_buffer*: [34m0xdd458298[0m) + 0x36a
15 81177f64 (+  80) 8113c9ce   </boot/system/add-ons/kernel/network/stack> device_consumer_thread([34m0xd203d000[0m) + 0x6e
16 81177fb4 (+  48) 80089207   <kernel_x86> common_thread_entry([34m0x81177ff0[0m) + 0x3b
kdebug> shutdown =[1D [1D=n[1D [1D[1D [1D+[1D [1D/[1D [1D>[1D [1D.[1D [1D,<MN[1D [1D[1D [1D[1D [1D[1D [1D0-[1D [1D[1D [1D-n
kdebug>
Last edited 3 years ago by pulkomandy (previous) (diff)

comment:4 by pulkomandy, 3 years ago

Possibly similar to #12567 (which affected UDP) in TCP code.

comment:5 by diver, 3 years ago

Component: - GeneralNetwork & Internet/TCP
Owner: changed from nobody to axeld

comment:6 by mmlr, 3 years ago

FWIW I am unable to reproduce this. I also don't really see the link to #12567. From a quick glance at the change to resolve that one and the corresponding TCP code I don't really see how one would apply the logic there. I have not actually investigated it in depth though.

comment:7 by pulkomandy, 2 years ago

patch: 10

comment:8 by ambroff, 6 months ago

I was using gnupg a couple of days ago and happened to randomly reproduce this exact problem. After reading through the code and adding lots of extra tracing to understand what was going on I have an explanation for what is happening.

The cool thing is that I can reproduce this 100% of the time with this single command:

> gpg --recv-keys 066DAFCB81E42C40

Which results in the following KDL

PANIC: Unexpected exception "General Protection Exception" occurred in kernel mode! Error code: 0x0

Welcome to Kernel Debugging Land...
Thread 217 "loop consumer" running on CPU 0
stack trace for thread 217 "loop consumer"
    kernel stack: 0xffffffff81c5a000 to 0xffffffff81c5f000
frame                       caller             <image>:function + offset
 0 ffffffff81c5e608 (+  24) ffffffff801567dc   <kernel_x86_64> arch_debug_call_with_fault_handler + 0x16
 1 ffffffff81c5e620 (+  80) ffffffff800b0218   <kernel_x86_64> debug_call_with_fault_handler + 0x88
 2 ffffffff81c5e670 (+  96) ffffffff800b1ba1   <kernel_x86_64> kernel_debugger_loop(char const*, char const*, __va_list_tag*, int) + 0xf1
 3 ffffffff81c5e6d0 (+  80) ffffffff800b1e9e   <kernel_x86_64> kernel_debugger_internal(char const*, char const*, __va_list_tag*, int) + 0x6e
 4 ffffffff81c5e720 (+ 240) ffffffff800b2207   <kernel_x86_64> panic + 0xb7
 5 ffffffff81c5e810 (+ 224) ffffffff80160f98   <kernel_x86_64> x86_unexpected_exception + 0x168
 6 ffffffff81c5e8f0 (+ 536) ffffffff80158082   <kernel_x86_64> int_bottom + 0x56
kernel iframe at 0xffffffff81c5eb08 (end = 0xffffffff81c5ebd0)
 rax 0xdeadbeefdeadbeef    rbx 0xffffffff82ade400    rcx 0x8
 rdx 0xffffffff82ade400    rsi 0xffffffff81c5ebf0    rdi 0xdeadbeefdeadbf0f
 rbp 0xffffffff81c5ebe0     r8 0xffffffff801f9860     r9 0x5f
 r10 0xf                   r11 0xf                   r12 0xffffffff81c5ebf0
 r13 0xffffffff81c5ebf0    r14 0x1                   r15 0xffffffff928e01f0
 rip 0xffffffff8e978446    rsp 0xffffffff81c5ebd0 rflags 0x13286
 vector: 0xd, error code: 0x0
 7 ffffffff81c5eb08 (+ 216) ffffffff8e978446   </boot/system/add-ons/kernel/network/protocols/tcp> ConnectionHashDefinition::Compare const(std::pair<sockaddr const*, sockaddr const*> const&, TCPEndpoint*) + 0x16
 8 ffffffff81c5ebe0 (+  64) ffffffff8e978871   </boot/system/add-ons/kernel/network/protocols/tcp> EndpointManager::_LookupConnection(sockaddr const*, sockaddr const*) + 0x81
 9 ffffffff81c5ec20 (+ 352) ffffffff8e978e10   </boot/system/add-ons/kernel/network/protocols/tcp> EndpointManager::FindConnection(sockaddr*, sockaddr*) + 0x60
10 ffffffff81c5ed80 (+ 240) ffffffff8e971723   </boot/system/add-ons/kernel/network/protocols/tcp> tcp_receive_data(net_buffer*) + 0x333
11 ffffffff81c5ee70 (+ 208) ffffffff81c4a327   </boot/system/add-ons/kernel/network/protocols/ipv4> ipv4_receive_data(net_buffer*) + 0x5e7
12 ffffffff81c5ef40 (+ 112) ffffffff8e70f152   </boot/system/add-ons/kernel/network/stack> device_consumer_thread(void*) + 0x82
13 ffffffff81c5efb0 (+  32) ffffffff8008d1f7   <kernel_x86_64> common_thread_entry(void*) + 0x37
14 ffffffff81c5efd0 (+2117734448) ffffffff81c5efe0
kdebug> qemu-system-x86_64: terminating on signal 2

The chain of events leading up to this looks something like:

  1. A socket is opened and connect is called. This calls EndpointManager::SetConnection(), which adds the TCPEndpoint for the connection to the EndpointManager::fConnectionHash BOpenHashTable here.
KWA tcp_init_protocol()
Creating new TCPEndpoint: 0xffffffff82ade400
KWA tcp_open(0xffffffff82ade400)
KWA tcp_get_domain(0xffffffff82ade400)
KWA tcp_getsockopt(0xffffffff82ade400, -1, 1073741833, value, length)
KWA tcp_bind(0xffffffff82ade400, 0xffffffff80592dc0)
EndpointManager::Bind(0xffffffff82ade400, 0xffffffff80592dc0)
EndpointManager::BindToEphemeral(0xffffffff82ade400)
   EndpointManager::BindToEphemeral(0xffffffff82ade400) -> 0.0.0.0:40373
EndpointManager::_Bind(0xffffffff82ade400, 0xffffffff80592c60)
EndpointManager::_BindToEphemeral(0xffffffff82ade400, 0xffffffff80592dc0) bound port 40373 and is returning status No error
KWA tcp_connect(0xffffffff82ade400, 1)
EndpointManager::SetConnection(0xffffffff82ade400)
EndpointManager::_LookupConnection(0xffffffff80592c80, 0xffffffff80592e80)
  1. In the kernel debugger, the tcp_endpoints command will show the inserted endpoint.
kdebug> tcp_endpoints
-------- TCP Domain 0xffffffff928e01e0 ---------
   address                 local                  peer   recv-q   send-q        state

CONNECTION_HASH (1 entries)
0xffffffff82ade400       127.0.0.1:40373        127.0.0.1:9050        0        0       closed
kdebug> es
  1. A packet is received on this endpoint
TCP: Received packet corresponds to explicit endpoint 0xffffffff82ade400
KWA finishing tcp_receive_data: endpoint=0xffffffff82ade400, segmentAction=1, buffer=0xffffffff92c622d0
  1. For some reason, tcp_connect() is called again on the same endpoint
KWA tcp_connect(0xffffffff82ade400, 1)

This of course leads to EndpointManager::SetConnection() being called again. Just before the endpoint is inserted into fConnectionHash again, the tcp_endpoints command still returns the same output as above

kdebug> tcp_endpoints
-------- TCP Domain 0xffffffff928e01e0 ---------
   address                 local                  peer   recv-q   send-q        state

CONNECTION_HASH (1 entries)
0xffffffff82ade400       127.0.0.1:40373        127.0.0.1:9050        0        0       closed
kdebug> es

But then SetConnection() calls fConnectionHash.Insert(endpoint) for an endpoint that is already in the table

EndpointManager::SetConnection(0xffffffff82ade400)

And now there is a cycle in the linked list for the hash of that endpoint in fConnectionHash**

kdebug> tcp_endpoints
-------- TCP Domain 0xffffffff928e01e0 ---------
   address                 local                  peer   recv-q   send-q        state

CONNECTION_HASH (2 entries)
0xffffffff82ade400       127.0.0.1:40373        127.0.0.1:9150        0        0       closed
0xffffffff82ade400       127.0.0.1:40373        127.0.0.1:9150        0        0       closed
0xffffffff82ade400       127.0.0.1:40373        127.0.0.1:9150        0        0       closed
0xffffffff82ade400       127.0.0.1:40373        127.0.0.1:9150        0        0       closed
0xffffffff82ade400       127.0.0.1:40373        127.0.0.1:9150        0        0       closed
0xffffffff82ade400       127.0.0.1:40373        127.0.0.1:9150        0        0       closed
0xffffffff82ade400       127.0.0.1:40373        127.0.0.1:9150        0        0       closed
0xffffffff82ade400       127.0.0.1:40373        127.0.0.1:9150        0        0       closed
0xffffffff82ade400       127.0.0.1:40373        127.0.0.1:9150        0        0       closed
...
0xffffffff82ade400       127.0.0.1:40373        127.0.0.1:9150        0        0       closed
q
kdebug> es
  1. The socket is closed, calling the destructor of the TCPSocket object fConnectionHash contains. This calls fConnectionHash.Remove(), but there are still entries in the table that point to that object. Since this is a debug build, after the destructor is called, 0xdeadbeef is memset in that location (which is why we see that in the original crash report).
KWA tcp_close(0xffffffff82ade400)
KWA tcp_receive_data(0xffffffff92c622d0)
KWA tcp_free(0xffffffff82ade400)
  Looking for: peer 127.0.0.1:40373, local 127.0.0.1:9150
KWA tcp_uninit_protocol(0xffffffff82ade400)
EndpointManager::FindConnection(0xffffffff92c623b8, 0xffffffff92c62338)
Deleting TCPEndpoint: 0xffffffff82ade400
EndpointManager::Unbind(0xffffffff82ade400)
  1. tcp_receive_data() is called because a packet is received from that same peer, 127.0.0.1:9150. This leads to EndpointManager::_LookupConnection() being called, which calls fConnectionHash.Lookup(local_socket, remote_socket). This hashes to the same position as our linked list cycle.
tcp_receive_data(): EndpointManager::FindConnection() returned NULL for endpoint 0x0000000000000000, replying with RST
KWA TCP: Sending RST...
add_tcp_header(): buffer 0xffffffff92d6a188, flags 0x14, seq 0, ack 20779736, up 0, win 0
TCP: RST sent with status No error...
KWA tcp_receive_data(0xffffffff92d6a188)
  Looking for: peer 127.0.0.1:9150, local 127.0.0.1:40373
EndpointManager::FindConnection(0xffffffff92d6a270, 0xffffffff92d6a1f0)
EndpointManager::_LookupConnection(0xffffffff92d6a270, 0xffffffff92d6a1f0)
  1. ConnectionHashDefinition::Compare is called to compare the params of fConnectionHash.Lookup to an item in the linked list at that position, and *kaboom*, the already freed TCPEndpoint is dereferenced.
PANIC: Unexpected exception "General Protection Exception" occurred in kernel mode! Error code: 0x0

So there are potentially a few bugs here. Reguardless of why tcp_connect() ends up getting called multiple times, EndpointManager::fConnectionHash ends up getting corrupted, which I think is the main bug here. The complete log is here.

Right now I'm working on a fix and a minimal repro for a regression test.

comment:9 by ambroff, 6 months ago

By the way it seems this behavior of gpg that causes it to call connect() twice seems to be a known issue on other platforms.

https://alfredmyers.com/2019/07/23/gpg-keyserver-receive-failed-no-dirmngr/

From that link:

  1. gpg spawns dirmngr
  2. dirmngr attempt to connect to localhost port 9050 (this is attempted twice).
  3. gpg attempts to connect to a unix socket that dirmngr creates after the localhost socket connection fails.
Last edited 6 months ago by ambroff (previous) (diff)

comment:10 by ambroff, 5 months ago

I have added a new test suite for BOpenHashTable<T> in a branch. I found an additional memory-corruption bug while doing so (update: false alarm, I just got a bit confused. I don't think there is a memory corruption bug here).

BOpenHashTable has an AutoExpand non-type template parameter. If set to false, which is not the default, then upon insertion it will:

  1. Check to see if the current buffer size is > 0. If not, it will resize the buffer to the minimum size, which is 8 elements. If the user calls Init(size_t) before inserting they can set this size to a greater value.
  2. If AutoExpand is true, then the buffer will be resized if more room is needed to handle the insertion. When resized to add more capacity, the size is doubled.
  3. Insert with no bounds check.

So currently if someone uses BOpenHashTable<T, false>, and then inserts more than the fixed number of elements defined (either by the default or call to Init(size_t)), then it will write beyond the bounds of the fTable array, smashing some random memory.

BOpenHashTable<T, false> is used in quite a few places, including the ipv4, ipv6, udp and unix add-ons, IOSchedulerSimple and HashedObjectCache. Even if those aren't broken because of this they easily could be.

Last edited 5 months ago by ambroff (previous) (diff)

comment:11 by ambroff, 5 months ago

Unfortunately fixing this properly may result in a pretty large set of changes.

Insert returns status_t, but most callers don't check the result. That's kinda ok though cause we can just focus our effort in code that disabled AutoExpand, which is only a handful.

There are many code paths which call BOpenHashTable<T>::InsertUnchecked, which currently does not return an error at all, but I think that's broken. The API conflates error checking with resizing here. There could be a failure even if resizing is disabled (i.e., there is no more room in the buffer).

I'm planning to submit a patch series that looks something like:

  1. Add test suite for BOpenHashTable template with two failing tests (auto expand disabled test and duplicate insertion test)
  2. Fix the interface of BOpenHashTable so that Insert and InsertUnchecked will return B_NO_MEMORY if auto expansion is disabled and the fTable buffer is full.
  3. Add error handling for all calls to Insert or InsertUnchecked where auto expansion is disabled.

The remaining part is to address the duplicate entries issue that caused the crash in this bug report. The main problem with this API is that it implies that the same item can be inserted multiple times, since the CheckForDuplicates non-type template parameter allows you to specify whether to panic if a duplicate entry is inserted. This is flawed though. While inserting multiple distinct object which could be looked up with the same key is totally possible, inserting the same object multiple times is not because it will create a reference cycle.

  1. Would be changing the duplicate handling logic entirely.

comment:12 by ambroff, 5 months ago

Oops, nevermind about the InsertUnchecked thing. After reading more carefully I remembered that fTable size has nothing to do with the number of elements in the container since each entry is a linked list. So scratch what I said about InsertUnchecked needing to be changed to return B_NO_MEMORY.

The duplicate handling is still real though. Whew that's good, I was worried this was going to end up being a huge change.

comment:13 by ambroff, 5 months ago

I spent a bunch of time bringing BOpenHashTable under test because I was convinced that changing it to handle duplicates was the right thing to do. But now that I've read through that code carefully I don't think that's the right thing to do, since it adds some non-trivial cost for other uses of that container where it isn't necessary.

Looking back at EndpointManager::SetConnection(), I think that the main problem is here:

if (_LookupConnection(*local, peer) != NULL)
	return EADDRINUSE;

endpoint->LocalAddress().SetTo(*local);
endpoint->PeerAddress().SetTo(peer);
T(Connect(endpoint));

fConnectionHash.Insert(endpoint);
return B_OK;

_LookupConnection should return non-null here, but it isn't for some reason. I think there may be a mistake in this method that's causing it to query for an endpoint that is not quite equal to the one being inserted just below that check.

comment:14 by ambroff, 5 months ago

So this is interesting. I added some extra tracing to EndpointManager::SetConnection() and added a panic just as the duplicate TCPEndpoint is inserted.

@@ -259,31 +259,66 @@ status_t
 EndpointManager::SetConnection(TCPEndpoint* endpoint, const sockaddr* _local,
        const sockaddr* peer, const sockaddr* interfaceLocal)
 {
-       TRACE(("EndpointManager::SetConnection(%p)\n", endpoint));
+       TRACE("EndpointManager::SetConnection(%p)\n", endpoint);

        WriteLocker _(fLock);

        SocketAddressStorage local(AddressModule());
        local.SetTo(_local);

+       SocketAddressStorage remote(AddressModule());
+       remote.SetTo(peer);
+       char remote_addr[256];
+       remote.AsString(remote_addr, 256, true);
+
+       char local_addr[256];
+       local.AsString(local_addr, 256, true);
+
+       TRACE("KWA: SetConnection(%p, ...) BEFORE with local=%s and remote=%s\n",
+                 endpoint, local_addr, remote_addr);
+
        if (local.IsEmpty(false)) {
                uint16 port = local.Port();
                local.SetTo(interfaceLocal);
                local.SetPort(port);
        }

-       if (_LookupConnection(*local, peer) != NULL)
+       local.AsString(local_addr, 256, true);
+
+       TRACE("KWA: SetConnection(%p, ...) AFTER with local=%s and remote=%s\n",
+                 endpoint,
+                 local_addr,
+                 remote_addr);
+
+       TCPEndpoint* existing_entry = _LookupConnection(*local, peer);
+       TRACE("KWA: SetConnection(%p) existing entry? %s\n",
+                 endpoint, (existing_entry == NULL) ? "NO" : "YES");
+       if (existing_entry != NULL)
                return EADDRINUSE;

+       existing_entry = _LookupConnection(_local, peer);
+       TRACE("KWA: SetConnection(%p) what about this one? %s\n",
+                 endpoint, (existing_entry == NULL) ? "NO" : "YES");
+
        endpoint->LocalAddress().SetTo(*local);
        endpoint->PeerAddress().SetTo(peer);
        T(Connect(endpoint));

+       existing_entry = _LookupConnection(*local, peer);
+       TRACE("KWA: SetConnection(%p) one more check? %s\n", endpoint,
+                 (existing_entry == NULL) ? "NO" : "YES");
+
+       endpoint->LocalAddress().AsString(local_addr, 256, true);
+       endpoint->PeerAddress().AsString(remote_addr, 256, true);
+       TRACE("KWA: SetConnection(%p, ...) ONEMORE with local=%s and remote=%s\n",
+                 endpoint,
+                 local_addr,
+                 remote_addr);
+
        fConnectionHash.Insert(endpoint);
        return B_OK;

And the output is interesting

EndpointManager::BindToEphemeral(0xffffffff82ade700)
   EndpointManager::BindToEphemeral(0xffffffff82ade700) -> 0.0.0.0:40287
EndpointManager::SetConnection(0xffffffff82ade700)
KWA: SetConnection(0xffffffff82ade700, ...) BEFORE with local=0.0.0.0:40287 and remote=127.0.0.1:9050
KWA: SetConnection(0xffffffff82ade700, ...) AFTER with local=127.0.0.1:40287 and remote=127.0.0.1:9050
KWA: SetConnection(0xffffffff82ade700) existing entry? NO
KWA: SetConnection(0xffffffff82ade700) what about this one? NO
KWA: SetConnection(0xffffffff82ade700) one more check? NO
KWA: SetConnection(0xffffffff82ade700, ...) ONEMORE with local=127.0.0.1:40287 and remote=127.0.0.1:9050
TCP: no matching endpoint!
TCP: Sending RST...
TCP: Received packet corresponds to explicit endpoint 0xffffffff82ade700
EndpointManager::SetConnection(0xffffffff82ade700)
KWA: SetConnection(0xffffffff82ade700, ...) BEFORE with local=127.0.0.1:40287 and remote=127.0.0.1:9150
KWA: SetConnection(0xffffffff82ade700, ...) AFTER with local=127.0.0.1:40287 and remote=127.0.0.1:9150
KWA: SetConnection(0xffffffff82ade700) existing entry? NO
KWA: SetConnection(0xffffffff82ade700) what about this one? NO
KWA: SetConnection(0xffffffff82ade700) one more check? YES
KWA: SetConnection(0xffffffff82ade700, ...) ONEMORE with local=127.0.0.1:40287 and remote=127.0.0.1:9150
PANIC: A duplicate has been inserted!

So I get it now. An attempt is made to connect to 127.0.0.1:9050. Nothing is bound to that port so it fails, and the socket ends up in closed state.

kdebug> tcp_endpoints
-------- TCP Domain 0xffffffff86713478 ---------
   address                 local                  peer   recv-q   send-q        state
0xffffffff82ade700       127.0.0.1:40287        127.0.0.1:9150        0        0       closed
0xffffffff82adde00       10.0.2.15:40055   104.248.198.131:443       31        0    time-wait
0xffffffff82add500       10.0.2.15:40017   104.248.198.131:443       31        0    time-wait
0xffffffff82adcc00       10.0.2.15:40002   104.248.198.131:443       31        0    time-wait

Then the socket is reused to connect to another port, 127.0.0.1:9150. Now EndpointManager::SetConnection() performs a check to see if an existing endpoint exists with the given (local, peer) addresses, and won't find one. But that doesn't mean that the TCPEndpoint isn't already in fConnectionHash.

So the _LookupConnection() will not actually find a match until LocalAddress() and PeerAddress() are updated, which happens *after* the _LookupConnection() but before the fConnectionHash.Insert().

comment:15 by waddlesplash, 5 months ago

But now that I've read through that code carefully I don't think that's the right thing to do, since it adds some non-trivial cost for other uses of that container where it isn't necessary.

If the hash table doesn't support duplicate insertions properly in any case at all, then we probably should just always panic on this case so we detect it rather than corrupting memory, right?

Thanks for spending the time investigating this!!

comment:16 by ambroff, 5 months ago

Thanks for the reply waddlesplash,

If the hash table doesn't support duplicate insertions properly in any case at all, then we probably should just always panic on this case so we detect it rather than corrupting memory, right?

Yeah that's what I wanted to do at first. BOpenHashTable actually has a non-type template parameter "CheckDuplicates" which does exactly that. It defaults to false, but if you set it to true then duplicates are checked upon insertion or removal, and the presence of a duplicate will trap in the debugger (or panic in kernel mode).

So on the surface it seems like the right default is to just remove the CheckDuplicates option and always perform this duplicate check, since allowing duplicates is not possible.

The big problem with this is the design of the BOpenHashTable template, in that it isn't an open hash table at all. It's a chained hash table using the classic linked-list for each bucket approach. That means that if you modify BOpenHashTable to always perform this duplicate check, the cost of insertion goes from O(1) to O(N).

Now this duplicate problem isn't really even that big of an issue for a simpler chained hash table implementation. The problem is that this implementation uses an intrusive linked list, which is the real source of the problem here.

So when I said,

But now that I've read through that code carefully I don't think that's the right thing to do, since it adds some non-trivial cost for other uses of that container where it isn't necessary.

What I meant was that if a user of BOpenHashTable as it is implemented today knows for certain that a duplicate insertion is not possible, then the default of CheckDuplicates=false is advantageous, because it cuts down on the cost of insertion.

It is pretty sketchy though, and I think this data structure is pretty difficult to use correctly.

So there are two issues in play at this point:

As a bandaid for the tcp add-on, I think it makes sense to add a fConnectionHash.RemoveUnchecked(endpoint) right before insertion in SetConnection() to handle the case where sockets are being reused. I'm about to submit a patch that does this.

As a followup I think BOpenHashTable should have some work done, and that should probably be a separate ticket that I can follow. There are a bunch of possible approaches to making this container easier to user.

  1. Make BOpenHashTable a true open hash table. This means potentially more copying as the buffer resizes, but may perform better for some workloads, and removes the need for the intrusive linked list.
  1. OR Maintain the chained hash table design but don't use an intrusive linked list, which would mean more allocations but would also mean more safety.
  2. OR Replace the intrusive linked list with a random access collection like a vector or a search tree. This means that you could maintain some invariants for each bucket, for example that the items are sorted so that you can always find duplicates in < O(N) time.

Thanks for spending the time investigating this!!

Sure! Thanks for all of the hard work you guys have done over the years.

comment:17 by ambroff, 5 months ago

I submitted https://review.haiku-os.org/c/haiku/+/2173 for this issue. I'll create a followup ticket for addressing BOpenHashTable cause I think it's worth fixing to make it easier to use correctly.

comment:18 by ambroff, 5 months ago

I filed a follow-up ticket for BOpenHashTable #15660.

Last edited 5 months ago by ambroff (previous) (diff)

in reply to:  16 comment:19 by korli, 5 months ago

Replying to ambroff:

So on the surface it seems like the right default is to just remove the CheckDuplicates option and always perform this duplicate check, since allowing duplicates is not possible.

If enabling CheckDuplicates would have triggered a panic for this bug, then this might be a good idea to enable CheckDuplicates for debug builds.

comment:20 by ambroff, 5 months ago

This is fixed in hrev53800. This can be closed.

If enabling CheckDuplicates would have triggered a panic for this bug, then this might be a good idea to enable CheckDuplicates for debug builds.

Yeah that's a pretty good idea. I'll push a patch.

comment:21 by waddlesplash, 2 months ago

Resolution: fixed
Status: newclosed

Fix was merged in January.

comment:22 by nielx, 2 months ago

Milestone: UnscheduledR1/beta2

Assign tickets with status=closed and resolution=fixed within the R1/beta2 development window to the R1/beta2 Milestone

Note: See TracTickets for help on using tickets.