Opened 3 years ago
Last modified 8 weeks ago
#17058 new bug
[tcp] crashes in EndpointManager::Unbind on power off
Reported by: | dasebek | Owned by: | axeld |
---|---|---|---|
Priority: | normal | Milestone: | Unscheduled |
Component: | Network & Internet/TCP | Version: | R1/Development |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Platform: | All |
Description
Occasionally, I get an sshd-related kernel panic during the system shutdown. The panic message says "PANIC: bound endpoint 0x... not in hash!" This happens when I repeatedly connect to Haiku over SSH in order to transfer files (scp, rsync), and, after all SSH connections are closed, I power off the system.
I was using an official nightly build of hrev55184 (x86_64). A screenshot is attached.
Attachments (5)
Change History (23)
by , 3 years ago
Attachment: | ssh_crash.png added |
---|
comment:1 by , 3 years ago
Component: | System → Network & Internet/Stack |
---|---|
Owner: | changed from | to
Summary: | [sshd] kernel panic on system power off → [tcp] crashes in EndpointManager::Unbind on power off |
comment:2 by , 9 months ago
Component: | Network & Internet/Stack → Network & Internet/TCP |
---|
comment:3 by , 9 months ago
Reproduced for me once with a very similar stack trace (see screenshot) on Haiku R1 beta 4 hrev56578+95 in VirtualBox 7.0.12. Two SSH connections were open when I initiated a shutdown from the console via ACPI power.
comment:4 by , 8 months ago
Reproduced randomly once on hrev57609 after I closed all SSH connections and about a minute later attempted an ACPI power shutdown, which usually works fine. Attaching screenshot...
comment:5 by , 8 months ago
I found a potential problem that could lead to this and fixed it in hrev57616.
comment:6 by , 3 months ago
The crash on shutdown has just reproduced again when I did the following:
- Boot today's revision of Haiku (I think it is hrev57881).
- Open an SSH connection into the VM.
- In the SSH connection run "pkgman update", let it update some packages and exit.
- Close the SSH connection normally via Ctrl-D at the Linux client.
- Initiate an ACPI shutdown via Host-H in VirtualBox.
- Observe a KDL prompt (attached).
comment:7 by , 3 months ago
kdebug> tcp_endpoints tcp_endpoints -------- TCP Domain 0xffffffff890dc298 --------- address local peer recv-q send-q state 0xffffffff80c6e400 0.0.0.0:22 0.0.0.0:0 0 0 closed 0xffffffff80c68100 192.168.0.27:40053 206.189.242.242:443 24 24 time-wait 0xffffffff80c67e00 192.168.0.27:40002 206.189.242.242:443 24 24 time-wait -------- TCP Domain 0xffffffff88eec5b0 --------- address local peer recv-q send-q state kdebug> tcp_endpoints 0xffffffff80c6cf00 tcp_endpoints 0xffffffff80c6cf00 -------- TCP Domain 0xffffffff890dc298 --------- address local peer recv-q send-q state 0xffffffff80c6e400 0.0.0.0:22 0.0.0.0:0 0 0 closed 0xffffffff80c68100 192.168.0.27:40053 206.189.242.242:443 24 24 time-wait 0xffffffff80c67e00 192.168.0.27:40002 206.189.242.242:443 24 24 time-wait -------- TCP Domain 0xffffffff88eec5b0 --------- address local peer recv-q send-q state
comment:8 by , 3 months ago
kdebug> tcp_endpoint 0xffffffff80c6cf00 tcp_endpoint 0xffffffff80c6cf00 TCP endpoint 0xffffffff80c6cf00 state: syn-received flags: 0x83 lock: { 0xffffffff80c6cf38, holder: 325 } accept sem: -858993460 options: 0x0 send window shift: 0 unacknowledged: 0 next: 0 max: 0 urgent offset: 0 window: 0 max window: 0 max segment size: 536 queue: 0 / 65535 last acknowledge sent: 3435973836 initial sequence: 0 receive window shift: 0 next: 0 max advertised: 0 window: 65535 max segment size: 536 queue: 0 / 65535 initial sequence: 3435973836 duplicate acknowledge count: 0 smoothed round trip time: 0 (deviation 0) retransmit timeout: 2000000 congestion window: 0 slow start threshold: 0
comment:9 by , 3 months ago
kdebug> dl 0xffffffff80c6cf00+24 dl 0xffffffff80c6cf00+24 [0xffffffff80c6cf18] .6.......^...... ffffffff88ca3610 ffffffff82145e80
comment:10 by , 3 months ago
kdebug> db 0xffffffff80c6cf18+32 db 0xffffffff80c6cf18+32 [0xffffffff80c6cf38] HC.............. 48 43 ac 81 ff ff ff ff 00 00 00 00 00 00 00 00
comment:11 by , 3 months ago
kdebug> socket 0xffffffff80c6cf18 socket 0xffffffff80c6cf18 SOCKET 0xffffffff80c6cf18 family.type.protocol: -2119417016.-1.0 [*** READ FAULT at 0xccccccd8, pc: 0xffffffff80172f1b ***]
comment:12 by , 3 months ago
kdebug> dl 0xffffffff80c6cf00+16 dl 0xffffffff80c6cf00+16 [0xffffffff80c6cf10] .6.......6...... ffffffff88ca3610 ffffffff88ca3610
comment:13 by , 3 months ago
From this and other information in the KDL session it just looks like the socket has been freed already, and so we are using a freed socket with an invalid address.
I will add some more assertions to try and catch this.
comment:14 by , 2 months ago
dovsienko reported the build with the extra asserts didn't seem to trigger anything. It's possible my changes to TCP (especially the one to keepalive timers) somehow fixed this.
Anyway, here's the patch with extra asserts I had created for testing:
diff --git a/src/add-ons/kernel/network/stack/net_socket.cpp b/src/add-ons/kernel/network/stack/net_socket.cpp index f6d588f509..c0d06ef71c 100644 --- a/src/add-ons/kernel/network/stack/net_socket.cpp +++ b/src/add-ons/kernel/network/stack/net_socket.cpp @@ -36,6 +36,8 @@ #include "ancillary_data.h" #include "utility.h" +#include "../../../../system/kernel/slab/slab_debug.h" + //#define TRACE_SOCKET #ifdef TRACE_SOCKET @@ -54,6 +56,8 @@ struct net_socket_private : net_socket, net_socket_private(); ~net_socket_private(); + static void operator delete(void* mem) { fill_freed_block(mem, sizeof(net_socket_private)); } + void RemoveFromParent(); BWeakReference<net_socket_private> parent; @@ -128,9 +132,11 @@ net_socket_private::~net_socket_private() // also delete all children of this socket while (net_socket_private* child = pending_children.RemoveHead()) { + ASSERT(child->parent == this); child->RemoveFromParent(); } while (net_socket_private* child = connected_children.RemoveHead()) { + ASSERT(child->parent == this); child->RemoveFromParent(); } diff --git a/src/add-ons/kernel/network/stack/stack.cpp b/src/add-ons/kernel/network/stack/stack.cpp index 1efd538432..82391f561d 100644 --- a/src/add-ons/kernel/network/stack/stack.cpp +++ b/src/add-ons/kernel/network/stack/stack.cpp @@ -418,6 +418,10 @@ uninit_domain_protocols(net_socket* socket) net_protocol* protocol = socket->first_protocol; while (protocol != NULL) { net_protocol* next = protocol->next; + + ASSERT_PRINT(protocol->socket == socket, + "protocol: %p, socket: %p, protocol->socket: %p", + protocol, socket, protocol->socket); protocol->module->uninit_protocol(protocol); protocol = next;
comment:15 by , 8 weeks ago
The problem has just reproduced on the latest nightly build. The steps that lead to it were as follows:
- Keep the VM running for some time (possibly open and close some SSH connections, but cannot remember).
- Run a terminal in the Haiku desktop.
- Open 3 or 4 SSH sessions from a Linux client.
- Do some unrelated debugging.
- Close the Haiku desktop terminal.
- Close all SSH sessions normally (Ctrl-D from the Linux end).
- In the last remaining SSH session run "shutdown".
- The shutdown proceeds as usual, then nearly at the moment of the power off the KDL prompt appears (attached just before).
comment:16 by , 8 weeks ago
kdebug> tcp_endpoint 0xffffffff80c6de00 tcp_endpoint 0xffffffff80c6de00 TCP endpoint 0xffffffff80c6de00 socket: 0xffffffff87712b70 state: closed flags: 0x0 lock: { 0xffffffff80c6de38, holder: -1 } accept sem: -2017802768 options: 0xff send window shift: 255 unacknowledged: 2272722560 next: 4294967295 max: 0 urgent offset: 0 window: 0 max window: 0 max segment size: 0 queue: 18446744071690971168 / 18446744071691678296 last acknowledge sent: 2276219624 initial sequence: 4294967295 receive window shift: 255 next: 0 max advertised: 0 window: 0 max segment size: 0 queue: 0 / 18446744071691673088 initial sequence: 0 duplicate acknowledge count: 0 smoothed round trip time: 0 (deviation -2022282456) retransmit timeout: 0 congestion window: 4294967295 slow start threshold: 0 kdebug> kdebug> socket 0xffffffff87712b70 socket 0xffffffff87712b70 SOCKET 0xffffffff87712b70 family.type.protocol: 0.0.0 [*** READ FAULT at 0x20000000c, pc: 0xffffffff80175aed ***] kdebug>
comment:17 by , 8 weeks ago
kdebug> slab_object 0xffffffff80c6de00 0xffffffff80c6de00 does not seem to be in an object_cache
The socket address was in an object cache, but a packagefs one. It seems the memory was freed and reused.
Panic screenshot