Opened 3 years ago

Last modified 8 weeks ago

#17058 new bug

[tcp] crashes in EndpointManager::Unbind on power off

Reported by: dasebek Owned by: axeld
Priority: normal Milestone: Unscheduled
Component: Network & Internet/TCP Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

Occasionally, I get an sshd-related kernel panic during the system shutdown. The panic message says "PANIC: bound endpoint 0x... not in hash!" This happens when I repeatedly connect to Haiku over SSH in order to transfer files (scp, rsync), and, after all SSH connections are closed, I power off the system.

I was using an official nightly build of hrev55184 (x86_64). A screenshot is attached.

Attachments (5)

ssh_crash.png (65.1 KB ) - added by dasebek 3 years ago.
Panic screenshot
Screenshot_2024-02-14_13-45-52.png (33.1 KB ) - added by dovsienko 9 months ago.
hrev56578+95
Screenshot_2024-02-28_16-03-41.png (31.1 KB ) - added by dovsienko 8 months ago.
hrev57609
Screenshot_2024-07-30_19-21-33.png (31.5 KB ) - added by dovsienko 3 months ago.
KDL screenshot
Screenshot_2024-09-10_20-22-11.png (30.9 KB ) - added by dovsienko 8 weeks ago.
KDL screenshot

Download all attachments as: .zip

Change History (23)

by dasebek, 3 years ago

Attachment: ssh_crash.png added

Panic screenshot

comment:1 by diver, 3 years ago

Component: SystemNetwork & Internet/Stack
Owner: changed from nobody to axeld
Summary: [sshd] kernel panic on system power off[tcp] crashes in EndpointManager::Unbind on power off

comment:2 by waddlesplash, 9 months ago

Component: Network & Internet/StackNetwork & Internet/TCP

by dovsienko, 9 months ago

comment:3 by dovsienko, 9 months ago

Reproduced for me once with a very similar stack trace (see screenshot) on Haiku R1 beta 4 hrev56578+95 in VirtualBox 7.0.12. Two SSH connections were open when I initiated a shutdown from the console via ACPI power.

comment:4 by dovsienko, 8 months ago

Reproduced randomly once on hrev57609 after I closed all SSH connections and about a minute later attempted an ACPI power shutdown, which usually works fine. Attaching screenshot...

comment:5 by waddlesplash, 8 months ago

I found a potential problem that could lead to this and fixed it in hrev57616.

comment:6 by dovsienko, 3 months ago

The crash on shutdown has just reproduced again when I did the following:

  1. Boot today's revision of Haiku (I think it is hrev57881).
  2. Open an SSH connection into the VM.
  3. In the SSH connection run "pkgman update", let it update some packages and exit.
  4. Close the SSH connection normally via Ctrl-D at the Linux client.
  5. Initiate an ACPI shutdown via Host-H in VirtualBox.
  6. Observe a KDL prompt (attached).

by dovsienko, 3 months ago

KDL screenshot

comment:7 by dovsienko, 3 months ago

kdebug> tcp_endpoints
tcp_endpoints
-------- TCP Domain 0xffffffff890dc298 ---------
   address                 local                  peer   recv-q   send-q        state
0xffffffff80c6e400            0.0.0.0:22             0.0.0.0:0        0        0       closed
0xffffffff80c68100    192.168.0.27:40053   206.189.242.242:443       24       24    time-wait
0xffffffff80c67e00    192.168.0.27:40002   206.189.242.242:443       24       24    time-wait
-------- TCP Domain 0xffffffff88eec5b0 ---------
   address                 local                  peer   recv-q   send-q        state
kdebug> tcp_endpoints 0xffffffff80c6cf00
tcp_endpoints 0xffffffff80c6cf00
-------- TCP Domain 0xffffffff890dc298 ---------
   address                 local                  peer   recv-q   send-q        state
0xffffffff80c6e400            0.0.0.0:22             0.0.0.0:0        0        0       closed
0xffffffff80c68100    192.168.0.27:40053   206.189.242.242:443       24       24    time-wait
0xffffffff80c67e00    192.168.0.27:40002   206.189.242.242:443       24       24    time-wait
-------- TCP Domain 0xffffffff88eec5b0 ---------
   address                 local                  peer   recv-q   send-q        state

comment:8 by dovsienko, 3 months ago

kdebug> tcp_endpoint 0xffffffff80c6cf00
tcp_endpoint 0xffffffff80c6cf00
TCP endpoint 0xffffffff80c6cf00
  state: syn-received
  flags: 0x83
  lock: { 0xffffffff80c6cf38, holder: 325 }
  accept sem: -858993460
  options: 0x0
  send
    window shift: 0
    unacknowledged: 0
    next: 0
    max: 0
    urgent offset: 0
    window: 0
    max window: 0
    max segment size: 536
    queue: 0 / 65535
    last acknowledge sent: 3435973836
    initial sequence: 0
  receive
    window shift: 0
    next: 0
    max advertised: 0
    window: 65535
    max segment size: 536
    queue: 0 / 65535
    initial sequence: 3435973836
    duplicate acknowledge count: 0
  smoothed round trip time: 0 (deviation 0)
  retransmit timeout: 2000000
  congestion window: 0
  slow start threshold: 0

comment:9 by dovsienko, 3 months ago

kdebug> dl 0xffffffff80c6cf00+24
   dl 0xffffffff80c6cf00+24
[0xffffffff80c6cf18]  .6.......^......   ffffffff88ca3610 ffffffff82145e80

comment:10 by dovsienko, 3 months ago

kdebug> db 0xffffffff80c6cf18+32
db 0xffffffff80c6cf18+32
[0xffffffff80c6cf38]  HC..............   48 43 ac 81 ff ff ff ff 00 00 00 00 00 00 00 00

comment:11 by dovsienko, 3 months ago

kdebug> socket 0xffffffff80c6cf18
socket 0xffffffff80c6cf18
SOCKET 0xffffffff80c6cf18
  family.type.protocol: -2119417016.-1.0

[*** READ FAULT at 0xccccccd8, pc: 0xffffffff80172f1b ***]

comment:12 by dovsienko, 3 months ago

kdebug> dl 0xffffffff80c6cf00+16
dl 0xffffffff80c6cf00+16
[0xffffffff80c6cf10]  .6.......6......   ffffffff88ca3610 ffffffff88ca3610

comment:13 by waddlesplash, 3 months ago

From this and other information in the KDL session it just looks like the socket has been freed already, and so we are using a freed socket with an invalid address.

I will add some more assertions to try and catch this.

comment:14 by waddlesplash, 2 months ago

dovsienko reported the build with the extra asserts didn't seem to trigger anything. It's possible my changes to TCP (especially the one to keepalive timers) somehow fixed this.

Anyway, here's the patch with extra asserts I had created for testing:

diff --git a/src/add-ons/kernel/network/stack/net_socket.cpp b/src/add-ons/kernel/network/stack/net_socket.cpp
index f6d588f509..c0d06ef71c 100644
--- a/src/add-ons/kernel/network/stack/net_socket.cpp
+++ b/src/add-ons/kernel/network/stack/net_socket.cpp
@@ -36,6 +36,8 @@
 #include "ancillary_data.h"
 #include "utility.h"
 
+#include "../../../../system/kernel/slab/slab_debug.h"
+
 
 //#define TRACE_SOCKET
 #ifdef TRACE_SOCKET
@@ -54,6 +56,8 @@ struct net_socket_private : net_socket,
 	net_socket_private();
 	~net_socket_private();
 
+	static	void				operator delete(void* mem) { fill_freed_block(mem, sizeof(net_socket_private)); }
+
 	void RemoveFromParent();
 
 	BWeakReference<net_socket_private> parent;
@@ -128,9 +132,11 @@ net_socket_private::~net_socket_private()
 
 	// also delete all children of this socket
 	while (net_socket_private* child = pending_children.RemoveHead()) {
+		ASSERT(child->parent == this);
 		child->RemoveFromParent();
 	}
 	while (net_socket_private* child = connected_children.RemoveHead()) {
+		ASSERT(child->parent == this);
 		child->RemoveFromParent();
 	}
 
diff --git a/src/add-ons/kernel/network/stack/stack.cpp b/src/add-ons/kernel/network/stack/stack.cpp
index 1efd538432..82391f561d 100644
--- a/src/add-ons/kernel/network/stack/stack.cpp
+++ b/src/add-ons/kernel/network/stack/stack.cpp
@@ -418,6 +418,10 @@ uninit_domain_protocols(net_socket* socket)
 	net_protocol* protocol = socket->first_protocol;
 	while (protocol != NULL) {
 		net_protocol* next = protocol->next;
+
+		ASSERT_PRINT(protocol->socket == socket,
+			"protocol: %p, socket: %p, protocol->socket: %p",
+			protocol, socket, protocol->socket);
 		protocol->module->uninit_protocol(protocol);
 
 		protocol = next;

by dovsienko, 8 weeks ago

KDL screenshot

comment:15 by dovsienko, 8 weeks ago

The problem has just reproduced on the latest nightly build. The steps that lead to it were as follows:

  1. Keep the VM running for some time (possibly open and close some SSH connections, but cannot remember).
  2. Run a terminal in the Haiku desktop.
  3. Open 3 or 4 SSH sessions from a Linux client.
  4. Do some unrelated debugging.
  5. Close the Haiku desktop terminal.
  6. Close all SSH sessions normally (Ctrl-D from the Linux end).
  7. In the last remaining SSH session run "shutdown".
  8. The shutdown proceeds as usual, then nearly at the moment of the power off the KDL prompt appears (attached just before).

comment:16 by dovsienko, 8 weeks ago

kdebug> tcp_endpoint 0xffffffff80c6de00
tcp_endpoint 0xffffffff80c6de00
TCP endpoint 0xffffffff80c6de00
  socket: 0xffffffff87712b70
  state: closed
  flags: 0x0
  lock: { 0xffffffff80c6de38, holder: -1 }
  accept sem: -2017802768
  options: 0xff
  send
    window shift: 255
    unacknowledged: 2272722560
    next: 4294967295
    max: 0
    urgent offset: 0
    window: 0
    max window: 0
    max segment size: 0
    queue: 18446744071690971168 / 18446744071691678296
    last acknowledge sent: 2276219624
    initial sequence: 4294967295
  receive
    window shift: 255
    next: 0
    max advertised: 0
    window: 0
    max segment size: 0
    queue: 0 / 18446744071691673088
    initial sequence: 0
    duplicate acknowledge count: 0
  smoothed round trip time: 0 (deviation -2022282456)
  retransmit timeout: 0
  congestion window: 4294967295
  slow start threshold: 0
kdebug> 
kdebug> socket 0xffffffff87712b70
socket 0xffffffff87712b70
SOCKET 0xffffffff87712b70
  family.type.protocol: 0.0.0

[*** READ FAULT at 0x20000000c, pc: 0xffffffff80175aed ***]
kdebug> 

comment:17 by waddlesplash, 8 weeks ago

kdebug> slab_object 0xffffffff80c6de00
0xffffffff80c6de00 does not seem to be in an object_cache

The socket address was in an object cache, but a packagefs one. It seems the memory was freed and reused.

comment:18 by dovsienko, 8 weeks ago

For reference, that occurrence was on hrev58097.

Note: See TracTickets for help on using tickets.