Opened 12 months ago

Closed 10 months ago

Last modified 8 months ago

#18759 closed bug (fixed)

TCP crash, possible regression

Reported by: axeld Owned by: axeld
Priority: blocker Milestone: R1/beta5
Component: Network & Internet/TCP Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

I just booted into my main Haiku VM after an update to todays revision, and was greeted by this KDL.

Attachments (2)

HaikuMainScreenshot-2024-01-19.png (112.9 KB ) - added by axeld 12 months ago.
HaikuScreenshots-2024-03-22.png (107.3 KB ) - added by axeld 10 months ago.
Crash with hrev57659

Download all attachments as: .zip

Change History (17)

comment:1 by axeld, 12 months ago

Milestone: UnscheduledR1/beta5
Priority: normalblocker

Seems to be reproducible 100% on this VM; I got it three times in a row now.

comment:2 by axeld, 12 months ago

In net_socket.cpp, line 792 is an unchecked access to net_socket::parent, which seems to be NULL in this case.

comment:3 by waddlesplash, 12 months ago

But "parent" is a weak pointer, and we then check whether we actually got a reference on the next line. So that does not look like it can be the problem?

It almost looks like the socket itself might be NULL.

comment:4 by axeld, 12 months ago

Yes, indeed.

comment:5 by waddlesplash, 12 months ago

Nothing in _MarkEstablished() invokes socket_aborted directly. This looks like a tail call from something else.

The only thing in the whole tree that calls set_aborted() is TCPEndpoint::_Close. But that invokes has_parent first, so I'd expect that to have crashed instead. So this is strange.

comment:6 by waddlesplash, 10 months ago

Ah, I think I see the problem. There's a major API footgun here.

comment:7 by waddlesplash, 10 months ago

Resolution: fixed
Status: newclosed

Fixed in hrev57645.

comment:8 by axeld, 10 months ago

Resolution: fixed
Status: closedreopened

I guess you only fixed a symptom, although it doesn't happen directly on boot anymore. Soon after it still crashes again. Stacktrace attached.

by axeld, 10 months ago

Crash with hrev57659

comment:9 by waddlesplash, 10 months ago

This looks like a completely different problem; or maybe the two are related but the relation isn't clear. This looks to be a use-after-free.

comment:10 by axeld, 10 months ago

I can open another ticket for it, if you prefer.

comment:11 by waddlesplash, 10 months ago

I guess it doesn't make much difference so we might as well stick with this one.

comment:12 by waddlesplash, 10 months ago

How have we gotten to socket_dequeue_connected from tcp_receive_data? The usual path for accept (from a debug build, so no tail calls):

 7 ffffffff81240cc0 (+  48) ffffffff81ee304b   </boot/system/add-ons/kernel/network/stack> socket_dequeue_connected(net_socket*, net_socket**) + 0xb9
 8 ffffffff81240cf0 (+  80) ffffffff81cef9b3   </boot/system/add-ons/kernel/network/protocols/tcp> TCPEndpoint::Accept(net_socket**) + 0x125
 9 ffffffff81240d40 (+  32) ffffffff81ced70d   </boot/system/add-ons/kernel/network/protocols/tcp> tcp_accept(net_protocol*, net_socket**) + 0x23
10 ffffffff81240d60 (+  64) ffffffff81ee38f2   </boot/system/add-ons/kernel/network/stack> socket_accept[clone .localalias] (net_socket*, sockaddr*, unsigned int*, net_socket**) + 0x52
11 ffffffff81240da0 (+  48) ffffffff81eed127   </boot/system/add-ons/kernel/network/stack> stack_interface_accept(net_socket*, sockaddr*, unsigned int*, net_socket**) + 0x3c
12 ffffffff81240dd0 (+ 112) ffffffff801f8bb4   <kernel_x86_64> common_accept(int, sockaddr*, unsigned int*, bool) + 0x76

comment:13 by waddlesplash, 10 months ago

Actually, an even better question is: how have we gotten to socket_dequeue_connected without a socket_accept appearing anywhere in the stack trace? tcp_accept/TCPEndpoint::Accept isn't called anywhere within the module, and the only thing that calls it in the network module (as far as I can tell) is socket_accept, which has more logic after it calls this hook and couldn't be tail-call-optimized away. So what's going on here?

comment:14 by waddlesplash, 10 months ago

Resolution: fixed
Status: reopenedclosed

Ah, I see the problem: your TCP addon is from "non-packaged". In hrev57408 the size of net_socket_module_info changed as some methods were removed. So that's the issue.

comment:15 by axeld, 8 months ago

Duh, I feel stupid now! Nice catch! I had forgotten about this; I tested turning off SACK. This also fixes my 'git fetch' issues that I did not look into yet, thanks a lot!

Note: See TracTickets for help on using tickets.