Opened 13 months ago

Closed 4 months ago

#18384 closed bug (fixed)

TCP hangs

Reported by: axeld
Owned by: axeld
Priority: normal
Milestone: R1/beta5
Component: Network & Internet/TCP
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking:
Platform: All

Description

Quite some time ago, I started writing a backup tool that I have been using personally to back up my data for a couple of years now. Unfortunately, it never got polished enough to distribute.

Anyway, I updated my NAS twice in January, first on the 2nd and then on the 21st. Since then, I have been getting timeouts in TCP send. The problem is 100% reproducible when I send a large number of single buffers that are about 130 bytes in size. I am now on hrev56921, and the problem persists. I'm not sure whether the January 2nd version was already affected, but I'm pretty sure the November 14th 2022 version was not.

When I send these buffers, after a while the call times out, no matter whether I choose a timeout of 5 seconds or 150 seconds.
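
For illustration, the failing pattern boils down to a loop of small send() calls on a socket that has a send timeout set. A minimal sketch of that pattern (this is not the actual backup tool code, just an assumed equivalent using plain POSIX sockets):

{{{
/* Hypothetical sketch of the send pattern described above: many small
 * buffers (~130 bytes) written to a TCP socket with a send timeout.
 * Illustration only, not the backup tool's code. */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

static int
send_small_buffers(int socket, int count)
{
	struct timeval timeout = { .tv_sec = 5, .tv_usec = 0 };
	char buffer[130];

	/* Apply a send timeout; per the description, 5 s and 150 s behave the same. */
	if (setsockopt(socket, SOL_SOCKET, SO_SNDTIMEO, &timeout,
			sizeof(timeout)) != 0)
		return -1;

	memset(buffer, 'x', sizeof(buffer));

	for (int i = 0; i < count; i++) {
		ssize_t written = send(socket, buffer, sizeof(buffer), 0);
		if (written < 0) {
			/* After a while this fails once the connection has stalled. */
			perror("send");
			return -1;
		}
	}
	return 0;
}
}}}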

I don't even remember how to get a stack trace of the waiting kernel thread, so I didn't gather any data beyond that observation. If you need anything from me, I'm willing to relearn, though :-)

Attachments (1)

shredder-2023-07-27.zip (1.3 MB ) - added by axeld 10 months ago.
tcpdump from the server


Change History (8)

comment:1 by pulkomandy, 13 months ago

I guess a tcpdump capture may be a good start for investigating. One thing to look for is the evolution of the TCP window sizes (you can graph them using the "Statistics" menu in Wireshark, for example). We've had cases where the window size ends up being 0 for some reason, and then nothing can be sent anymore.

comment:2 by axeld, 10 months ago

I was able to boot an older state on the machine running the server (a kernel from November 2022), but that didn't solve the issue.

I will attach a tcpdump showing the issue. The server part lives on 192.168.0.48 port 4242, the client on 192.168.0.14 (port 40204). It seems to send a lot of small packets instead of doing local buffering; this might be the trigger of the issue. What Wireshark detects as the SMPP protocol is actually not SMPP, but the data packets of my application.
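
To illustrate what I mean by local buffering: something along these lines on the client side would coalesce the small records into larger writes (a hypothetical sketch, names made up, not code from my application):

{{{
/* Hypothetical "local buffering": instead of one send() per ~130-byte
 * record, records are coalesced into a larger buffer and flushed in
 * bigger chunks. */
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define COALESCE_SIZE 8192

struct send_buffer {
	char	data[COALESCE_SIZE];
	size_t	used;
};

static int
flush_buffer(int socket, struct send_buffer* buffer)
{
	size_t offset = 0;
	while (offset < buffer->used) {
		ssize_t written = send(socket, buffer->data + offset,
			buffer->used - offset, 0);
		if (written < 0)
			return -1;
		offset += written;
	}
	buffer->used = 0;
	return 0;
}

static int
queue_record(int socket, struct send_buffer* buffer, const void* record,
	size_t size)
{
	/* Flush first if the record would no longer fit. */
	if (buffer->used + size > COALESCE_SIZE
		&& flush_buffer(socket, buffer) != 0)
		return -1;

	memcpy(buffer->data + buffer->used, record, size);
	buffer->used += size;
	return 0;
}
}}}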

I see lots of duplicate ACKs, and even retransmissions, before it stops completely.

I haven't seen any issues moving several GB to and from that machine with SCP or SMB so far, but I haven't monitored those with tcpdump yet either. ifconfig does not show any errors at the Ethernet level.

by axeld, 10 months ago

Attachment: shredder-2023-07-27.zip added

tcpdump from the server

comment:3 by pulkomandy, 10 months ago

A few things I noticed in Wireshark:

  • Looking at the "window scaling" graph, the time where the traffic stalls is clearly seen. Interestingly we can see some retransmission retries (marked as dots in the graph), but the "in flight bytes" quantity never seems to move
  • Looking at the packets around that time interval, we can see packet number 42752 where 192.168.0.14 decides to use a selective ACK for the first time, first with one range of received packets, then with two and later three ranges of received packets with holes in between
  • 192.168.0.48 replies to this by re-sending the last packet from normal ACKs as expected (relative sequence number 2922770) but this is never acknowledged by the other machine
  • It also sends packet 2931722 which does not seem to be matching with any of the selective ACK ranges from 192.168.0.14. This also has no effect on the ranges acknowledged by 192.168.0.14

Since the connection stalls so shortly after SACK is used for the first time, this certainly raises questions about the implementation of that feature. It has a history of creating problems; see #13681.

It looks like as soon as we start sending SACK options, the normal mechanism to acknowledge received frames stops working. After that, the transmission stalls, since no new packets are ever acknowledged.

But it is also surprising that the machine sends a SACK in the first place. Why does it think there are such big holes in the received data? The corresponding packets are present in the capture, so where did they disappear?
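
For reference, the SACK blocks discussed above follow the layout from RFC 2018: option kind 5, a length byte, then up to four pairs of 32-bit sequence numbers giving the left and right edge of each received block. A minimal parsing sketch (not taken from Haiku's TCP implementation) to make the ranges concrete:

{{{
/* Dump the blocks of a TCP SACK option (RFC 2018).
 * option[0] = kind (5), option[1] = total length in bytes,
 * followed by pairs of 32-bit left/right edges in network byte order. */
#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TCP_OPTION_SACK 5

static void
dump_sack_option(const uint8_t* option)
{
	if (option[0] != TCP_OPTION_SACK)
		return;

	int blockCount = (option[1] - 2) / 8;
	for (int i = 0; i < blockCount; i++) {
		uint32_t leftEdge, rightEdge;
		memcpy(&leftEdge, option + 2 + i * 8, 4);
		memcpy(&rightEdge, option + 6 + i * 8, 4);
		printf("SACK block %d: %u - %u\n", i,
			(unsigned)ntohl(leftEdge), (unsigned)ntohl(rightEdge));
	}
}
}}}

Each block tells the sender which ranges have already arrived, so it is only expected to retransmit the gaps below and between the advertised blocks; a retransmission that matches none of the advertised edges is indeed odd.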

comment:4 by axeld, 9 months ago

Thanks for looking into this! The TCP dump was recorded on 192.168.0.48, so it's natural that the packets are not missing there; they might still be missing on the target, though.

However, the DUP-ACKs first sent by 192.168.0.48 are completely bogus, as it has already acknowledged all in-flight data at that point. So this definitely looks like a bug.

I will try to disable SACK support and check whether that fixes things. The TCP implementation isn't really that nice to follow, unfortunately.

comment:5 by axeld, 7 months ago

Unless I messed it up, disabling SACK (on one end) did not fix the issue. I'll try to get new TCP dumps with this from both ends.

comment:6 by waddlesplash, 5 months ago

Please retest after hrev57492.

comment:7 by axeld, 4 months ago

Milestone: Unscheduled → R1/beta5
Resolution: fixed
Status: new → closed

Nice, this seems to be fixed!
