Opened 11 years ago

Closed 11 years ago

Last modified 11 years ago

#2682 closed bug (fixed)

Network instability issues since net_timer changes

Reported by: anevilyak Owned by: axeld
Priority: high Milestone: R1/alpha1
Component: Network & Internet/Stack Version: R1/pre-alpha1
Keywords: Cc:
Blocked By: Blocking:
Has a Patch: no Platform: All

Description

Since the rewrites in 26980/26981, I've been noticing some instability in the network stack, especially when larger amounts of data are involved. For instance, copying a few hundred megabytes of data via scp will invariably, at some point, terminate the connection with the following message from sshd:

Disconnecting: Corrupted MAC on input.

After something along those lines has happened at least once, connections seem to destabilize much more quickly; for instance, I see Vision's network threads randomly go dead. I'm not sure what information would be most useful for debugging at this point, though.

Change History (19)

comment:1 Changed 11 years ago by anevilyak

Milestone: R1 → R1/alpha1

comment:2 Changed 11 years ago by anevilyak

Also note that this affects svn: a checkout will go dead after a while. I'm now running a current build with only 26980/26981's changes reverted, and it's checking out perfectly, so something in those changes is definitely the issue.

comment:3 Changed 11 years ago by bonefish

Priority: normal → high

I also ran a few svn test checkouts (mostly to play with the I/O stuff, though) with "https://" (hrev27246). One failed with what was apparently a corruption of the protocol data (some XML parser error), while several others succeeded, the only difference being some changed debug output.

comment:4 Changed 11 years ago by anevilyak

That sounds like the sshd problem I was seeing. With svn (via svn+ssh), I'm simply seeing svn halt mid-checkout, though it still responds to Ctrl+C.

comment:5 Changed 11 years ago by anevilyak

Hi,

On looking through 26980, I noticed one thing that might be suspicious: in the loop in wait_for_timer, an entry is created and added to the condition variable but, as far as I can tell, never removed. Could this be causing some of the issues we're seeing?

comment:6 Changed 11 years ago by anevilyak

Never mind, I see now that entry.Wait() does the removal itself once done. Sorry for the noise.

comment:7 Changed 11 years ago by anevilyak

Just curious, what kind of Ethernet driver are you using? I'm wondering if the timer changes might be affecting how the FreeBSD compatibility layer handles timing, as I'm using one of those drivers (nforce/if_nfe).

comment:8 Changed 11 years ago by anevilyak

For reference, one of the recent commits seems to have at least partially fixed this problem... I'm not seeing the data corruption errors any more. I am still seeing the "socket goes dead" problem, though.

comment:9 Changed 11 years ago by axeld

If you only revert hrev26981, does this already help?

There is one problem (that I can see) which I introduced in hrev26980: the timer is accessed after it has been removed from the timer list (to clear the TIMER_IS_RUNNING flag). But after the hook has been called, we cannot really know whether the timer is still accessible. I'll look into it.
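The hazard described here can be made concrete with a small hypothetical sketch. The names (net_timer, TIMER_IS_RUNNING) mirror the ticket, but the structure is illustrative, not the actual Haiku net_timer code; fire_timer_fixed only shows one way to avoid the use-after-free — finishing all bookkeeping on the struct before the hook runs — not the fix that was actually committed.

```cpp
#include <cstdint>

constexpr uint32_t TIMER_IS_RUNNING = 0x01;

struct net_timer {
    uint32_t flags;
    void (*hook)(net_timer* timer);
};

// Buggy order: the hook may free the timer (e.g. an endpoint tearing
// itself down), so touching timer->flags afterwards is a use-after-free.
void fire_timer_buggy(net_timer* timer)
{
    timer->flags |= TIMER_IS_RUNNING;
    timer->hook(timer);
    timer->flags &= ~TIMER_IS_RUNNING;  // *timer may already be gone!
}

// One safer order: stop touching the struct before invoking the hook,
// so the hook is free to delete it.
void fire_timer_fixed(net_timer* timer)
{
    timer->flags &= ~TIMER_IS_RUNNING;  // done with the struct...
    timer->hook(timer);                 // ...so the hook may free it
}
```

Note the trade-off: clearing the flag before the hook changes what a concurrent wait_for_timer() observes while the hook is still executing, which is part of why this interaction is subtle.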

comment:10 Changed 11 years ago by anevilyak

If I remember right, I tried that first and ran into some issue or other, but I don't remember what it was. I'll check again later tonight.

comment:11 Changed 11 years ago by axeld

Does hrev27574 help by any chance?

comment:12 Changed 11 years ago by anevilyak

Will let you know; unfortunately, due to timezone differences, it will probably be around tomorrow morning your time before I'm able to post results.

comment:13 Changed 11 years ago by anevilyak

Problem's still visible with 27574; will try reverting 26981 and see if that changes things.

comment:14 Changed 11 years ago by anevilyak

Reverting just 26981 does indeed seem to fix it.

comment:15 Changed 11 years ago by anevilyak

A possibly stupid question: is it necessary, and possibly problematic, to wait on a cancelled timer? I notice ~TCPEndPoint cancels the time-wait timer but then waits on it anyway. Is this intended?

comment:16 Changed 11 years ago by axeld

Resolution: fixed
Status: new → closed

I notice ~TCPEndPoint cancels the time wait timer, but then waits on it anyways. Is this intended?

This is indeed intended: when you cancel a timer, you only make sure it won't be executed anymore, provided it isn't already running. Waiting on the timer covers the case where you need to wait until it is done executing, for example, when you want to delete it.

Anyway, the problem was that wait_for_timer() would also wait when called in the context of a timer's own execution, which of course causes a deadlock. It's fixed in hrev27620; thanks for the investigation.
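The deadlock has a simple shape: a timer hook, running on the timer thread, ends up in wait_for_timer() and thereby waits for its own completion. A self-contained sketch of a guard against that follows; all names here are hypothetical illustrations, not the code of the actual hrev27620 change.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

std::thread::id gTimerThread;       // the thread that runs timer hooks
std::mutex gLock;
std::condition_variable gDone;
bool gTimerRunning = false;         // true while a hook is executing

void wait_for_timer()
{
    // If a timer hook (on the timer thread) ends up here, blocking would
    // mean waiting for our own hook to return: a guaranteed deadlock.
    // Detect the self-wait and return immediately instead.
    if (std::this_thread::get_id() == gTimerThread)
        return;

    std::unique_lock<std::mutex> lock(gLock);
    gDone.wait(lock, [] { return !gTimerRunning; });
}
```

Any other thread calling wait_for_timer() still blocks until the running hook finishes and gDone is notified; only the timer thread itself takes the early-return path.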

However, I'm not sure the SSH problem is related to this problem. At least the error message surely doesn't fit, and I can't see how this problem could corrupt incoming data.

comment:17 Changed 11 years ago by anevilyak

The SSH problem might actually have been fixed by another commit; I haven't encountered it again. I am still seeing the dead-sockets issue, though, albeit not as often as before. I have TCP tracing enabled now, so we'll see if that helps figure out what's going wrong.

comment:18 Changed 11 years ago by anevilyak

Ah, I read too quickly and didn't see that you had committed another fix. Will try hrev27620. Thanks!

comment:19 Changed 11 years ago by anevilyak

Fix confirmed. Thanks!

Note: See TracTickets for help on using tickets.