#2682 closed bug (fixed)
Network instability issues since net_timer changes
Reported by: | anevilyak | Owned by: | axeld |
---|---|---|---|
Priority: | high | Milestone: | R1/alpha1 |
Component: | Network & Internet/Stack | Version: | R1/pre-alpha1 |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Platform: | All |
Description
Since the rewrites in 26980/26981, I've been seeing instability in the network stack, especially when larger amounts of data are involved. For instance, copying a few hundred megabytes of data via scp reliably terminates the connection at some point with the following message from sshd:
Disconnecting: Corrupted MAC on input.
After something along those lines has happened at least once, connections seem to destabilize much more quickly; for instance, I see Vision's network threads randomly go dead. I'm not sure what information would be most useful for debugging at this point, though.
Change History (19)
comment:1 by , 16 years ago
Milestone: | R1 → R1/alpha1 |
---|
comment:2 by , 16 years ago
comment:3 by , 16 years ago
Priority: | normal → high |
---|
comment:4 by , 16 years ago
That sounds like the sshd problem I was seeing. With svn (via svn+ssh), I'm seeing svn simply halt mid-checkout, though it still responds to Ctrl+C.
comment:5 by , 16 years ago
Hi,
Looking through 26980, I noticed one thing that might be suspicious: in the loop in wait_for_timer, an entry is created and added to the condition variable but, as far as I can tell, never removed. Could this cause some of the issues we're seeing?
comment:6 by , 16 years ago
Never mind, I see now that entry.Wait() does the removal itself once done. Sorry for the noise.
comment:7 by , 16 years ago
Just curious, what kind of ethernet driver are you using? I'm wondering if the timer changes might be impacting something with respect to how the FreeBSD layer handles timing, as I'm using one of those drivers (nforce/if_nfe).
comment:8 by , 16 years ago
For reference, one of the recent commits seems to have at least partially fixed this problem: I'm not seeing the data corruption errors any more. I am still seeing the "socket goes dead" problem, though.
comment:9 by , 16 years ago
If you only revert hrev26981, does this already help?
There is one problem (that I see) I introduced in hrev26980: the timer is accessed after it has been removed from the timer list (to remove the TIMER_IS_RUNNING flag). But after the hook has been called, we cannot really know if the timer is still accessible. I'll look into it.
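To make the hazard concrete, here is a minimal sketch of why touching a timer after its hook has returned is unsafe, and one possible way around it. It is not the actual net_timer code: `HookTimer`, `SketchExecutor`, and `SelfDeletingHookIsSafe` are hypothetical names, and keeping the "running" state in the executor instead of a flag on the timer is only an assumed workaround.

```cpp
#include <cassert>
#include <functional>

// Hypothetical stand-in for a timer; not Haiku's net_timer.
struct HookTimer {
    std::function<void(HookTimer*)> hook;
};

// Hazardous pattern described in the comment above: set a flag on the
// timer, call the hook, then clear the flag on the timer. If the hook
// deleted the timer, clearing the flag touches freed memory.
//
// Assumed alternative: remember which timer is running in the executor,
// so nothing has to be written to the timer once the hook has returned.
class SketchExecutor {
public:
    void Execute(HookTimer* timer) {
        assert(timer != nullptr);
        fCurrent = timer;
        // Copy the callable first so it outlives a self-deleting timer.
        std::function<void(HookTimer*)> hook = timer->hook;
        hook(timer);
        // Only the executor's own state is updated; 'timer' is not
        // dereferenced after the hook.
        fCurrent = nullptr;
    }

    bool IsRunning(const HookTimer* timer) const { return fCurrent == timer; }
    bool IsIdle() const { return fCurrent == nullptr; }

private:
    HookTimer* fCurrent = nullptr;
};

// Runs a hook that deletes its own timer and reports whether the
// executor tracked it correctly without touching the timer afterwards.
bool SelfDeletingHookIsSafe() {
    SketchExecutor executor;
    bool sawRunning = false;
    HookTimer* timer = new HookTimer;
    timer->hook = [&](HookTimer* self) {
        sawRunning = executor.IsRunning(self);
        delete self;  // the timer is gone once the hook returns
    };
    executor.Execute(timer);
    return sawRunning && executor.IsIdle();
}
```

A real implementation would also need locking around the "current timer" state; that is left out here to keep the ownership problem visible.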
comment:10 by , 16 years ago
If I remember right, I tried that first and ran into some issue or other, but I don't remember what it was. I'll check again later tonight.
comment:12 by , 16 years ago
Will let you know. Unfortunately, due to timezone differences, it will probably be around tomorrow morning your time before I'm able to post results, though.
comment:13 by , 16 years ago
The problem is still visible with 27574; I'll try reverting 26981 and see if that changes things.
comment:15 by , 16 years ago
A possibly stupid question: is it necessary and/or possibly problematic to wait on a cancelled timer? I notice ~TCPEndPoint cancels the time wait timer, but then waits on it anyways. Is this intended?
comment:16 by , 16 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
> I notice ~TCPEndPoint cancels the time wait timer, but then waits on it anyways. Is this intended?
This is indeed intended: cancelling a timer only makes sure it won't be executed anymore if it's not already running. Waiting for a timer covers the case where you need to wait until the timer is done executing, for example, when you want to delete it.
Anyway, the problem was that wait_for_timer() would also wait when called from within the context of a timer execution, which of course causes a deadlock. It's fixed in hrev27620, thanks for the investigation.
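As an illustration of that deadlock, here is a minimal sketch, assuming a single timer thread and a wait built on a mutex and condition variable. It is not the code from hrev27620: `WaitableTimer`, `SketchTimerRunner`, and `HookCanWaitOnItself` are hypothetical names, and comparing thread IDs is just one plausible way to detect that the caller is the timer's own execution context.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a timer; not Haiku's net_timer.
struct WaitableTimer {
    std::function<void()> hook;
};

class SketchTimerRunner {
public:
    // Executes the hook on the calling thread, which plays the role of
    // the timer thread in this sketch.
    void Run(WaitableTimer& timer) {
        assert(timer.hook != nullptr);
        {
            std::lock_guard<std::mutex> lock(fLock);
            fRunning = &timer;
            fRunnerThread = std::this_thread::get_id();
        }
        timer.hook();
        {
            std::lock_guard<std::mutex> lock(fLock);
            fRunning = nullptr;
        }
        fDone.notify_all();
    }

    // Blocks until 'timer' is no longer executing. Returns false if the
    // wait was skipped because the caller is the hook of that very timer;
    // blocking there would wait for the hook to finish while the hook is
    // the one waiting, which can never complete.
    bool WaitFor(WaitableTimer& timer) {
        std::unique_lock<std::mutex> lock(fLock);
        if (fRunning == &timer && fRunnerThread == std::this_thread::get_id())
            return false;
        fDone.wait(lock, [&] { return fRunning != &timer; });
        return true;
    }

private:
    std::mutex fLock;
    std::condition_variable fDone;
    WaitableTimer* fRunning = nullptr;
    std::thread::id fRunnerThread;
};

// A hook that waits on its own timer, as a destructor called from the
// hook might do. Returns true if the wait was skipped instead of hanging.
bool HookCanWaitOnItself() {
    SketchTimerRunner runner;
    WaitableTimer timer;
    bool waited = true;
    timer.hook = [&] { waited = runner.WaitFor(timer); };
    runner.Run(timer);
    return !waited;
}
```

Without the thread check in `WaitFor`, the call made from inside the hook would block forever, which matches the "socket goes dead" symptom of a thread that never returns.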
However, I'm not sure the SSH problem is related to this problem. At least the error message surely doesn't fit, and I can't see how this problem could corrupt incoming data.
comment:17 by , 16 years ago
The SSH problem might actually have been fixed by another commit; I haven't encountered that one again. I am still seeing the dead sockets issue, though, albeit not as often as before. I have TCP tracing enabled now, so I will see if that helps figure out what's going wrong.
comment:18 by , 16 years ago
Ah, I read too quickly and didn't see that you committed another fix. Will try hrev27620. Thanks!
Also note that this affects svn: a checkout will go dead after a while. I'm now running a current build with only 26980/26981's changes reverted, and it's checking out perfectly. So something in those changes is definitely the issue.