Opened 6 years ago

Closed 5 years ago

Last modified 4 years ago

#13789 closed bug (fixed)

DHCP client loops for a long time for no apparent reason

Reported by: waddlesplash Owned by: waddlesplash
Priority: normal Milestone: R1/beta2
Component: Servers/net_server Version: R1/Development
Keywords: Cc:
Blocked By: Blocking: #10921
Platform: All

Description

Excerpt from syslog:

DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST to 255.255.255.255:67
KERN: Last message repeated 4 times.
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_OFFER from 192.168.0.1
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_NACK from 192.168.0.1
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_DISCOVER to 255.255.255.255:67
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST to 255.255.255.255:67
DAEMON 'DHCP': /dev/net/ipro1000/0: Timeout shift: 500000 secs (try 1)
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST to 255.255.255.255:67
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_OFFER from 192.168.0.1
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_NACK from 192.168.0.1
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_DISCOVER to 255.255.255.255:67
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST to 255.255.255.255:67
DAEMON 'DHCP': /dev/net/ipro1000/0: Timeout shift: 500000 secs (try 1)
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST to 255.255.255.255:67
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_OFFER from 192.168.0.1
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_NACK from 192.168.0.1
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_DISCOVER to 255.255.255.255:67

And this continued for ~40s (I note it did not acutally sleep "500000 secs" at any point here) before it finished configuring:

DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_OFFER from 192.168.0.1
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_ACK from 192.168.0.1
DAEMON 'DHCP':   server: 192.168.0.1
DAEMON 'DHCP':   subnet: 255.255.255.0
DAEMON 'DHCP':   gateway: 192.168.0.1
DAEMON 'DHCP':   domain name: ""
DAEMON 'DHCP':   nameserver[0]: 192.168.1.1
DAEMON 'DHCP':   nameserver[1]: 192.168.0.1
DAEMON 'DHCP':   lease time: 604800 seconds
DAEMON 'DHCP':   renewal time: 604780 seconds
DAEMON 'DHCP':   rebinding time: 604780 seconds
DAEMON 'DHCP': /dev/net/ipro1000/0: DHCP status = No error

Assigning to tqh as he last changed this code (and may have caused this problem.)

Change History (16)

comment:1 by tqh, 6 years ago

Owner: changed from tqh to waddlesplash

Please do not autoassign, you assign it to someone when someone is working on it, not to someone who has the last commit. IMO it's rude. Other ask or tag that person. Also this is about not getting DHCP_ACK, which is more likely a network issue like dropped packets. Which is also more likely since a lot of drivers got updated recently.

comment:2 by waddlesplash, 6 years ago

Well, it seems the timeout system you installed is not working, as it clearly is not sleeping for 500000sec, so could you at least let me know what you know about that?

comment:3 by tqh, 6 years ago

Are you just assuming it is broken or can you point where in the commit the issue lies? Seriously, do proper investigation or leave the issue until someone has time to take it. Don't go and demand that people with little spare time, will just stop and do it for you.

comment:4 by waddlesplash, 6 years ago

Are you just assuming it is broken or can you point where in the commit the issue lies?

The premises that I assumed were as follows:

  • The syslog indicates a timeout of 500000 seconds
  • This timeout clearly did not take place because it immediately tries again
  • tqh was the last person to touch this code, and specifically the timeout section of it

And so from these premises I conclude that one of the following must be true:

  1. tqh's last commit is somehow the culprit (although in this case I don't see how it is)
  2. tqh modified this specific subsystem recently, and thus very likely knows something about the nuances of it

I'm sorry if I'm coming across as "please solve my problem for me", that is not my intent in the slightest. I'm noting that there is a very strange behavior going on, of which there appears to be at the very least one bug (500000 seconds claimed as a timeout -- either the log print is wrong or the timeout selector is wrong) and that you likely know a lot about it, and so perhaps even a 3-4 sentece comment from you could be worth 2-3 hours of my (or someone else's) time to discover the same things.

I guess our miscommunication here stems from "assigning," which I (and I don't think I'm alone in this) take to mean "person who likely knows the most about this code, and possibly either has time to fix or at least time to give guidance and pass the baton to someone else". I suppose I was too ambiguous in my meaning, and I can see how you interpreted that as "I am lazy and want someone to fix this code for me", which was absolutely not my intent, so I'm sorry.

comment:5 by tqh, 6 years ago

If you assume bad things like that, then first revert the commit. Do something read the code at least have a clue.

Looked at it, the debug message should say usec or be divided by 1000000. Other than that it tries until socket timeout or MAX_TRIES. Still it's more likely that network is flaky for UDP as you never get the ACK.

comment:6 by tqh, 6 years ago

Assigning is usually ordering someone to work on it. At least in any plave I've ever worked. Otherwise you write 'waddlesplash, can you please take a look at this. I think you may know something'. Instead of here, work on this, your code is broken style you use.

comment:7 by waddlesplash, 6 years ago

Well, this is open source, nobody really orders anybody around (lol) so I didn't really see a contradiction there. I'll be sure to use CC more often then.

comment:8 by tqh, 6 years ago

No but your use of assignment is really incorrect. Assignment is so you know who is working on it, so others don't. And it will take a long time before I have time, so it should be up for grabs instead of set to me. If someone should assign it to someone it is the maintainer for this component or someone who knows the code well.

You definetly don't use assignment for discussion! Then it would be assigments back and forth all the time. Let auto-assignment handle things unless you are sure who is gonna be the one to do it. I get all ACPI ones, and I try to solve them as soon as possible. So stop abusing assignment, use mentions/cc and comments like it is intended.

When you are handling a release-branch and need prompt fixes to be handled then assignment is the right tool..

Please don't just ignore it and lol about it. Learn to use it correctly!

comment:9 by pulkomandy, 6 years ago

We are getting NAK from the router/DHCP server, which apparently leads to an immediate retry. The timeout would be used only when there is no reply at all from the router.

We should probably handle NAK differently.

Version 0, edited 6 years ago by pulkomandy (next)

comment:10 by tqh, 6 years ago

First round has a short timout as it seems DHCP sends requests very early and is a bit of hit or miss. Before it was almost always a 4s penalty before network was up. Timeout will happen if no response in reasonable time (socket timeout), or errors are returned (!=B_OK).

https://github.com/haiku/haiku/blob/3bbff30d9e5aa58c6f128300c2372db42c840191/src/servers/net/DHCPClient.cpp#L588 for the bigger state handling.

You are probably seeing this shift: https://github.com/haiku/haiku/blob/3bbff30d9e5aa58c6f128300c2372db42c840191/src/servers/net/DHCPClient.cpp#L733

comment:11 by ThomasWinwood, 6 years ago

I believe I'm also experiencing this bug, though with the ipro100 driver. I have two pieces of information that may be useful.

Firstly, the security log from my router shows these two messages over and over again.

12/26/2017  06:00:01 sending ACK to 192.168.2.7
12/26/2017  06:00:01 sending OFFER to 192.168.2.7

Secondly, the Haiku machine is being assigned multiple IP addresses with flipped bits in the MAC address.

192.168.2.6 	shredder 	00-08-02-8F-71-13
192.168.2.7 	shredder 	00-08-02-0F-71-13
192.168.2.8 	shredder 	00-08-02-0F-71-93

(The correct MAC address is the central one.)

comment:12 by tqh, 6 years ago

I believe the ACK is lost in our stack or in the DHCP-client. Most likely the stack. The DHCP-changes was a quick fix to retry faster. Not to fix the bigger problem. I thought it was because of wifi but seems to be general.

comment:13 by waddlesplash, 6 years ago

Blocking: 10921 added

comment:14 by waddlesplash, 6 years ago

It seems there are some scenarios (Diver mentioned one in #10921, and I think I've seen it too) where restarting net_server fixes this.

comment:15 by waddlesplash, 5 years ago

Resolution: fixed
Status: assignedclosed

I haven't seen this in a while (probably the ARP changes indeed fixed the remaining problems), so, closing as fixed. Anyone with still-reproducible issues should open a new ticket.

comment:16 by nielx, 4 years ago

Milestone: UnscheduledR1/beta2

Assign tickets with status=closed and resolution=fixed within the R1/beta2 development window to the R1/beta2 Milestone

Note: See TracTickets for help on using tickets.