Opened 16 months ago

Closed 2 months ago

#13789 closed bug (fixed)

DHCP client loops for a long time for no apparent reason

Reported by: waddlesplash Owned by: waddlesplash
Priority: normal Milestone: Unscheduled
Component: Servers/net_server Version: R1/Development
Keywords: Cc:
Blocked By: Blocking: #10921
Has a Patch: no Platform: All


Excerpt from syslog:

DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST to
KERN: Last message repeated 4 times.
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_OFFER from
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_NACK from
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_DISCOVER to
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST to
DAEMON 'DHCP': /dev/net/ipro1000/0: Timeout shift: 500000 secs (try 1)
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST to
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_OFFER from
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_NACK from
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_DISCOVER to
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST to
DAEMON 'DHCP': /dev/net/ipro1000/0: Timeout shift: 500000 secs (try 1)
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST to
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_OFFER from
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_NACK from
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_DISCOVER to

And this continued for ~40s (I note it did not acutally sleep "500000 secs" at any point here) before it finished configuring:

DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_OFFER from
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_ACK from
DAEMON 'DHCP':   server:
DAEMON 'DHCP':   subnet:
DAEMON 'DHCP':   gateway:
DAEMON 'DHCP':   domain name: ""
DAEMON 'DHCP':   nameserver[0]:
DAEMON 'DHCP':   nameserver[1]:
DAEMON 'DHCP':   lease time: 604800 seconds
DAEMON 'DHCP':   renewal time: 604780 seconds
DAEMON 'DHCP':   rebinding time: 604780 seconds
DAEMON 'DHCP': /dev/net/ipro1000/0: DHCP status = No error

Assigning to tqh as he last changed this code (and may have caused this problem.)

Change History (15)

comment:1 Changed 16 months ago by tqh

Owner: changed from tqh to waddlesplash

Please do not autoassign, you assign it to someone when someone is working on it, not to someone who has the last commit. IMO it's rude. Other ask or tag that person. Also this is about not getting DHCP_ACK, which is more likely a network issue like dropped packets. Which is also more likely since a lot of drivers got updated recently.

comment:2 Changed 16 months ago by waddlesplash

Well, it seems the timeout system you installed is not working, as it clearly is not sleeping for 500000sec, so could you at least let me know what you know about that?

comment:3 Changed 16 months ago by tqh

Are you just assuming it is broken or can you point where in the commit the issue lies? Seriously, do proper investigation or leave the issue until someone has time to take it. Don't go and demand that people with little spare time, will just stop and do it for you.

comment:4 Changed 16 months ago by waddlesplash

Are you just assuming it is broken or can you point where in the commit the issue lies?

The premises that I assumed were as follows:

  • The syslog indicates a timeout of 500000 seconds
  • This timeout clearly did not take place because it immediately tries again
  • tqh was the last person to touch this code, and specifically the timeout section of it

And so from these premises I conclude that one of the following must be true:

  1. tqh's last commit is somehow the culprit (although in this case I don't see how it is)
  2. tqh modified this specific subsystem recently, and thus very likely knows something about the nuances of it

I'm sorry if I'm coming across as "please solve my problem for me", that is not my intent in the slightest. I'm noting that there is a very strange behavior going on, of which there appears to be at the very least one bug (500000 seconds claimed as a timeout -- either the log print is wrong or the timeout selector is wrong) and that you likely know a lot about it, and so perhaps even a 3-4 sentece comment from you could be worth 2-3 hours of my (or someone else's) time to discover the same things.

I guess our miscommunication here stems from "assigning," which I (and I don't think I'm alone in this) take to mean "person who likely knows the most about this code, and possibly either has time to fix or at least time to give guidance and pass the baton to someone else". I suppose I was too ambiguous in my meaning, and I can see how you interpreted that as "I am lazy and want someone to fix this code for me", which was absolutely not my intent, so I'm sorry.

comment:5 Changed 16 months ago by tqh

If you assume bad things like that, then first revert the commit. Do something read the code at least have a clue.

Looked at it, the debug message should say usec or be divided by 1000000. Other than that it tries until socket timeout or MAX_TRIES. Still it's more likely that network is flaky for UDP as you never get the ACK.

comment:6 Changed 16 months ago by tqh

Assigning is usually ordering someone to work on it. At least in any plave I've ever worked. Otherwise you write 'waddlesplash, can you please take a look at this. I think you may know something'. Instead of here, work on this, your code is broken style you use.

comment:7 Changed 16 months ago by waddlesplash

Well, this is open source, nobody really orders anybody around (lol) so I didn't really see a contradiction there. I'll be sure to use CC more often then.

comment:8 Changed 16 months ago by tqh

No but your use of assignment is really incorrect. Assignment is so you know who is working on it, so others don't. And it will take a long time before I have time, so it should be up for grabs instead of set to me. If someone should assign it to someone it is the maintainer for this component or someone who knows the code well.

You definetly don't use assignment for discussion! Then it would be assigments back and forth all the time. Let auto-assignment handle things unless you are sure who is gonna be the one to do it. I get all ACPI ones, and I try to solve them as soon as possible. So stop abusing assignment, use mentions/cc and comments like it is intended.

When you are handling a release-branch and need prompt fixes to be handled then assignment is the right tool..

Please don't just ignore it and lol about it. Learn to use it correctly!

comment:9 Changed 16 months ago by pulkomandy

We are getting NAK from the router/DHCP server, which apparently leads to an immediate retry. The timeout would be used only when there is no reply at all from the router (that's the point of a timeout, right?).

We should probably handle NACK differently, maybe retry immediately the first time (in case the server just wants us to forget the previously assigned address) and then as we get more and more NACKs, wait a little more when retrying. This could be handled in a way similar to the existing timeout system, but it is a different thing, as we are getting messages from the server.

Last edited 16 months ago by pulkomandy (previous) (diff)

comment:10 Changed 16 months ago by tqh

First round has a short timout as it seems DHCP sends requests very early and is a bit of hit or miss. Before it was almost always a 4s penalty before network was up. Timeout will happen if no response in reasonable time (socket timeout), or errors are returned (!=B_OK). for the bigger state handling.

You are probably seeing this shift:

comment:11 Changed 15 months ago by ThomasWinwood

I believe I'm also experiencing this bug, though with the ipro100 driver. I have two pieces of information that may be useful.

Firstly, the security log from my router shows these two messages over and over again.

12/26/2017  06:00:01 sending ACK to
12/26/2017  06:00:01 sending OFFER to

Secondly, the Haiku machine is being assigned multiple IP addresses with flipped bits in the MAC address. 	shredder 	00-08-02-8F-71-13 	shredder 	00-08-02-0F-71-13 	shredder 	00-08-02-0F-71-93

(The correct MAC address is the central one.)

comment:12 Changed 15 months ago by tqh

I believe the ACK is lost in our stack or in the DHCP-client. Most likely the stack. The DHCP-changes was a quick fix to retry faster. Not to fix the bigger problem. I thought it was because of wifi but seems to be general.

comment:13 Changed 8 months ago by waddlesplash

Blocking: 10921 added

comment:14 Changed 8 months ago by waddlesplash

It seems there are some scenarios (Diver mentioned one in #10921, and I think I've seen it too) where restarting net_server fixes this.

comment:15 Changed 2 months ago by waddlesplash

Resolution: fixed
Status: assignedclosed

I haven't seen this in a while (probably the ARP changes indeed fixed the remaining problems), so, closing as fixed. Anyone with still-reproducible issues should open a new ticket.

Note: See TracTickets for help on using tickets.