Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#12208 closed bug (fixed)

[Network Kit] DHCP is broken since hrev49401

Reported by: diver
Owned by: pulkomandy
Priority: normal
Milestone: R1/beta1
Component: Kits/Network Kit
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking: #11275, #12223, #12235
Has a Patch: no
Platform: All

Description

After updating hrev49339 to hrev49401 DHCP stopped working completely. Tested in VirtualBox 4.1.28 and 5.0.0 on two different laptops.

Attachments (9)

syslog (255.3 KB ) - added by vidrep 4 years ago.
listdev (2.1 KB ) - added by vidrep 4 years ago.
screenshot1.png (50.4 KB ) - added by vidrep 4 years ago.
screenshot2.png (45.3 KB ) - added by vidrep 4 years ago.
screenshot3.png (48.3 KB ) - added by vidrep 4 years ago.
screenshot4.png (56.6 KB ) - added by vidrep 4 years ago.
screenshot5.png (99.6 KB ) - added by vidrep 4 years ago.
screenshot6.png (86.7 KB ) - added by vidrep 4 years ago.
1.png (48.6 KB ) - added by kallisti5 4 years ago.
result of BNetworkAddress ToString


Change History (59)

comment:1 by luroh, 4 years ago

Fwiw, a gcc2h hrev49404 build works fine here in VBox 5.0.0.

comment:2 by diver, 4 years ago

I have bridged network in VM settings. And you?

comment:3 by luroh, 4 years ago

Ah, I use NAT, Intel PRO/1000 MT Desktop adapter.

comment:4 by diver, 4 years ago

Ok, NAT works here as well.

comment:5 by vidrep, 4 years ago

DHCP is broken here on real hardware. Manually setting network preferences in terminal does not work either.

comment:6 by diver, 4 years ago

This is what I get when I switch from Bridged mode to NAT:

DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST for 10.0.2.15 to 255.255.255.255:67
KERN: [ipro1000] (lem) Link is up 1000 Mbps Full Duplex
KERN: /dev/net/ipro1000/0: media change, media 0x900030 quality 1000 speed 1000000000
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_NACK from 192.168.0.1
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_DISCOVER to 255.255.255.255:67
DAEMON 'DHCP': /dev/net/ipro1000/0: Received DHCP_OFFER from 192.168.0.1
DAEMON 'DHCP':   your_address: 192.168.0.6
DAEMON 'DHCP':   server: 192.168.0.1
DAEMON 'DHCP':   lease time: 25200 seconds
DAEMON 'DHCP':   nameserver[0]: 192.168.0.1
DAEMON 'DHCP':   gateway: 192.168.0.1
DAEMON 'DHCP':   subnet: 255.255.255.0
DAEMON 'DHCP':   UNKNOWN OPTION 252 (0xfc)
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST for 192.168.0.6 to 255.255.255.255:67
DAEMON 'DHCP': /dev/net/ipro1000/0: Timeout shift: 8 secs (try 1)
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST for 192.168.0.6 to 255.255.255.255:67
DAEMON 'DHCP': /dev/net/ipro1000/0: Timeout shift: 16 secs (try 2)
DAEMON 'DHCP': /dev/net/ipro1000/0: Send DHCP_REQUEST for 192.168.0.6 to 255.255.255.255:67

comment:7 by vidrep, 4 years ago

The network connection is broken because the /boot/system/settings/network/services file is empty. I replaced it with one I copied from a working Haiku partition, which restored network functionality.

comment:8 by vidrep, 4 years ago

After a reboot, I lost network again. Checked my settings - everything looks OK. As soon as I disabled the SSH server, network started working again.

comment:9 by pulkomandy, 4 years ago

Reporting my comments from IRC here: in bridged mode, the DHCP is handled by whatever DHCP server is installed on your real network. In NAT mode, it is handled by vbox.

I suspect there is a timing problem or something similar making DHCP not work well in certain configurations. For example, with one laptop I can connect to my home wifi network, but not with another. And this second laptop will work on some other wifi networks.

A trace of what gets sent and received on the network (you can use tcpdump in Haiku, and tcpdump or wireshark on the host) would be useful. Also, more information about your network: why is Haiku trying the IP 10.0.2.15? Apparently your DHCP server is in 192.168.0.x, we adjust to this, but then the server seems to be ignoring our DHCP_REQUESTs? Or is it that the reply of the server is lost somewhere on the way?

comment:10 by diver, 4 years ago

10.0.2.15 is an address VirtualBox DHCP server (10.0.2.2) is assigning to Haiku guest in NAT mode. 192.168.0.6 is an address my router (192.168.0.1) assigned to Haiku while it was still in bridged mode.

comment:11 by vidrep, 4 years ago

I did a fresh install of hrev49408_gcc2 to a hard drive partition. Attached are:

  • a syslog of the first boot after installation
  • a listdev of my hardware
  • screenshot1, screenshot2 and screenshot3, showing the Network preferences after installation
  • screenshot4, showing the contents of the /boot/system/settings/network directory

Note that "interfaces" and "resolv.conf" are both missing.

by vidrep, 4 years ago

Attachment: syslog added

by vidrep, 4 years ago

Attachment: listdev added

by vidrep, 4 years ago

Attachment: screenshot1.png added

by vidrep, 4 years ago

Attachment: screenshot2.png added

by vidrep, 4 years ago

Attachment: screenshot3.png added

by vidrep, 4 years ago

Attachment: screenshot4.png added

comment:12 by vidrep, 4 years ago

I can restore network function by copying resolv.conf from a working Haiku partition, then executing the following commands in terminal:

ifconfig /dev/net/ipro1000/0 192.168.1.67
route add /dev/net/ipro1000/0 default gw 192.168.1.254

Now, the "interfaces" file magically appears in the network settings directory (screenshot5)

by vidrep, 4 years ago

Attachment: screenshot5.png added

comment:13 by vidrep, 4 years ago

After a reboot network is lost again. Opened tcpdump in terminal (screenshot6)

by vidrep, 4 years ago

Attachment: screenshot6.png added

comment:14 by mmu_man, 4 years ago

Same here. Rebooting to the previous state (from the boot menu), editing the Haiku repository file to point to the hrev49400, and doing a pkgman full-sync seems to fix it for me, so it really seems to come from hrev49401.

comment:15 by pulkomandy, 4 years ago

Yes, but why? The DHCP client code does not use getaddrinfo, so how can it be affected by this?

A Wireshark capture of each case (working/not working) may help in understanding the difference.

comment:16 by diver, 4 years ago

Blocking: 12223 added

(In #12223) A dupe of #12208.

in reply to:  1 comment:17 by arfonzo, 4 years ago

Replying to luroh:

Fwiw, a gcc2h hrev49404 build works fine here in VBox 5.0.0.

FWIW, on gcc2h hrev49404 in bridged mode with VirtualBox 5, I get the same problem as reported here. NAT mode works fine for me.

Furthermore, I was unable to define a static IP. I can set it, but no network connection is picked up despite that.

I also have seen similar behavior to Pulkomandy's comments, where it seems to sometimes work on WiFi, and sometimes not. This is with the same WiFi network, which is fine for all other devices and OSes.

comment:18 by vidrep, 4 years ago

I find if you try to enter a static IP address into the network preferences and hit "apply" the settings do not take. However, if you set it up from the terminal, all is well. Don't switch to DHCP or your settings will be erased again upon reboot.

comment:19 by Luposian, 4 years ago

Can you explain how to set it up from the terminal? So, what, exactly, is causing this issue, if anyone knows? How did it happen?


in reply to:  19 comment:20 by vidrep, 4 years ago

Replying to Luposian:

Can you explain how to set it up from the terminal? So, what, exactly, is causing this issue, if anyone knows? How did it happen?

See the source activity log for hrev49293, where they switched from libbind to netresolv. I believe Axel will be looking into it once the issues with the new launch daemon are sorted out. In the meantime you can use hrev49292, which was the last build before the change.

From the terminal:

ifconfig <interface> <address>
e.g. ifconfig /dev/net/ipro1000/0 192.168.1.67

route add <interface> default gw <gateway address>
e.g. route add /dev/net/ipro1000/0 default gw 192.168.1.254

You will also need to enter your DNS server address(es) in /boot/system/settings/network/resolv.conf

comment:21 by waddlesplash, 4 years ago

We've found the bug, and it's a problem in the DHCP client. How this ever worked we have no clue at all. Fix coming soon...

comment:22 by diver, 4 years ago

Blocking: 12235 added

comment:23 by pulkomandy, 4 years ago

waddlesplash: did you really find the bug?

We discussed this on IRC with kallisti5, but I could not get him to upload a capture of the only useful packet: the DHCP OFFER from the server. So, everything is only speculation so far.

We need a complete capture of all the packets in the DHCP negotiation. tcpdump only shows the headers with the default settings, which is not enough. Wireshark screenshots like kallisti5 did are better, but he missed one of the packets (which comes from the server).

comment:24 by tqh, 4 years ago

It fails on NACK on my machine, OpenWRT router.

DAEMON 'DHCP': /dev/net/iprowifi4965/0: Send DHCP_DISCOVER to 255.255.255.255:67
DAEMON 'DHCP': /dev/net/iprowifi4965/0: Received DHCP_OFFER from 192.168.9.1
DAEMON 'DHCP':   your_address: 192.168.9.210
DAEMON 'DHCP':   server: 192.168.9.1
DAEMON 'DHCP':   lease time: 43200 seconds
DAEMON 'DHCP':   renewal time: 21600 seconds
DAEMON 'DHCP':   rebinding time: 37800 seconds
DAEMON 'DHCP':   subnet: 255.255.255.0
DAEMON 'DHCP':   broadcast: 192.168.9.255
DAEMON 'DHCP':   gateway: 192.168.9.1
DAEMON 'DHCP':   nameserver[0]: 192.168.9.1
DAEMON 'DHCP':   domain name: "lan"
DAEMON 'DHCP': /dev/net/iprowifi4965/0: Send DHCP_REQUEST for 192.168.9.210 to 255.255.255.255:67
DAEMON 'DHCP': /dev/net/iprowifi4965/0: Received DHCP_NACK from 192.168.9.1

comment:25 by pulkomandy, 4 years ago

Again, a complete capture of all the packets (DISCOVER, OFFER, REQUEST, and NACK/ACK if any) would be helpful. It can't be that hard to capture 4 ethernet packets with wireshark and upload the pcap file?

comment:26 by tqh, 4 years ago

And here is the DHCP server log:

Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 available DHCP range: 192.168.9.100 -- 192.168.9.249
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 client provides name: shredder
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 DHCPDISCOVER(br-lan) c4:85:08:45:36:ca 
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 tags: lan, br-lan
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 DHCPOFFER(br-lan) 192.168.9.210 c4:85:08:45:36:ca 
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 requested options: 1:netmask, 3:router, 6:dns-server, 28:broadcast, 
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 requested options: 15:domain-name
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 next server: 192.168.9.1
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 sent size:  1 option: 53 message-type  2
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 sent size:  4 option: 54 server-identifier  192.168.9.1
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 sent size:  4 option: 51 lease-time  12h
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 sent size:  4 option: 58 T1  6h
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 sent size:  4 option: 59 T2  10h30m
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 sent size:  4 option:  1 netmask  255.255.255.0
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 sent size:  4 option: 28 broadcast  192.168.9.255
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 sent size:  4 option:  3 router  192.168.9.1
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 sent size:  4 option:  6 dns-server  192.168.9.1
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 sent size:  3 option: 15 domain-name  lan
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 available DHCP range: 192.168.9.100 -- 192.168.9.249
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 client provides name: shredder
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 DHCPREQUEST(br-lan) 192.168.9.210 c4:85:08:45:36:ca 
Jul 27 17:59:51 dnsmasq-dhcp[3318]: 204252940 DHCPNAK(br-lan) 192.168.9.210 c4:85:08:45:36:ca wrong server-ID
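
The "wrong server-ID" in the last line suggests that the server identifier (option 54) carried in the DHCPREQUEST did not match the one sent in the DHCPOFFER. As a minimal sketch of where that value lives, the following C++ walks a DHCP options area using the code/length/data layout from RFC 2132. FindServerIdentifier is a hypothetical helper written for illustration only; it is not taken from DHCPClient.cpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Option codes as defined in RFC 2132.
constexpr uint8_t kOptionPad = 0;
constexpr uint8_t kOptionServerIdentifier = 54;
constexpr uint8_t kOptionEnd = 255;

// Hypothetical helper (not part of DHCPClient.cpp): walks the options area of
// a DHCP message (the bytes following the magic cookie) and returns the server
// identifier as a dotted quad, or an empty string if it is absent or truncated.
std::string
FindServerIdentifier(const std::vector<uint8_t>& options)
{
	size_t i = 0;
	while (i < options.size()) {
		uint8_t code = options[i++];
		if (code == kOptionPad)
			continue;	// padding is a single byte with no length field
		if (code == kOptionEnd)
			break;
		if (i >= options.size())
			break;		// truncated: the length byte is missing
		size_t length = options[i++];
		if (length > options.size() - i)
			break;		// truncated: the data runs past the buffer
		if (code == kOptionServerIdentifier && length == 4) {
			return std::to_string(static_cast<int>(options[i])) + "."
				+ std::to_string(static_cast<int>(options[i + 1])) + "."
				+ std::to_string(static_cast<int>(options[i + 2])) + "."
				+ std::to_string(static_cast<int>(options[i + 3]));
		}
		i += length;	// skip the data of any other option
	}
	return std::string();
}
```

Fed the first options of the OFFER logged above (message type 2, server identifier 192.168.9.1, lease time 12h), the helper returns "192.168.9.1". A client that echoes any other value in its REQUEST can expect the NAK shown in the log.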

(Pulkomandy, you are out of the loop. Please be nice instead of just polluting bug reports. The bugs are for logging info on an issue. We are working on it in IRC.)

comment:27 by kallisti5, 4 years ago

Guys, the cast is broken... here's proof:

diff --git a/src/servers/net/DHCPClient.cpp b/src/servers/net/DHCPClient.cpp
index b3d4c1a..d6d1144 100644
--- a/src/servers/net/DHCPClient.cpp
+++ b/src/servers/net/DHCPClient.cpp
@@ -766,6 +766,8 @@ DHCPClient::_ParseOptions(dhcp_message& message, BMessage& address,
                                syslog(LOG_DEBUG, "  server: %s\n",
                                        _AddressToString(data).String());
                                fServer.SetAddress(*(in_addr_t*)data);
+                               syslog(LOG_DEBUG, "  server set: %s\n",
+                                       fServer.ToString().String());
                                break;
 
                        case OPTION_ADDRESS_LEASE_TIME:

result attached

by kallisti5, 4 years ago

Attachment: 1.png added

result of BNetworkAddress ToString

comment:28 by pdziepak, 4 years ago

As I said on IRC, it doesn't prove anything yet. There is also BNetworkAddress involved. The only thing wrong with this cast is that there is a special place in hell reserved for people using C-style casts in C++ code ;)

comment:29 by pdziepak, 4 years ago

It looks like BNetworkAddress::SetTo() and/or the constructor may be the culprit. The DHCP client constructs fServer with AF_INET, a nullptr host and the DHCP port (cf. http://cgit.haiku-os.org/haiku/tree/src/servers/net/DHCPClient.cpp#n430 ). Then it expects SetAddress() never to fail. However, SetAddress() expects the address family to be set properly, which may not happen if the insides of SetTo() try to be smarter than they are expected to be. Someone with more free time will have to investigate this more closely.

comment:30 by tqh, 4 years ago

Verified, SetAddress() fails on my machine.

comment:31 by tqh, 4 years ago

Changing:

fServer.SetAddress(*(in_addr_t*)data);

to

fServer.SetTo(*(in_addr_t*)data, DHCP_SERVER_PORT);

works around the issue. BNetworkAddressResolver doesn't seem to work the same as before. getaddrinfo() will always return an error for a NULL host, so the family and port that were set in the DHCPClient constructor are forgotten. Other than that, BNetworkAddress and BNetworkAddressResolver seem quite complex and perform duplicate InitCheck()s.
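
The failure mode described in comments 29 to 31 can be modeled with a small stand-in class. This is only a sketch under stated assumptions: MockAddress and its resolverAcceptsNullHost flag are hypothetical and exist purely to illustrate the reasoning. They are not the real BNetworkAddress API, and the real SetAddress() may not report failure this way.

```cpp
#include <cstdint>

// Hypothetical stand-in for BNetworkAddress. It only models the behavior
// described in this ticket and is NOT the real Network Kit class.
enum Family { kFamilyUnspecified, kFamilyInet };

class MockAddress {
public:
	// Models construction with a family, a NULL host and a port. If the
	// resolver rejects the NULL host (the assumed new behavior), the object
	// stays uninitialized and the requested family and port are lost.
	MockAddress(Family family, uint16_t port, bool resolverAcceptsNullHost)
	{
		if (resolverAcceptsNullHost) {
			fFamily = family;
			fPort = port;
		}
	}

	// Models SetAddress(): only works when the family is already set.
	bool SetAddress(uint32_t address)
	{
		if (fFamily != kFamilyInet)
			return false;
		fAddress = address;
		return true;
	}

	// Models SetTo(address, port): sets family, address and port together,
	// so it does not depend on what the constructor managed to do.
	void SetTo(uint32_t address, uint16_t port)
	{
		fFamily = kFamilyInet;
		fAddress = address;
		fPort = port;
	}

	uint32_t Address() const { return fAddress; }
	uint16_t Port() const { return fPort; }

private:
	Family fFamily = kFamilyUnspecified;
	uint32_t fAddress = 0;
	uint16_t fPort = 0;
};
```

In this model, an object built while the resolver rejects a NULL host refuses SetAddress() and keeps address 0, whereas SetTo() succeeds in both cases. That would be consistent with the workaround above and with the server rejecting the REQUEST for a wrong server ID.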

comment:32 by axeld, 4 years ago

The only thing changed with hrev49401 is the BNetworkAddressResolver backend -- IOW the error (maybe just now revealed) has something to do with that, as it used to work just fine before.

BTW we do have unit tests at least for BNetworkAddress. Please, before you do anything else, enhance those tests to reproduce the bug you are experiencing.

comment:33 by tqh, 4 years ago

Is it as simple as adding NetAddressTest to the image?

comment:34 by pulkomandy, 4 years ago

The unit tests are run inside the build tree, they are not installable.

Start with a Haiku install with a git checkout of the Haiku sources.

jam -q unittests will compile the unitTester executable and all the tests as add-ons for it. You may need to add a symlink to libcppunit.so (also built during this process) next to the unitTester executable (IIRC in generated/tests), then run for example unitTester network (the unitTester executable has a help message and can list the available tests).

comment:35 by tqh, 4 years ago

My development environment is in Linux, so if the tests don't work there or can't be installed, I don't think I'll be able to do this.

comment:36 by axeld, 4 years ago

How would that be possible? They are unit tests for Haiku, not Linux, so they have to be run from Haiku. They do, of course, also work in a VM, though. So if you are Linux only, you would need to run Haiku in a VM, check out the source tree, and build the unit tests there.

That's how it'll run on our buildbots, too, once we manage to integrate it.

comment:37 by pulkomandy, 4 years ago

They could be built on Linux and included in the generated image so you can run them. Actually it would make sense to do it that way on the buildbots.

comment:38 by tqh, 4 years ago

My point exactly. I run the Haiku build in a VM, so why can't I cross-compile tests like any other app? I'll mess a bit with unit tests, but I'll leave this for anyone with an already working environment for this. I don't want several trees, and for ACPI and EFI my current setup is needed.

comment:39 by tqh, 4 years ago

So this in build/jam/UserBuildConfig works for running this bug's unit test:

AddFilesToHaikuImage home tests : UnitTester ;
AddFilesToHaikuImage home tests lib : libcppunit.so libnetapitest.so ;

comment:40 by tqh, 4 years ago

Wrote some unit tests for the cases, but they work fine. Puckipedia suggested it is because the VM has an internet connection. So:

  • Disabled network in QEMU (-net none): the unit tests take forever and more tests fail.
  • Disabled the network interface in Haiku: more tests fail, but instantly.

Seems to match that some report hanging and some (me) report failed init check.

Not something I'm capable of fixing.

comment:41 by waddlesplash, 4 years ago

Blocking: 11275 added

(In #11275) Duplicate of #12208. Sorry for not noticing this before!

comment:42 by waddlesplash, 4 years ago

With hrev49476, DHCP is now busted in VirtualBox 100% of the time for me. Progress!

comment:43 by waddlesplash, 4 years ago

Actually, wait, no it isn't. It does stick on "Configuring" with the DHCP status page showing nothing, but after a lot of gunk from DHCP goes through the syslog, it works. Weird...?

comment:44 by pulkomandy, 4 years ago

Resolution: fixed
Status: new → closed

Fixed in hrev49477.

comment:45 by diver, 4 years ago

Confirmed fixed. Thanks!

comment:46 by vidrep, 4 years ago

I did a pkgman full-sync to hrev49477 last night (on real hardware) and I still do not have DHCP. I'll download an anyboot image tonight and do a fresh install to another hard drive partition to see if it makes a difference.

comment:47 by pulkomandy, 4 years ago

If you still have problems, please reopen a ticket, as the issue reported here was fixed. Please include all the relevant information:

  • IP address of the server
  • Syslog
  • Wireshark capture of the DHCP packets (you can type "bootp" in the filter field in Wireshark and press "Apply" to get only the DHCP packets) - this can be done from any Linux/Windows/OSX system on the same network since the DHCP packets are sent in broadcast.

comment:48 by diver, 4 years ago

You probably wanted to say - open a new ticket? :)

in reply to:  47 comment:49 by vidrep, 4 years ago

Replying to pulkomandy:

If you still have problems, please reopen a ticket, as the issue reported here was fixed. Please include all the relevant information:

  • IP address of the server
  • Syslog
  • Wireshark capture of the DHCP packets (you can type "bootp" in the filter field in Wireshark and press "Apply" to get only the DHCP packets) - this can be done from any Linux/Windows/OSX system on the same network since the DHCP packets are sent in broadcast.

Thanks for the detailed and concise instructions on how to troubleshoot any potential network issues. I'll give the latest nightly a try. Hopefully, I won't have to report anything further.

comment:50 by vidrep, 4 years ago

Verified working with a fresh install of hrev49477 x86_gcc2. Strange that a pkgman update to the same revision did not work.
