#5668 closed bug (fixed)
net_server often crashes after boot
Reported by: | luroh | Owned by: | axeld |
---|---|---|---|
Priority: | normal | Milestone: | R1 |
Component: | Kits/Application Kit | Version: | R1/Development |
Keywords: | Cc: | ||
Blocked By: | Blocking: | #4470 | |
Platform: | x86 |
Description
net_server often crashes a few seconds after booting Haiku while testing wifi (ipw3945abg) with no other network interfaces present.
Core Duo laptop, rev 36010, gcc4.
Attachments (10)
Change History (41)
by , 15 years ago
Attachment: | gdb_net_server.txt added |
---|
comment:1 by , 15 years ago
comment:2 by , 15 years ago
Ok, let me know if there is anything I can do to help pin it down. The crash is pretty easily repeatable for me, it occurs about every other boot or so.
comment:3 by , 15 years ago
Revision 36099 and the problem has unfortunately gotten worse - net_server now consistently crashes, presumably at the point where the wifi card is trying to apply the settings received from the access point's DHCP server. Using an older version of setwep does not help.
follow-up: 5 comment:4 by , 15 years ago
Great! Having a reproducible problem is the best starting point for fixing it :-)
There doesn't seem to be any locking problem in the code; in fact the server doesn't play with its locking at all. Can you add debug output to the server that helps finding out who unlocks it?
comment:5 by , 15 years ago
Replying to axeld:
Can you add debug output to the server that helps finding out who unlocks it?
Not without instructions or, preferably, a patch.
comment:7 by , 15 years ago
Another look at it revealed that the net_server isn't even the problem; locking the looper list fails which is pretty strange. Could be memory corruption, but could be a lot else, too.
I'm attaching a patch that hopefully writes a clue to the syslog.
by , 15 years ago
Attachment: | looperlist.diff added |
---|
additional debug output in BLooperList::AssertLocked()
by , 15 years ago
Attachment: | gdb_net_server_36237.txt added |
---|
by , 15 years ago
Attachment: | syslog_36237.txt added |
---|
comment:9 by , 15 years ago
Sorry, the two logs from rev 36237 were actually generated with an ipro1000 Ethernet device present but not plugged in. I forgot to remove the symlink. I'll resume proper testing as per the original bug report on Friday when I get back home, where I have access to WEP encrypted wifi (there's only WPA available at my current location).
comment:10 by , 15 years ago
The crash in that syslog looks like something ran out of memory, or the BMessage allocator does not work correctly.
comment:11 by , 15 years ago
Rev 36330 and the issue has gotten harder to reproduce. Attaching a syslog created with looperlist.diff applied.
by , 15 years ago
Attachment: | syslog_36330.txt added |
---|
comment:12 by , 14 years ago
Blocking: | 4470 added |
---|
comment:14 by , 14 years ago
I can't get far enough to test it any more. The wifi interface takes its sweet time before it shows up in ifconfig, then:
~> setwep /dev/net/iprowifi3945/0 haikuwifi 0x12345678901234567890123456
set80211: error in handling SIOCS80211
set80211: error in handling SIOCS80211
set80211: error in handling SIOCS80211
set80211: error in handling SIOCS80211
set80211: error in handling SIOCS80211
set80211: error in handling SIOCS80211
set80211: error in handling SIOCS80211
~>
comment:15 by , 14 years ago
That particular bug has been fixed in the meantime, although I don't understand the connection.
comment:17 by , 14 years ago
Appeared again in 39815, gcc2. Ethernet adapter present but not plugged in when the crash occurred ~60 seconds after reaching desktop. gdb output and incomplete syslog attached.
by , 14 years ago
Attachment: | gdb_net_server_39815.txt added |
---|
by , 14 years ago
Attachment: | syslog_39815.txt added |
---|
comment:18 by , 14 years ago
I get a similar crash with static IP in hrev39925, directly at system start. Backtrace is slightly different.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i586-pc-haiku"...(no debugging symbols found)
[tcsetpgrp failed in terminal_inferior: Invalid Argument]
Thread 77 called debugger(): looperlist is not locked; proceed at great risk!
Reading symbols from /boot/system/runtime_loader...done.
Loaded symbols for /boot/system/runtime_loader
Reading symbols from /boot/system/lib/libbe.so...done.
Loaded symbols for /boot/system/lib/libbe.so
Reading symbols from /boot/system/lib/libnetwork.so...done.
Loaded symbols for /boot/system/lib/libnetwork.so
Reading symbols from /boot/system/lib/libbnetapi.so...done.
Loaded symbols for /boot/system/lib/libbnetapi.so
Reading symbols from /boot/system/lib/libstdc++.so...done.
Loaded symbols for /boot/system/lib/libstdc++.so
Reading symbols from /boot/system/lib/libroot.so...done.
Loaded symbols for /boot/system/lib/libroot.so
Reading symbols from /boot/system/lib/libsupc++.so...done.
Loaded symbols for /boot/system/lib/libsupc++.so
[tcsetpgrp failed in terminal_inferior: Invalid Argument]
[Switching to team /boot/system/servers/net_server (77) thread net_server (77)]
0xffff0114 in ?? ()
(gdb) bt
#0 0xffff0114 in ?? ()
#1 0x0060a19b in debugger () from /boot/system/lib/libroot.so
#2 0x002e2c51 in BPrivate::BLooperList::AssertLocked () from /boot/system/lib/libbe.so
#3 0x002e3245 in BPrivate::BLooperList::IsLooperValid () from /boot/system/lib/libbe.so
#4 0x002e0a85 in BLooper::IsLocked () from /boot/system/lib/libbe.so
#5 0x002e0ad5 in BLooper::AssertLocked () from /boot/system/lib/libbe.so
#6 0x002e1538 in BLooper::AddHandler () from /boot/system/lib/libbe.so
#7 0x003c6eeb in BPrivate::PathHandler::PathHandler () from /boot/system/lib/libbe.so
#8 0x003c70a7 in BPrivate::BPathMonitor::StartWatching () from /boot/system/lib/libbe.so
#9 0x0020c3ad in NetServer::ReadyToRun ()
#10 0x002d78f9 in BApplication::DispatchMessage () from /boot/system/lib/libbe.so
#11 0x002e22c4 in BLooper::task_looper () from /boot/system/lib/libbe.so
#12 0x002d7eb4 in BApplication::Run () from /boot/system/lib/libbe.so
#13 0x0020ad47 in main ()
(gdb)
follow-up: 21 comment:19 by , 14 years ago
I think these problems are related to Services::_LaunchService(). Removing this function makes the crash disappear.
Probably fork() followed by execv() will destroy/modify some semaphores used for locking.
For debugging purposes, I've been using
#define BLOCKER_ALWAYS_SEMAPHORE_STYLE 1 and #define ENABLE_PARANOIA_CHECKS 1
to build the images.
comment:20 by , 14 years ago
Marcus, that's a very likely cause of the problem, although semaphores of the parent should be unaffected by fork/exec.
comment:21 by , 14 years ago
Replying to marcusoverhagen:
Probably fork() followed by execv() will destroy/modify some semaphores used for locking.
I'd enter KDL and check whether the net server's looper list semaphore still exists and, if so, what its count is. Enabling kernel tracing might help to track the issue further. The usual bunch (syscalls, teams, signals) should be a good starting point.
comment:23 by , 14 years ago
Not sure it's the same bug but hrev41571 just crashed to KDL instead of to GDB with references to net_server, photo attached.
by , 14 years ago
Attachment: | r41571_panic.jpg added |
---|
comment:25 by , 14 years ago
Blocking: | 7665 added |
---|
comment:26 by , 14 years ago
In hrev33050, a similar bug was closed with the following comment: "Net_server starts services by invoking fork() followed by exec(). If the latter fails (for instance because the service isn't installed), the forked child is invoking exit(). This in turn unloads libbe, triggering static cleanup code in BMessage, which deletes a couple of message ports that were inherited from the parent during the fork. [...]"
Doesn't the same problem apply to the BLocker member of global BLooperList gLooperList; in LooperList.cpp? I had used BLOCKER_ALWAYS_SEMAPHORE_STYLE during testing.
What happens to other libraries' global port/sem/thread ids after a fork/exec/exit?
comment:27 by , 14 years ago
Global resources are shared with the forked team, but they remain owned by the original team; the child can clobber and delete them, but exiting a child won't do anything bad (besides calling the library destructors, it seems).
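To illustrate the fork()/exec()/exit() interaction discussed above, here is a minimal POSIX sketch. It is not net_server or libbe code: `cleanup_hook` and `child_ran_cleanup` are hypothetical names, and the `atexit()` handler merely stands in for a library's static cleanup code. It assumes the child behaves as if exec() had failed, and compares leaving via `exit()` (which runs the cleanup inherited from the parent) with `_exit()` (which skips it).

```cpp
#include <cassert>
#include <cstdlib>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Write end of a pipe; only ever set in the forked child.
static int gReportFd = -1;

// Hypothetical stand-in for a library's static cleanup code.
static void cleanup_hook()
{
	if (gReportFd >= 0) {
		const char mark = 'x';
		ssize_t written = write(gReportFd, &mark, 1);
		(void)written;
	}
}

// Forks a child that acts as if exec() had failed and then leaves via
// exit() or _exit(). Returns true if the inherited cleanup ran in the child.
bool child_ran_cleanup(bool useExit)
{
	// The handler is registered in the parent, so the child inherits it.
	static bool registered = false;
	if (!registered) {
		std::atexit(cleanup_hook);
		registered = true;
	}
	assert(gReportFd == -1);	// the parent itself never reports

	int fds[2];
	if (pipe(fds) != 0)
		return false;

	pid_t child = fork();
	if (child < 0) {
		close(fds[0]);
		close(fds[1]);
		return false;
	}

	if (child == 0) {
		close(fds[0]);
		gReportFd = fds[1];
		if (useExit)
			std::exit(1);	// runs atexit handlers and static destructors
		_exit(1);		// terminates immediately, no cleanup code runs
	}

	close(fds[1]);
	char mark;
	ssize_t bytesRead = read(fds[0], &mark, 1);	// 0 on EOF: no cleanup ran
	close(fds[0]);

	int status = 0;
	waitpid(child, &status, 0);
	return bytesRead == 1;
}
```

Calling `child_ran_cleanup(true)` and `child_ran_cleanup(false)` shows the difference between the two exit paths. Whether such cleanup is harmful depends on what it touches; in the case quoted above it deleted ports inherited from the parent.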
comment:29 by , 13 years ago
Component: | Servers/net_server → Kits/Application Kit |
---|---|
Resolution: | → fixed |
Status: | new → closed |
Should be fixed in hrev42982. I've only taken care of the Application Kit, though.
comment:30 by , 12 years ago
With no ethernet or wifi connected, net_server crashes a few seconds after the "No link" notifications have appeared. Not sure if this warrants a new ticket, please advise.
by , 12 years ago
Attachment: | gdb_net_server_44620.txt added |
---|
by , 12 years ago
Attachment: | syslog_44620.txt added |
---|
comment:31 by , 7 years ago
Blocking: | 7665 removed |
---|
The stack trace suggests that there is an unbalanced Unlock() call in the code somewhere.
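As a rough illustration of how an unbalanced Unlock() could be caught at the call site, here is a small sketch. `CheckedLock` is a hypothetical class built on the C++ standard library, not Haiku's BLocker or BLooper; it only assumes that a lock can remember its owning thread and nesting count, and refuse (and count) an Unlock() from a thread that does not hold it.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical recursive lock that detects Unlock() calls without a
// matching Lock() by the calling thread.
class CheckedLock {
public:
	void Lock()
	{
		fMutex.lock();
		fOwner.store(std::this_thread::get_id());
		fCount++;
	}

	// Returns false and leaves the lock untouched on an unbalanced call.
	bool Unlock()
	{
		if (fOwner.load() != std::this_thread::get_id()) {
			// A real implementation might log a stack trace here.
			fUnbalanced++;
			return false;
		}
		// We are the owner, so fCount is ours to modify.
		if (--fCount == 0)
			fOwner.store(std::thread::id());
		fMutex.unlock();
		return true;
	}

	int UnbalancedUnlocks() const { return fUnbalanced.load(); }

private:
	std::recursive_mutex fMutex;
	std::atomic<std::thread::id> fOwner{std::thread::id()};
	int fCount = 0;	// nesting depth, only touched by the owner
	std::atomic<int> fUnbalanced{0};
};

// Demonstrates one balanced and one unbalanced Unlock().
int count_unbalanced_unlocks()
{
	CheckedLock lock;
	lock.Lock();
	lock.Unlock();	// balanced
	lock.Unlock();	// unbalanced: nobody holds the lock any more
	return lock.UnbalancedUnlocks();
}
```

Reporting at the offending Unlock() call, rather than at a later AssertLocked(), is the design choice here: the backtrace then points at the caller that released the lock once too often.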