Opened 9 years ago

Closed 8 years ago

Last modified 15 months ago

#5668 closed bug (fixed)

net_server often crashes after boot

Reported by: luroh
Owned by: axeld
Priority: normal
Milestone: R1
Component: Kits/Application Kit
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking: #4470
Has a Patch: no
Platform: x86

Description

net_server often crashes a few seconds after booting Haiku while testing wifi (ipw3945abg) with no other network interfaces present.

Core Duo laptop, rev 36010, gcc4.

Attachments (10)

gdb_net_server.txt (2.0 KB) - added by luroh 9 years ago.
looperlist.diff (553 bytes) - added by axeld 9 years ago.
additional debug output in BLooperList::AssertLocked()
gdb_net_server_36237.txt (1.9 KB) - added by luroh 9 years ago.
syslog_36237.txt (142.0 KB) - added by luroh 9 years ago.
syslog_36330.txt (130.5 KB) - added by luroh 9 years ago.
gdb_net_server_39815.txt (2.0 KB) - added by luroh 9 years ago.
syslog_39815.txt (487.3 KB) - added by luroh 9 years ago.
r41571_panic.jpg (298.7 KB) - added by luroh 8 years ago.
gdb_net_server_44620.txt (2.8 KB) - added by luroh 7 years ago.
syslog_44620.txt (145.3 KB) - added by luroh 7 years ago.

Change History (41)

by luroh, 9 years ago

Attachment: gdb_net_server.txt added

comment:1 by stippi, 9 years ago

The stack trace suggests that there is an unbalanced Unlock() call in the code somewhere.
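
To illustrate what an unbalanced Unlock() means, here is a minimal sketch of a lock that keeps track of its owner and nesting count and refuses an Unlock() from a caller that does not hold it. CheckedLock and CheckedLocker are hypothetical names invented for this example; they are not Haiku's BLocker or BAutolock, and the bookkeeping is deliberately simplified (it models the accounting only and does not block other threads).

```cpp
#include <cstdio>
#include <thread>

// Hypothetical stand-in for a recursive lock with owner tracking.
// NOT Haiku's BLocker: it only models the Lock()/Unlock() accounting
// and does not provide real mutual exclusion.
class CheckedLock {
public:
    void Lock()
    {
        fOwner = std::this_thread::get_id();
        fCount++;
    }

    // Returns false (and logs) when the caller does not hold the lock,
    // which is exactly the "unbalanced Unlock()" case.
    bool Unlock()
    {
        if (!IsLocked()) {
            std::fprintf(stderr, "Unlock() without matching Lock()\n");
            return false;
        }
        if (--fCount == 0)
            fOwner = std::thread::id();
        return true;
    }

    // True only if the calling thread currently holds the lock.
    bool IsLocked() const
    {
        return fCount > 0 && fOwner == std::this_thread::get_id();
    }

private:
    std::thread::id fOwner;
    int fCount = 0;
};

// Scope guard: the destructor issues exactly one Unlock() per Lock(),
// so the two calls cannot get out of balance on early returns.
class CheckedLocker {
public:
    explicit CheckedLocker(CheckedLock& lock) : fLock(lock) { fLock.Lock(); }
    ~CheckedLocker() { fLock.Unlock(); }
    CheckedLocker(const CheckedLocker&) = delete;
    CheckedLocker& operator=(const CheckedLocker&) = delete;

private:
    CheckedLock& fLock;
};
```

Holding the lock through a scope guard such as `CheckedLocker guard(lock);` keeps every Lock() paired with one Unlock(), while a stray `lock.Unlock()` outside any guard is reported instead of silently corrupting the count.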

comment:2 by luroh, 9 years ago

Ok, let me know if there is anything I can do to help pin it down. The crash is pretty easily repeatable for me; it occurs about every other boot or so.

comment:3 by luroh, 9 years ago

As of revision 36099 the problem has unfortunately gotten worse: net_server now consistently crashes, presumably at the point where the wifi card is trying to apply the settings received from the access point's DHCP server. Using an older version of setwep does not help.

comment:4 by axeld, 9 years ago

Great! Having a reproducible problem is the best start to fix it :-)

There doesn't seem to be any locking problem in the code; in fact, the server doesn't play with its locking at all. Can you add debug output to the server that helps find out who unlocks it?

in reply to:  4 comment:5 by luroh, 9 years ago

Replying to axeld:

Can you add debug output to the server that helps find out who unlocks it?

Not without instructions or, preferably, a patch.

comment:6 by axeld, 9 years ago

Okay, thanks, will try to come up with something.

comment:7 by axeld, 9 years ago

Another look at it revealed that the net_server isn't even the problem; locking the looper list fails, which is pretty strange. Could be memory corruption, but could be a lot else, too.

I'm attaching a patch that hopefully writes a clue to the syslog.

by axeld, 9 years ago

Attachment: looperlist.diff added

additional debug output in BLooperList::AssertLocked()

comment:8 by luroh, 9 years ago

Thank you, attaching gdb output and syslog from revision 36237.

by luroh, 9 years ago

Attachment: gdb_net_server_36237.txt added

by luroh, 9 years ago

Attachment: syslog_36237.txt added

comment:9 by luroh, 9 years ago

Sorry, the two logs from rev 36237 were actually generated with an ipro1000 Ethernet device present but not plugged in. I forgot to remove the symlink. I'll resume proper testing as per the original bug report on Friday when I get back home, where I have access to WEP encrypted wifi (there's only WPA available at my current location).

comment:10 by axeld, 9 years ago

The crash in that syslog looks like something ran out of memory, or the BMessage allocator does not work correctly.

comment:11 by luroh, 9 years ago

As of rev 36330 the issue has gotten harder to reproduce. Attaching a syslog created with looperlist.diff applied.

by luroh, 9 years ago

Attachment: syslog_36330.txt added

comment:12 by axeld, 9 years ago

Blocking: 4470 added

(In #4470) Bug #5668 contains better information for this problem.

comment:13 by axeld, 9 years ago

Can you still reproduce the problem?

comment:14 by luroh, 9 years ago

I can't get far enough to test it any more. The wifi interface takes its sweet time before it shows up in ifconfig, then:

~> setwep /dev/net/iprowifi3945/0 haikuwifi 0x12345678901234567890123456
set80211: error in handling SIOCS80211
set80211: error in handling SIOCS80211
set80211: error in handling SIOCS80211
set80211: error in handling SIOCS80211
set80211: error in handling SIOCS80211
set80211: error in handling SIOCS80211
set80211: error in handling SIOCS80211
~>

comment:15 by axeld, 9 years ago

That particular bug has been fixed in the meantime, although I don't understand the connection.

comment:16 by luroh, 9 years ago

This bug seems fixed; I can't reproduce it with rev 38286.

comment:17 by luroh, 9 years ago

Appeared again in 39815, gcc2. Ethernet adapter present but not plugged in when the crash occurred ~60 seconds after reaching desktop. gdb output and incomplete syslog attached.

by luroh, 9 years ago

Attachment: gdb_net_server_39815.txt added

by luroh, 9 years ago

Attachment: syslog_39815.txt added

comment:18 by marcusoverhagen, 9 years ago

I get a similar crash with static IP in hrev39925, directly at system start. Backtrace is slightly different.

GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i586-pc-haiku"...(no debugging symbols found)

[tcsetpgrp failed in terminal_inferior: Invalid Argument]
Thread 77 called debugger(): looperlist is not locked; proceed at great risk!
Reading symbols from /boot/system/runtime_loader...done.
Loaded symbols for /boot/system/runtime_loader
Reading symbols from /boot/system/lib/libbe.so...done.
Loaded symbols for /boot/system/lib/libbe.so
Reading symbols from /boot/system/lib/libnetwork.so...done.
Loaded symbols for /boot/system/lib/libnetwork.so
Reading symbols from /boot/system/lib/libbnetapi.so...done.
Loaded symbols for /boot/system/lib/libbnetapi.so
Reading symbols from /boot/system/lib/libstdc++.so...done.
Loaded symbols for /boot/system/lib/libstdc++.so
Reading symbols from /boot/system/lib/libroot.so...done.
Loaded symbols for /boot/system/lib/libroot.so
Reading symbols from /boot/system/lib/libsupc++.so...done.
Loaded symbols for /boot/system/lib/libsupc++.so
[tcsetpgrp failed in terminal_inferior: Invalid Argument]
[Switching to team /boot/system/servers/net_server (77) thread net_server (77)]
0xffff0114 in ?? ()
(gdb) bt
#0  0xffff0114 in ?? ()
#1  0x0060a19b in debugger () from /boot/system/lib/libroot.so
#2  0x002e2c51 in BPrivate::BLooperList::AssertLocked () from /boot/system/lib/libbe.so
#3  0x002e3245 in BPrivate::BLooperList::IsLooperValid () from /boot/system/lib/libbe.so
#4  0x002e0a85 in BLooper::IsLocked () from /boot/system/lib/libbe.so
#5  0x002e0ad5 in BLooper::AssertLocked () from /boot/system/lib/libbe.so
#6  0x002e1538 in BLooper::AddHandler () from /boot/system/lib/libbe.so
#7  0x003c6eeb in BPrivate::PathHandler::PathHandler () from /boot/system/lib/libbe.so
#8  0x003c70a7 in BPrivate::BPathMonitor::StartWatching () from /boot/system/lib/libbe.so
#9  0x0020c3ad in NetServer::ReadyToRun ()
#10 0x002d78f9 in BApplication::DispatchMessage () from /boot/system/lib/libbe.so
#11 0x002e22c4 in BLooper::task_looper () from /boot/system/lib/libbe.so
#12 0x002d7eb4 in BApplication::Run () from /boot/system/lib/libbe.so
#13 0x0020ad47 in main ()
(gdb)

comment:19 by marcusoverhagen, 9 years ago

I think these problems are related to Services::_LaunchService(). Removing this function makes the crash disappear.

Probably fork() followed by execv() will destroy/modify some semaphores used for locking.

For debugging purposes, I've been using

#define BLOCKER_ALWAYS_SEMAPHORE_STYLE 1
#define ENABLE_PARANOIA_CHECKS 1

to build the images.

comment:20 by axeld, 9 years ago

Marcus, that's a very likely cause of the problem, although semaphores of the parent should be unaffected by fork/exec.

in reply to:  19 comment:21 by bonefish, 9 years ago

Replying to marcusoverhagen:

Probably fork() followed by execv() will destroy/modify some semaphores used for locking.

I'd enter KDL and check whether the net server's looper list semaphore still exists and, if so, what its count is. Enabling kernel tracing might help to track the issue further. The usual bunch (syscalls, teams, signals) should be a good starting point.

comment:22 by korli, 8 years ago

Any news on this one? Still reproducible?

comment:23 by luroh, 8 years ago

Not sure it's the same bug, but hrev41571 just crashed to KDL instead of to GDB, with references to net_server; photo attached.

by luroh, 8 years ago

Attachment: r41571_panic.jpg added

comment:24 by luroh, 8 years ago

That panic actually fits better with ticket #5711.

comment:25 by scottmc, 8 years ago

Blocking: 7665 added

comment:26 by marcusoverhagen, 8 years ago

In hrev33050, a similar bug was closed with the following comment:

Net_server starts services by invoking fork() followed by exec(). If the latter fails (for instance because the service isn't installed), the forked child is invoking exit(). This in turn unloads libbe, triggering static cleanup code in BMessage, which deletes a couple of message ports that were inherited from the parent during the fork.[...]

Doesn't the same problem apply to the BLocker member of the global BLooperList gLooperList in LooperList.cpp? I had used BLOCKER_ALWAYS_SEMAPHORE_STYLE during testing.

What happens to other libraries' global port/sem/thread IDs after a fork/exec/exit?

comment:27 by axeld, 8 years ago

Global resources are shared with the forked team, but they remain owned by the original team. The child can clobber and delete them, but the child exiting won't do anything bad (besides calling the library destructors, it seems).
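
The fork()/exec()/exit() sequence discussed above can be sketched with plain POSIX calls. This is only an illustration of the general pattern, not the net_server code and not necessarily the change made for this ticket; LaunchAndWait is a hypothetical helper name. The point it shows is that a child whose execv() fails can terminate with _exit() rather than exit(), so that atexit() handlers and static destructors inherited from the parent are not run in the child.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Launches `path` in a child process and returns the child's exit status,
// or -1 if fork()/waitpid() failed or the child did not exit normally.
// Hypothetical helper for illustration; not taken from net_server.
int LaunchAndWait(const char* path)
{
    pid_t child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        // Child: try to replace the process image with the service.
        char* const args[] = { const_cast<char*>(path), nullptr };
        execv(path, args);

        // Only reached when execv() failed (e.g. the binary is missing).
        // _exit() ends the child without running atexit() handlers or
        // static destructors, so library cleanup code does not touch
        // resources that still belong to the parent.
        _exit(127);
    }

    // Parent: wait for the child and report how it ended.
    int status = 0;
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling `LaunchAndWait()` with a path that does not exist makes execv() fail, and the child reports status 127 without executing any of the parent's cleanup code. A library can also defend itself on its own side (for example by making its global cleanup aware of forks), which is a separate approach from the one sketched here.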

comment:28 by axeld, 8 years ago

So to answer your question: yes, it should have the same problem.

comment:29 by axeld, 8 years ago

Component: Servers/net_server → Kits/Application Kit
Resolution: fixed
Status: new → closed

Should be fixed in hrev42982. I've only taken care of the Application Kit, though.

comment:30 by luroh, 7 years ago

With no ethernet or wifi connected, net_server crashes a few seconds after the "No link" notifications have appeared. Not sure if this warrants a new ticket, please advise.

by luroh, 7 years ago

Attachment: gdb_net_server_44620.txt added

by luroh, 7 years ago

Attachment: syslog_44620.txt added

comment:31 by waddlesplash, 15 months ago

Blocking: 7665 removed