#9377 closed bug (fixed)
Network Stack Deadlock
Reported by: | jackburton | Owned by: | axeld |
---|---|---|---|
Priority: | normal | Milestone: | R1 |
Component: | Network & Internet/Stack | Version: | R1/Development |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Platform: | All |
Description (last modified by )
Yesterday I was doing some tests with the network stack, and after a while, I experienced a deadlock. ifconfig was locked and couldn't be killed (even with kill -9). net_server was locked, too.
I then entered KDL, and had a look at the threads involved.
There was a deadlock between ifconfig and wpa_supplicant. The net_interfaces mutex was held by wpa_supplicant, which was waiting on the mutex of the wireless interface, which in turn was held by ifconfig.
So I guess we have a lock inversion somewhere.
It's reproducible here 100% of the times by doing:
ifconfig /dev/net/<wireless_card> down
while wpa_supplicant is scanning the available networks.
I have an ipro3945 wireless card.
Attachments (6)
Change History (30)
comment:1 by , 12 years ago
Version: | R1/alpha4.1 → R1/Development |
---|
by , 12 years ago
Attachment: | IMG_20130119_202005.jpg added |
---|
by , 12 years ago
Attachment: | IMG_20130119_202045.jpg added |
---|
by , 12 years ago
Attachment: | IMG_20130119_202455.jpg added |
---|
by , 12 years ago
Attachment: | IMG_20130119_202531.jpg added |
---|
comment:2 by , 12 years ago
Description: | modified (diff) |
---|
comment:4 by , 12 years ago
Would it be possible to unlock the interface list mutex after getting the interface in get_interface(), before calling CreateDomainDatalinkIfNeeded() ?
I mean here:
Interface* get_interface(net_domain* domain, const char* name) { MutexLocker locker(sLock); Interface* interface = find_interface(name); if (interface == NULL) return NULL; + locker.Unlock(); if (interface->CreateDomainDatalinkIfNeeded(domain) != B_OK) return NULL; interface->AcquireReference(); return interface; }
At least, that function doesn't seem to mess with the interfaces list, so it would seem to be safe to unlock.
comment:5 by , 12 years ago
That does not work, unfortunately.
First of all, a second get_interface() for the same name might return an invalid interface (ie. one that hasn't been completely initialized), or CreateDomainDatalinkIfNeeded() might be called twice.
Second of all, if you do not acquire a reference before losing the lock, the interface might be deleted by someone else before you do.
In general, I think the locking order "all interfaces" -> "single interface" is more natural than the other way around. Maybe SetDown()
can be changed to not hold the interface lock when calling into the interface code. If that isn't possible at all (I don't remember), another way would be to lock the interfaces lock, too (in the right order).
comment:6 by , 12 years ago
I went with the second solution. Here's a patch which converts the net_interfaces mutex to a recursive_lock, so that it can be acquired in Interface::SetDown(). This avoids the deadlock. Please someone review the patch.
by , 12 years ago
Attachment: | 0001-Converted-the-net_interfaces-mutex-into-a-recursive-.patch added |
---|
comment:7 by , 12 years ago
patch: | 0 → 1 |
---|
follow-up: 10 comment:8 by , 12 years ago
Not my preferred solution, but it looks good. Maybe add a TODO comment to SetDown() that suggests that a better solution would be to prevent locking when calling the lower layers?
Thanks for working on this!
comment:10 by , 10 years ago
Replying to axeld:
Not my preferred solution, but it looks good. Maybe add a TODO comment to SetDown() that suggests that a better solution would be to prevent locking when calling the lower layers?
How would you avoid locking in this case ?
follow-ups: 12 13 comment:11 by , 10 years ago
Hm, why didn't you ask me 19 months ago? :-)
By looking over the code, I noticed that Interface::_SetUp() has the same issue. I think the best solution would be to use a similar approach than the VFS when vnodes are created:
- lock all interfaces
- create the interface, and put it in there
- mark the interface as busy
- unlock all interfaces
- call into the interface protocol layer as needed (ie. interface_up()/interface_down())
- lock all interfaces again
- remove the busy flag
- unlock all interfaces
Making the busy flag atomic would save locking all interfaces twice. Of course, get_interface() would then need to check for the busy flag, and not return those.
If that fails (maybe because the up/down path calls get_interface() on itself, and needs it to succeed), the change will get a bit more complicated, than this, though.
follow-up: 14 comment:12 by , 10 years ago
Replying to axeld:
Hm, why didn't you ask me 19 months ago? :-)
Dunno, maybe I thought it would've been too complicated, and I just wanted a quick fix (for myself).
By looking over the code, I noticed that Interface::_SetUp() has the same issue. I think the best solution would be to use a similar approach than the VFS when vnodes are created:
- lock all interfaces
- create the interface, and put it in there
- mark the interface as busy
- unlock all interfaces
- call into the interface protocol layer as needed (ie. interface_up()/interface_down())
- lock all interfaces again
- remove the busy flag
- unlock all interfaces
Making the busy flag atomic would save locking all interfaces twice. Of course, get_interface() would then need to check for the busy flag, and not return those.
If that fails (maybe because the up/down path calls get_interface() on itself, and needs it to succeed), the change will get a bit more complicated, than this, though.
Okay, thanks for explaining.
comment:13 by , 10 years ago
Replying to axeld:
Making the busy flag atomic would save locking all interfaces twice.
Of course, get_interface() would then need to check for the busy flag, and not return those.
get_interface(), but also get_interface_for_device() and get_interface_for_link(), I assume.
comment:14 by , 10 years ago
Replying to jackburton:
Dunno, maybe I thought it would've been too complicated, and I just wanted a quick fix (for myself).
I just meant because I wouldn't have needed to look at the code again then :-) This way I at least noticed that _SetUp() has the same bug, though.
get_interface(), but also get_interface_for_device() and get_interface_for_link(), I assume.
Yup.
comment:15 by , 10 years ago
I attached a patch. Does it look good ? I'm testing it right now, and it doesn't show any regression, at least.
follow-up: 18 comment:16 by , 10 years ago
Almost:
- You still need to hold the all interfaces lock when setting the interface busy, otherwise you might hand out busy interfaces; the lock can be released as soon as the interface is made busy.
- Not sure if SetDown() can be reached in that phase, but if it does, it would need to check for being busy as well (and bail out in that case). Also, I don't know if it's currently given that SetDown() can only be called when there are no other interface users. If it isn't the patch is going to be more involved.
- Indenting seems off in interfaces.h ("inline" must be indented one tab if it isn't) -- if the rest doesn't fit, it's probably not correctly indented either, though (I think the current indentation rule is newer than the network stack).
Have you tried if it works? :-)
I'll try to read through that code again in the next couple of days to make sure that SetDown() works the way it should in this context. It could very well be that setting the interface busy might only be needed in the _SetUp() case.
comment:17 by , 10 years ago
Just an update: I didn't forget to update the code, but my haiku development machine (my only usable machine, actually) is currently "exploded" on my desk, waiting for a new fan.
comment:19 by , 10 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
Reopening since there is an updated version of the patch waiting here...
comment:21 by , 10 years ago
- I would check the busy flag with sLock hold in SetDown(), or SetBusy() should return the previous value and return if already busy (with atomic_get_and_set())
- I would also check for the busy flag in get_interface_address_for_destination(), get_interface_address_for_link().
comment:22 by , 9 years ago
I tried the patch locally.. while it doesn't introduce any new bugs, i'm seeing some odd network issues here... so I can't confirm 100% it works. I'd say we hold off on this one until we fix the current network / DHCP issues.
comment:23 by , 9 years ago
I can update the patch with Korli advices, and if there aren't objections, I'd commit it.
comment:24 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | reopened → closed |
Tested working and applied in hrev9377. If you think something's wrong, feel free to comment or fix it directly.
From the backtrace of the two threads, the lock inversion is clear: the wpa_supplicant thread calls get_interface() which first locks the "net_interfaces" mutex, then calls CreateDomainDatalinkIfNeeded() tries to acquire the recursive lock of the interface.
The ifconfig thread, on the other hand, first acquires the interface lock in Interface::SetDown(), then (through the ipv6 protocol add-on) calls get_interface() which tries to acquire the net_interfaces mutex.