Opened 6 years ago

Closed 4 years ago

Last modified 4 years ago

#9377 closed bug (fixed)

Network Stack Deadlock

Reported by: jackburton Owned by: axeld
Priority: normal Milestone: R1
Component: Network & Internet/Stack Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Has a Patch: yes Platform: All

Description (last modified by jackburton)

Yesterday I was doing some tests with the network stack, and after a while, I experienced a deadlock. ifconfig was locked and couldn't be killed (even with kill -9). net_server was locked, too. I then entered KDL, and had a look at the threads involved. There was a deadlock between ifconfig and wpa_supplicant. The net_interfaces mutex was held by wpa_supplicant, which was waiting on the mutex of the wireless interface, which in turn was held by ifconfig. So I guess we have a lock inversion somewhere. It's reproducible here 100% of the times by doing: ifconfig /dev/net/<wireless_card> down while wpa_supplicant is scanning the available networks. I have an ipro3945 wireless card.

Attachments (6)

IMG_20130119_202005.jpg (745.9 KB) - added by jackburton 6 years ago.
IMG_20130119_202045.jpg (1.1 MB) - added by jackburton 6 years ago.
IMG_20130119_202455.jpg (921.1 KB) - added by jackburton 6 years ago.
IMG_20130119_202531.jpg (1.1 MB) - added by jackburton 6 years ago.
0001-Converted-the-net_interfaces-mutex-into-a-recursive-.patch (5.3 KB) - added by jackburton 6 years ago.
0001-netstack-Interface-Improve-locking.patch (4.6 KB) - added by jackburton 5 years ago.
Second try

Change History (30)

comment:1 Changed 6 years ago by jackburton

Version: R1/alpha4.1R1/Development

Changed 6 years ago by jackburton

Attachment: IMG_20130119_202005.jpg added

Changed 6 years ago by jackburton

Attachment: IMG_20130119_202045.jpg added

Changed 6 years ago by jackburton

Attachment: IMG_20130119_202455.jpg added

Changed 6 years ago by jackburton

Attachment: IMG_20130119_202531.jpg added

comment:2 Changed 6 years ago by jackburton

Description: modified (diff)

comment:3 Changed 6 years ago by jackburton

From the backtrace of the two threads, the lock inversion is clear: the wpa_supplicant thread calls get_interface() which locks the "net_interfaces" mutex, then CreateDomainDatalinkIfNeeded() try to acquire the recursive lock of the interface.

The ifconfig thread, on the other hand, first acquires the interface lock in Interface::SetDown(), then (through the ipv6 protocol add-on) calls get_interface() which tries to acquire the net_interfaces mutex.

Version 1, edited 6 years ago by jackburton (previous) (next) (diff)

comment:4 Changed 6 years ago by jackburton

Would it be possible to unlock the interface list mutex after getting the interface in get_interface(), before calling CreateDomainDatalinkIfNeeded() ?

I mean here:

Interface*
get_interface(net_domain* domain, const char* name)
{
	MutexLocker locker(sLock);

	Interface* interface = find_interface(name);
	if (interface == NULL)
		return NULL;

+       locker.Unlock();

	if (interface->CreateDomainDatalinkIfNeeded(domain) != B_OK)
		return NULL;

	interface->AcquireReference();
	return interface;
}

At least, that function doesn't seem to mess with the interfaces list, so it would seem to be safe to unlock.

comment:5 Changed 6 years ago by axeld

That does not work, unfortunately.

First of all, a second get_interface() for the same name might return an invalid interface (ie. one that hasn't been completely initialized), or CreateDomainDatalinkIfNeeded() might be called twice.

Second of all, if you do not acquire a reference before losing the lock, the interface might be deleted by someone else before you do.

In general, I think the locking order "all interfaces" -> "single interface" is more natural than the other way around. Maybe SetDown() can be changed to not hold the interface lock when calling into the interface code. If that isn't possible at all (I don't remember), another way would be to lock the interfaces lock, too (in the right order).

comment:6 Changed 6 years ago by jackburton

I went with the second solution. Here's a patch which converts the net_interfaces mutex to a recursive_lock, so that it can be acquired in Interface::SetDown(). This avoids the deadlock. Please someone review the patch.

comment:7 Changed 6 years ago by jackburton

Has a Patch: set

comment:8 Changed 6 years ago by axeld

Not my preferred solution, but it looks good. Maybe add a TODO comment to SetDown() that suggests that a better solution would be to prevent locking when calling the lower layers?

Thanks for working on this!

comment:9 Changed 6 years ago by jackburton

Resolution: fixed
Status: newclosed

Applied in hrev45240.

comment:10 in reply to:  8 Changed 5 years ago by jackburton

Replying to axeld:

Not my preferred solution, but it looks good. Maybe add a TODO comment to SetDown() that suggests that a better solution would be to prevent locking when calling the lower layers?

How would you avoid locking in this case ?

comment:11 Changed 5 years ago by axeld

Hm, why didn't you ask me 19 months ago? :-)

By looking over the code, I noticed that Interface::_SetUp() has the same issue. I think the best solution would be to use a similar approach than the VFS when vnodes are created:

  • lock all interfaces
  • create the interface, and put it in there
  • mark the interface as busy
  • unlock all interfaces
  • call into the interface protocol layer as needed (ie. interface_up()/interface_down())
  • lock all interfaces again
  • remove the busy flag
  • unlock all interfaces

Making the busy flag atomic would save locking all interfaces twice. Of course, get_interface() would then need to check for the busy flag, and not return those.

If that fails (maybe because the up/down path calls get_interface() on itself, and needs it to succeed), the change will get a bit more complicated, than this, though.

comment:12 in reply to:  11 ; Changed 5 years ago by jackburton

Replying to axeld:

Hm, why didn't you ask me 19 months ago? :-)

Dunno, maybe I thought it would've been too complicated, and I just wanted a quick fix (for myself).

By looking over the code, I noticed that Interface::_SetUp() has the same issue. I think the best solution would be to use a similar approach than the VFS when vnodes are created:

  • lock all interfaces
  • create the interface, and put it in there
  • mark the interface as busy
  • unlock all interfaces
  • call into the interface protocol layer as needed (ie. interface_up()/interface_down())
  • lock all interfaces again
  • remove the busy flag
  • unlock all interfaces

Making the busy flag atomic would save locking all interfaces twice. Of course, get_interface() would then need to check for the busy flag, and not return those.

If that fails (maybe because the up/down path calls get_interface() on itself, and needs it to succeed), the change will get a bit more complicated, than this, though.

Okay, thanks for explaining.

comment:13 in reply to:  11 Changed 5 years ago by jackburton

Replying to axeld:

Making the busy flag atomic would save locking all interfaces twice.

Of course, get_interface() would then need to check for the busy flag, and not return those.

get_interface(), but also get_interface_for_device() and get_interface_for_link(), I assume.

comment:14 in reply to:  12 Changed 5 years ago by axeld

Replying to jackburton:

Dunno, maybe I thought it would've been too complicated, and I just wanted a quick fix (for myself).

I just meant because I wouldn't have needed to look at the code again then :-) This way I at least noticed that _SetUp() has the same bug, though.

get_interface(), but also get_interface_for_device() and get_interface_for_link(), I assume.

Yup.

comment:15 Changed 5 years ago by jackburton

I attached a patch. Does it look good ? I'm testing it right now, and it doesn't show any regression, at least.

Last edited 5 years ago by jackburton (previous) (diff)

comment:16 Changed 5 years ago by axeld

Almost:

  • You still need to hold the all interfaces lock when setting the interface busy, otherwise you might hand out busy interfaces; the lock can be released as soon as the interface is made busy.
  • Not sure if SetDown() can be reached in that phase, but if it does, it would need to check for being busy as well (and bail out in that case). Also, I don't know if it's currently given that SetDown() can only be called when there are no other interface users. If it isn't the patch is going to be more involved.
  • Indenting seems off in interfaces.h ("inline" must be indented one tab if it isn't) -- if the rest doesn't fit, it's probably not correctly indented either, though (I think the current indentation rule is newer than the network stack).

Have you tried if it works? :-)

I'll try to read through that code again in the next couple of days to make sure that SetDown() works the way it should in this context. It could very well be that setting the interface busy might only be needed in the _SetUp() case.

comment:17 Changed 5 years ago by jackburton

Just an update: I didn't forget to update the code, but my haiku development machine (my only usable machine, actually) is currently "exploded" on my desk, waiting for a new fan.

Changed 5 years ago by jackburton

Second try

comment:18 in reply to:  16 Changed 5 years ago by jackburton

Replying to axeld:

Have you tried if it works? :-)

Yes, it works.

comment:19 Changed 5 years ago by pulkomandy

Resolution: fixed
Status: closedreopened

Reopening since there is an updated version of the patch waiting here...

comment:20 Changed 4 years ago by jscipione

can somebody please look into this?

comment:21 Changed 4 years ago by korli

  1. I would check the busy flag with sLock hold in SetDown(), or SetBusy() should return the previous value and return if already busy (with atomic_get_and_set())
  2. I would also check for the busy flag in get_interface_address_for_destination(), get_interface_address_for_link().

comment:22 Changed 4 years ago by kallisti5

I tried the patch locally.. while it doesn't introduce any new bugs, i'm seeing some odd network issues here... so I can't confirm 100% it works. I'd say we hold off on this one until we fix the current network / DHCP issues.

comment:23 Changed 4 years ago by jackburton

I can update the patch with Korli advices, and if there aren't objections, I'd commit it.

comment:24 Changed 4 years ago by jackburton

Resolution: fixed
Status: reopenedclosed

Tested working and applied in hrev49531. If you think something's wrong, feel free to comment or fix it directly.

Last edited 4 years ago by jackburton (previous) (diff)
Note: See TracTickets for help on using tickets.