Opened 11 years ago

Closed 11 years ago

#2706 closed bug (fixed)

PANIC: vm_page_fault: unhandled page fault in kernel space at 0xdeadbf07, ip 0x9428897e

Reported by: stippi Owned by: axeld
Priority: critical Milestone: R1/alpha1
Component: Network & Internet/TCP Version: R1/pre-alpha1
Keywords: Cc: imker@…, adek336@…
Blocked By: Blocking: #3392
Has a Patch: no Platform: x86

Description (last modified by stippi)

Happens to me when using Firefox on hrev27315. Here is the stack crawl:

stack trace for thread 80 "/dev/net/ipro1000/0 consumer"

<kernel>vm_page_fault
<kernel>page_fault_exception
<kernel>int_bottom

kernel iframe

<...network/protocols/tcp>Compare__C24ConnectionHashDefinitionRCt4pair2ZPC8sockaddrZPC8sockaddrP11TCPEndpoint + 0x0022
<...network/protocols/tcp>_LockupConnection__15EndpointManagerP8sockaddrT1 + 0x0066
<...network/protocols/tcp>FindConnection__15EndpointManagerP8sockaddrT1 + 0x0054
<...network/protocols/tcp>tcp_receive_data__FP10net_buffer + 0x02e9
<...network/protocols/ipv4>ipv4_receive_data__FP10net_buffer + 0x02ba
<...network/stack>domain_receive_adapter__FPvP10net_deviceP10net_buffer + 0x003a
<...network/stack>device_consumer_thread__FPv + 0x00c7
[...]

Attachments (15)

img_0974.0.jpg (147.7 KB ) - added by Adek336 11 years ago.
broadcom tigon3: deadbeef in consumer thread
img_0972.0.jpg (176.8 KB ) - added by Adek336 11 years ago.
via rhine II: deadbeef in consumer thread (other computer)
img_0973.0.jpg (114.4 KB ) - added by Adek336 11 years ago.
syslog after via_rhine consumer crash
kdl.png (23.9 KB ) - added by Adek336 11 years ago.
ipro100: deadbeef in consumer thread (screenshot downloaded from somewhere)
DSC00353.JPG (43.1 KB ) - added by Hubert 11 years ago.
rev. 28810 hybrid gcc4
img_11410.jpg (446.6 KB ) - added by Adek336 11 years ago.
bfe when opening lot of connections to self with netcat
img_11420.jpg (418.3 KB ) - added by Adek336 11 years ago.
bge, hybrid, torrent, note that nic's interrupt handler is active
img_1022.jpg (252.7 KB ) - added by Adek336 11 years ago.
bge,gcc2,ConnectionHashDefinition::Compare, EndpointManager::_LookupConnection
img_1309 SegmentReceived not locked Firefox.jpg (213.2 KB ) - added by Adek336 11 years ago.
img_1389.jpg (460.4 KB ) - added by Adek336 11 years ago.
img_1391.jpg (402.7 KB ) - added by Adek336 11 years ago.
img_1392.jpg (432.2 KB ) - added by Adek336 11 years ago.
img_1426.jpg (453.8 KB ) - added by Adek336 11 years ago.
img_1440.jpg (222.0 KB ) - added by Adek336 11 years ago.
img_1441.jpg (393.7 KB ) - added by Adek336 11 years ago.

Change History (41)

comment:1 by stippi, 11 years ago

Description: modified (diff)

comment:2 by siarzhuk, 11 years ago

Cc: imker@… added

by Adek336, 11 years ago

Attachment: img_0974.0.jpg added

broadcom tigon3: deadbeef in consumer thread

by Adek336, 11 years ago

Attachment: img_0972.0.jpg added

via rhine II: deadbeef in consumer thread (other computer)

comment:3 by Adek336, 11 years ago

All three of these adapters: Via rhine II, Broadcom Tigon3 and ipro100 use the FreeBSD compability layer.

comment:4 by Adek336, 11 years ago

Cc: adek336@… added

by Adek336, 11 years ago

Attachment: img_0973.0.jpg added

syslog after via_rhine consumer crash

by Adek336, 11 years ago

Attachment: kdl.png added

ipro100: deadbeef in consumer thread (screenshot downloaded from somewhere)

comment:5 by Adek336, 11 years ago

#2279 dup

comment:6 by Hubert, 11 years ago

Platform: Allx86

I reproduced this bug in hybrid gcc4 28810 on FF 2.0.0.12 (Realtek 8139), on gcc2 version work fine.

by Hubert, 11 years ago

Attachment: DSC00353.JPG added

rev. 28810 hybrid gcc4

by Adek336, 11 years ago

Attachment: img_11410.jpg added

bfe when opening lot of connections to self with netcat

comment:7 by Adek336, 11 years ago

Note #2594.

by Adek336, 11 years ago

Attachment: img_11420.jpg added

bge, hybrid, torrent, note that nic's interrupt handler is active

comment:8 by Adek336, 11 years ago

it's bge not bfe. Why are the function names not demangled?

by Adek336, 11 years ago

Attachment: img_1022.jpg added

bge,gcc2,ConnectionHashDefinition::Compare, EndpointManager::_LookupConnection

comment:9 by Adek336, 11 years ago

img_1309 SegmentReceived not locked Firefox.jpg: happens with modified code when I open up transmission with 30 torrents and after a few minutes try to open Firefox. Crashes before Firefox window opens. Continuing may but doesnt have to result in a deadbeef page fault in "loop consumer thread".

Modification:

int32
TCPEndpoint::SegmentReceived(tcp_segment_header& segment, net_buffer* buffer)
{
        MutexLocker locker(fLock);
+       ASSERT(locker.IsLocked());

        TRACE("SegmentReceived(): buffer %p (%lu bytes) address %s to %s\n"

There's lot more places with MutexLocker but they've been all properly locked as far as I checked.

by Adek336, 11 years ago

Attachment: img_1389.jpg added

comment:10 by Adek336, 11 years ago

img_1389.jpg: photo of KDL. Diagnostic messages added to net_route_private constructor and destructor. Also added to TCPEndpoint::ReadData, where the value of this->fRoute and this->fRoute->interface is dumped to the trace buffer many times during the function.

We see from the backtrace that TCPEndpoint::ReadData is running. We see from the trace messages that it is running since at least the 5093389 trace message. We see from 5093438, that this->fRoute = 0x812c8d48. In 5095383 we see that net_route_private of that address is being destroyed.

It should not be destroyed; that causes TCPEndpoint::ReadData to eventually use pointers overwritten with other data and to page fault.

by Adek336, 11 years ago

Attachment: img_1391.jpg added

by Adek336, 11 years ago

Attachment: img_1392.jpg added

comment:11 by Adek336, 11 years ago

That lock that isn't always locked and the changing routing tables are two problems, a third one is with

add-ons/kernel/network/protocols/tcp/BufferQueue.cpp

    290 status_t
    291 BufferQueue::Get(size_t bytes, bool remove, net_buffer **_buffer)
    292 {
...
    324 	while (bytesLeft > 0 && (source = iterator.Next()) != NULL) {
    325 		size_t size = min_c(source->size, bytesLeft);
    326 		status = gBufferModule->append_cloned(buffer, source, 0, size);
    327 		if (status < B_OK)
    328 			break;
    329 
    330 		bytesLeft -= size;
    331 
    332 		if (!remove)
    333 			continue;
    334 
    335 		// remove either the whole buffer or only the part we cloned
    336 
    337 		fFirstSequence += size;
    338 
    339 		if (size == source->size) {
    340 			iterator.Remove();
    341 			gBufferModule->free(source);
    342 		} else {
    343 			gBufferModule->remove_header(source, size);
    344 			source->sequence += size;
    345 		}
    346 	}
    347 
    348 	if (status == B_OK) {
    349 		*_buffer = buffer;
    350 		if (remove) {
    351 			fNumBytes -= bytes;
    352 			fContiguousBytes -= bytes;
    353 		}
    354 	} else
    355 		gBufferModule->free(buffer);
    356 
    357 	return status;
    358 }


in that the loop may remove buffers in line 340 and later the loop may be broken in line 328. Then, fNumBytes, fContiguos and etc. are not updated in spite of the fact that buffers have been removed. That leaves the buffer queue in an inconsistent state where it, for example, claims to have more data than it really has in its buffers.

comment:12 by axeld, 11 years ago

The BufferQueue problem is a different one, though (ie. bug #2594). Thanks for looking into it, that pretty much explains the problem :-)

comment:13 by Adek336, 11 years ago

Actually I added code to panic if it breaks out of the loop after having removed buffers and yet managed to reproduce #2594.

comment:14 by Adek336, 11 years ago

I mean, it didn't trigger my panic, but it did trigger the one from #2594.

by Adek336, 11 years ago

Attachment: img_1426.jpg added

in reply to:  10 comment:15 by Adek336, 11 years ago

Replying to Adek336:

In 5095383 we see that net_route_private of that address is being destroyed.

img_1389.jpg doesn't show that after all. Uploaded img_1426.jpg which shows what img_1389.jpg was supposed to.

comment:16 by Adek336, 11 years ago

This is not shown in the photo, but many net_route_private instances have absurd reference counts which often exceed fifty thousand.

by Adek336, 11 years ago

Attachment: img_1440.jpg added

by Adek336, 11 years ago

Attachment: img_1441.jpg added

comment:17 by Adek336, 11 years ago

Index: src/add-ons/kernel/network/protocols/tcp/TCPEndpoint.cpp
===================================================================
--- src/add-ons/kernel/network/protocols/tcp/TCPEndpoint.cpp    (wersja 28891)
+++ src/add-ons/kernel/network/protocols/tcp/TCPEndpoint.cpp    (kopia robocza)
@@ -430,6 +430,8 @@
        gStackModule->wait_for_timer(&fPersistTimer);
        gStackModule->wait_for_timer(&fDelayedAcknowledgeTimer);
        gStackModule->wait_for_timer(&fTimeWaitTimer);
+
+       gDatalinkModule->put_route(Domain(), fRoute);
 }

to avoid absurd reference counts in routes.

comment:18 by Adek336, 11 years ago

The crash (a default route being deleted while still being referenced after somebody does put_route() without prior get_route()) can be reproduced very fast if a command like this runs while starting Transmission with many torrents:

while true; do ifconfig /dev/net/3com/0 192.168.1.101 up; done

comment:19 by axeld, 11 years ago

I think there are two problems: 1) TCP is not using net_route_infos but net_routes directly, and 2) some route locking in the stack has apparently been removed some time ago - it doesn't look very good as it is now.

comment:20 by Adek336, 11 years ago

With net_route_infos, to use the route, one has to read net_route_info::route, and then increase it's reference count. But the route may be destroyed in between! Clearly a function to do it atomically is needed. There also seem to be no users of net_route_info atm :-))

Apart from locking, there might be a problem with deleting a default route and setting up a new one. The current code would put the route and try to setup a new one. However, only one default route may exist, so if the action of removing the old route was delayed (because, for example, some small helper function like ipv4_get_mtu got the route and it's thread lost the cpu before putting the route) then the new route won't be set up. Also the old route is going to be removed shortly, so no default routes remain!

One solution is to drop the route from the route list when removing the route. That lets the new route be created and the old route used. If also everybody except the small helper functions use net_route_info than they would soon update and put the old route.

Another solution is, when servicing SIOCDELR, to block servicing SIOCDELR until everyone puts the old route. Again, this requires that everybody use net_route_info and if net_timer would want to remove the route, it might deadlock (but I think only user apps would ever want to remove routes and that net_timer doesn't try to remove routes).

As for why the page fault happens: when net_server tries to reconfigure the default route, the stack puts the route to tell the world that it no longer references it, that it gives up it's ownership of the route. The route still has a positive reference count due to TCP endpoints, so it is not really removed and the new default route isn't installed when another default route exists. Soon net_server wants to reconfigure the default route and the stack puts the route again... so it gives up ownership of a route whom it does not possess ! Possibly net_route_private could remember how many times it was supposed to be removed so that remove_route would panic or return error on second and later tries.

comment:21 by Adek336, 11 years ago

Note that changeset:28892 will expose the fact that TCP doesn't use net_route_info as page faults in TCP destructor.

comment:22 by Adek336, 11 years ago

TCPEndpoint*

in reply to:  20 comment:23 by axeld, 11 years ago

Replying to Adek336:

One solution is to drop the route from the route list when removing the route. That lets the new route be created and the old route used. If also everybody except the small helper functions use net_route_info than they would soon update and put the old route.

That was actually the basic idea behind the route info. I hope to be able to solve this soon, and hopefully finally, unless you want to ;-)

And thanks a lot for your investigations!

comment:24 by umccullough, 11 years ago

Blocking: 3392 added

comment:25 by axeld, 11 years ago

Blocking: 3465 added

(In #3465) Duplicate of #2706.

comment:25 by axeld, 11 years ago

Blocking: 3465 removed
Resolution: fixed
Status: newclosed

Fixed in hrev29386. TCP is still using net_routes directly, but this doesn't cause any crashes anymore.

Note: See TracTickets for help on using tickets.