#18334 closed bug (fixed)

Hangs, KDLs with callout processing in the FreeBSD compatibility layer

Reported by: waddlesplash
Owned by: nobody
Priority: normal
Milestone: Unscheduled
Component: Drivers/Network
Version: R1/beta4
Keywords:
Cc:
Blocked By:
Blocking: #18338
Platform: All

Description

Yesterday, in hrev56875, I reworked the callout processing to handle another edge-case in an attempt to fix #18315.

Today, I got reports from two users (bbjimmy and zdykstra) that after updating to the latest nightly, the fbsd callout thread would occasionally spike to 100% CPU usage and stay there until a reboot, while using the ipro1000 driver. After some attempts, I managed to reproduce this myself on bare metal; I have yet to reproduce it in a VM. It is very intermittent; sometimes I get it within minutes, and sometimes it takes much longer to occur. I have not yet found any way of more reliably triggering it, having only had it about 4-5 times in multiple hours of attempts.

After staring at the code today for multiple hours, I deduced one possible way things could end up in an infinite loop: if the linked list became corrupted. Thus, in hrev56881, I added code under KDEBUG that should turn such corruptions into NULL-dereference KDLs. This, it appears, was successful; after this change I started getting such KDLs (a photo is attached).
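The kind of KDEBUG guard described above might look roughly like the following. This is a minimal sketch, not the actual Haiku list code; the structure layout and the pointer-poisoning scheme are assumptions made for illustration:

```c
/* Sketch: a KDEBUG-style guard that turns doubly-linked-list corruption
 * into an immediate NULL dereference (and thus a KDL) on double-removal.
 * Names and layout are illustrative, not the actual Haiku kernel code. */
#include <stddef.h>

#define KDEBUG 1	/* assume debug checks are enabled for this sketch */

struct list_link {
	struct list_link *next;
	struct list_link *prev;
};

static void
list_remove_link(struct list_link *link)
{
	/* Unlink the item. If link->next or link->prev is already NULL
	 * from a previous removal, this dereferences NULL right here,
	 * making the corruption visible at the point it happens. */
	link->next->prev = link->prev;
	link->prev->next = link->next;
#if KDEBUG
	/* Poison the pointers so a second removal crashes immediately
	 * instead of silently corrupting neighboring items. */
	link->next = NULL;
	link->prev = NULL;
#endif
}
```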

However, repeated inspection of the code has revealed no way the problem the KDL shows could actually occur. Items are only removed from the linked list after checking c_due, and all of this happens only while the lock is held. I tried sprinkling ASSERT_LOCKED_MUTEX around the parts of the code that modify the linked list; none have asserted so far.

I also got, while trying to reproduce this problem, an "invalid concurrent access to page" KDL in the page scrubber which looks a lot like #16552 (or the other ones Diver linked in the comments there), only with most of the page fields zeroed out.

I have made various edits to the callout code in an attempt to shore things up or shake things loose (hrev56882, hrev56884, etc.), but none seem to have had any concrete effect.

Attachments (3)

KDL-list_remove_item.png (571.4 KB ) - added by waddlesplash 13 months ago.
fbsd_callout_hrev56884_VM.png (34.6 KB ) - added by Starcrasher 13 months ago.
DSCN7170.1.jpg (2.0 MB ) - added by bbjimmy 13 months ago.
kdl on bare metal.

Change History (20)

by waddlesplash, 13 months ago

Attachment: KDL-list_remove_item.png added

comment:1 by waddlesplash, 13 months ago

I guess it is also worth noting that my last "cleanup/fixes" commit (hrev56884) does affect behavior in at least some potential race cases (based on my cursory examination), but even if those cases were real, they would lead to non-insertions, not double-removals from the linked lists, I think.

The "goto" I introduced yesterday may not have been the brightest idea, but this file already has other "goto"s (not of my design) in it; and anyway, if it were misbehaving, I would have expected the ASSERT_LOCKED_MUTEX to trigger.

Tomorrow I intend to create artificial scenarios that make this problem, whatever it is, more likely to trigger. I will also probably refactor out the "goto", just in case. It will also be interesting to hear from zdykstra and bbjimmy whether they notice any different behavior with today's changes, i.e. whether they get the exact same KDL now, whether it occurs less frequently, or something else.

comment:2 by waddlesplash, 13 months ago

I should also note that there are a number of cases where I could potentially see a double-removal occurring outside the callout thread, but none within it. The removal happens just after the item is fetched from the list, and no lock is released in between. How is that possible?
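Conceptually, the pattern being described is something like the following. This is an illustrative sketch of the fetch-then-remove logic, not the actual compat-layer code; the names (fetch_due_callout, sTimers) and the singly-linked layout are hypothetical:

```c
/* Sketch of the fetch-then-remove pattern: the removal happens
 * immediately after the fetch, with the callout mutex held for the
 * entire scan, so a double-removal from within this path should be
 * impossible. Names are hypothetical, not the actual compat-layer code. */
#include <stddef.h>
#include <stdint.h>

struct callout {
	struct callout *next;
	int64_t c_due;	/* absolute due time; 0 = not scheduled */
};

static struct callout *sTimers;	/* singly-linked for brevity */

/* Caller holds the callout mutex across the whole scan (lock elided). */
static struct callout *
fetch_due_callout(int64_t now)
{
	struct callout **link = &sTimers;
	while (*link != NULL) {
		struct callout *c = *link;
		if (c->c_due > 0 && c->c_due <= now) {
			*link = c->next;	/* removed directly after the fetch */
			c->next = NULL;
			c->c_due = 0;
			return c;	/* handler is invoked after this returns */
		}
		link = &c->next;
	}
	return NULL;
}
```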

comment:3 by zdykstra, 13 months ago

hrev56878, profiling a system with the bug triggered:

profiling results for thread "fbsd callout" (1231):
  tick interval:  1000 us
  total ticks:    12084 (12084000 us)
  unknown ticks:  0 (0 us,   0.00%)
  dropped ticks:  0 (0 us,   0.00%)
  samples/tick:   3.0

        hits     unknown    image
  ------------------------------------------------------------------------------
       12084           0        1 kernel_x86_64
        1763         510     7392 /boot/system/add-ons/kernel/drivers/dev/net/ipro1000

        hits       in us    in %   image  function
  ------------------------------------------------------------------------------
       12084    12084000  100.00       1  common_thread_entry(void*)
       10087    10087000   83.47       1  __system_time_rdtscp()
         743      743000    6.15    7392  callout_thread(void*)
         509      509000    4.21       1  list_get_next_item
         235      235000    1.94       1  system_time
           1        1000    0.01       1  _mutex_unlock
Last edited 13 months ago by zdykstra (previous) (diff)

comment:4 by waddlesplash, 13 months ago

Looking at the code more, after realizing that callouts may reschedule themselves, there may be a case or two that wasn't quite handled properly before hrev56884; and I suppose I did not reproduce this specific bug after that point, which was when I saw the "page scrubber" KDL.

However, I still am having trouble finding a way for a double-removal to occur from within the callout thread. It would have to come from _callout_stop_safe, I think; either that, or an item with NULL next/prev pointers would have to appear removed while other items in the list still point to it. I can't figure out how either case could happen.

comment:5 by zelenoviy, 13 months ago

Same thing on an I225-V: a KDL in list_remove_item after about an hour of idle.

comment:6 by waddlesplash, 13 months ago

With which hrev, precisely?

comment:7 by zelenoviy, 13 months ago

hrev56884, after updating from hrev56871, where the KDL didn't happen (sometimes the network stops working, but that is another story...)

Last edited 13 months ago by zelenoviy (previous) (diff)

comment:8 by Starcrasher, 13 months ago

Got exactly the same KDL in my VM after updating to hrev56884.

by Starcrasher, 13 months ago

Attachment: fbsd_callout_hrev56884_VM.png added

comment:9 by waddlesplash, 13 months ago

What VM, and what network device (PCI ID)?

comment:10 by Starcrasher, 13 months ago

The VM is QEMU + KVM running on Mageia; the card is defined as e1000 in the VM (ipro1000 in Haiku).

comment:11 by waddlesplash, 13 months ago

Indeed I managed to reproduce it in QEMU. That should make testing easier.

Interestingly, I had ProcessController open before the KDL; it appears the "fbsd callout" thread was using a lot of CPU just before the crash. So that may be an important clue.

comment:12 by waddlesplash, 13 months ago

Now I just watched it happen. CPU usage spiked for somewhere around 0.5-1.5 seconds before the crash.

Meanwhile, the if_config_tqg_0 thread has this stack trace:

<kernel_x86_64> reschedule(int) + 0x424
<kernel_x86_64> thread_block + 0xc6
<kernel_x86_64> _mutex_lock + 0x21a
</boot/system/add-ons/kernel/drivers/dev/net/ipro1000> _callout_stop_safe.part.0 + 0x22
</boot/system/add-ons/kernel/drivers/dev/net/ipro1000> callout_reset + 0x30
</boot/system/add-ons/kernel/drivers/dev/net/ipro1000> _task_fn_admin + 0x1fb
</boot/system/add-ons/kernel/drivers/dev/net/ipro1000> gtaskqueue_run_locked + 0xde
</boot/system/add-ons/kernel/drivers/dev/net/ipro1000> gtaskqueue_thread_loop + 0x83
<kernel_x86_64> common_thread_entry(void*) + 0x38
Last edited 13 months ago by waddlesplash (previous) (diff)

comment:13 by waddlesplash, 13 months ago

I added a snooze at the top of _callout_stop_safe. New panic, in the gtaskqueue thread: the mtx_assert failed inside _callout_stop_safe (otherwise the same backtrace as above). Meanwhile, the callout thread had just been woken up and had not yet returned from _mutex_unlock.

After continuing from this KDL, the bug did not manifest; the system continued normally.

comment:14 by waddlesplash, 13 months ago

I think I finally have a reliable way to trigger the problem. Adding the following block to _callout_stop_safe before the MutexLocker creation triggers the KDL very reliably shortly after boot:

	if (c->c_due > 0)
		while (!callout_active(c))
			atomic_get64((int64*)&sCurrentCallout);
Last edited 13 months ago by waddlesplash (previous) (diff)

by bbjimmy, 13 months ago

Attachment: DSCN7170.1.jpg added

kdl on bare metal.

comment:15 by waddlesplash, 13 months ago

I added debug prints and noticed that the looping before the crash is also due to the linked list. I added more assertions, and:

PANIC: ASSERT FAILED (../src/system/kernel/util/list.cpp:74): link->prev != link
Welcome to Kernel Debugging Land...
Thread 326 "fbsd callout" running on CPU 2
stack trace for thread 326 "fbsd callout"
    kernel stack: 0xffffffff81a79000 to 0xffffffff81a7e000
frame                       caller             <image>:function + offset
 0 ffffffff81a7dd08 (+  24) ffffffff8014a9cc   <kernel_x86_64> arch_debug_call_with_fault_handler + 0x16
 1 ffffffff81a7dd20 (+  80) ffffffff800b1628   <kernel_x86_64> debug_call_with_fault_handler + 0x78
 2 ffffffff81a7dd70 (+  96) ffffffff800b2c93   <kernel_x86_64> kernel_debugger_loop(char const*, char const*, __va_list_tag*, int) + 0xf3
 3 ffffffff81a7ddd0 (+  80) ffffffff800b302e   <kernel_x86_64> kernel_debugger_internal(char const*, char const*, __va_list_tag*, int) + 0x6e
 4 ffffffff81a7de20 (+ 240) ffffffff800b3387   <kernel_x86_64> panic + 0xb7
 5 ffffffff81a7df10 (+  80) ffffffff81d06832   </boot/system/add-ons/kernel/drivers/dev/net/ipro1000> callout_reset + 0xc2
 6 ffffffff81a7df60 (+  32) ffffffff81d0622e   </boot/system/add-ons/kernel/drivers/dev/net/ipro1000> invoke_callout(callout*, mtx*) + 0x3d
 7 ffffffff81a7df80 (+  48) ffffffff81d06350   </boot/system/add-ons/kernel/drivers/dev/net/ipro1000> callout_thread(void*) + 0x67
 8 ffffffff81a7dfb0 (+  32) ffffffff8008e698   <kernel_x86_64> common_thread_entry(void*) + 0x38

This looks like the cause. Now to figure out how this happens...

Last edited 13 months ago by waddlesplash (previous) (diff)

comment:16 by waddlesplash, 13 months ago

Blocking: 18338 added

comment:17 by waddlesplash, 13 months ago

Resolution: fixed
Status: new → closed

Fixed in hrev56887!
