Opened 9 months ago

Closed 8 months ago

#18550 closed bug (duplicate)

dec21xxx Network driver is broken since GCC upgrade.

Reported by: jmairboeck
Owned by: nobody
Priority: normal
Milestone: Unscheduled
Component: Drivers/Network
Version: R1/Development
Keywords:
Cc: The_Ringmaster, korli
Blocked By:
Blocking: #18593
Platform: All

Description

This is hrev57208.

Since the upgrade to GCC 13, the dec21xxx network driver is broken. This is used as the "legacy" network adapter in Hyper-V.

It results in the following KDL on boot:

PANIC: Unexpected exception "General Protection Exception" occurred in kernel mode! Error code: 0x0

Welcome to Kernel Debugging Land...
Thread 793 "fbsd callout" running on CPU 0
stack trace for thread 793 "fbsd callout"
    kernel stack: 0xffffffff81bdc000 to 0xffffffff81be1000
frame                       caller             <image>:function + offset
 0 ffffffff81be0928 (+  24) ffffffff8014955c   <kernel_x86_64> arch_debug_call_with_fault_handler + 0x16
 1 ffffffff81be0940 (+  80) ffffffff800b2e08   <kernel_x86_64> debug_call_with_fault_handler + 0x78
 2 ffffffff81be0990 (+  96) ffffffff800b44b4   <kernel_x86_64> kernel_debugger_loop(char const*, char const*, __va_list_tag*, int) + 0xf4
 3 ffffffff81be09f0 (+  80) ffffffff800b484e   <kernel_x86_64> kernel_debugger_internal(char const*, char const*, __va_list_tag*, int) + 0x6e
 4 ffffffff81be0a40 (+ 240) ffffffff800b4ba7   <kernel_x86_64> panic + 0xb7
 5 ffffffff81be0b30 (+ 856) ffffffff8014ae1c   <kernel_x86_64> int_bottom + 0x80
kernel iframe at 0xffffffff81be0e88 (end = 0xffffffff81be0f50)
 rax 0x2842022             rbx 0xffffffff821d3c00    rcx 0x0
 rdx 0xffffffff815d3000    rsi 0xffffffff815d3030    rdi 0xffffffff94bd3d6e
 rbp 0xffffffff81be0f70     r8 0xfffffffffffffffd     r9 0xfffffffffffffffc
 r10 0xfffffffffffffff8    r11 0x85                  r12 0xffffffff94bd3d10
 r13 0xffffffff9495172e    r14 0x7fffffffffffffff    r15 0xffffffff821d4168
 rip 0xffffffff81bb5880    rsp 0xffffffff81be0f50 rflags 0x10286
 vector: 0xd, error code: 0x0
 6 ffffffff81be0e88 (+ 232) ffffffff81bb5880   </boot/system/add-ons/kernel/drivers/dev/net/dec21xxx> tulip_txprobe.isra.0 + 0x110
 7 ffffffff81be0f70 (+  64) ffffffff81bc1b2e   </boot/system/add-ons/kernel/drivers/dev/net/dec21xxx> callout_thread(void*) + 0x11e
 8 ffffffff81be0fb0 (+  32) ffffffff8008bd77   <kernel_x86_64> common_thread_entry(void*) + 0x37
 9 ffffffff81be0fd0 (+2118250544) ffffffff81be0fe0   12388:fbsd callout_793_kstack@0xffffffff81bdc000 + 0x4fe0
kdebug>

This was already reported in ticket:18541:9 by The_Ringmaster, but this should be a separate ticket. I just encountered the same problem.

Note: the dec21xxx driver is missing from the component selection in Trac.

Attachments (2)

bootlog.txt (40.0 KB ) - added by jmairboeck 9 months ago.
bootlog-O1.txt (74.7 KB ) - added by jmairboeck 9 months ago.

Change History (29)

comment:1 by Starcrasher, 9 months ago

The_Ringmaster originally posted a screenshot on the forum (https://discuss.haiku-os.org/t/kdl-crash-when-installing-anything-via-pkgman/13789), where he says that it went away. Since you tested with the current nightly, it seems not.

comment:2 by waddlesplash, 9 months ago

Please capture a serial log up to the KDL and attach it.

by jmairboeck, 9 months ago

Attachment: bootlog.txt added

comment:3 by jmairboeck, 9 months ago

Using a self-compiled version of dec21xxx with -O0 for the driver and libfreebsd_network.a boots successfully.

I used the host compiler (configure without arguments), which is currently gcc 13.1 (2023_06_20).

comment:4 by jmairboeck, 9 months ago

Using -O0 only for dec21xxx is sufficient apparently. For freebsd_network it is not needed. I suspected that already because the same system works fine using ipro1000 in VirtualBox. (This is still my "portable" Haiku system which was once bare metal but now is only a hard disk on a USB adapter because that laptop died a few years ago ...)

by jmairboeck, 9 months ago

Attachment: bootlog-O1.txt added

comment:5 by jmairboeck, 9 months ago

With -O1 I get an SMEP violation and page faults, but not anywhere in dec21xxx (see attached log). I continued a few times until there were no more different panics.

I suspect that the scheduler-related stuff is an expected consequence of pausing in the kernel debugger.

comment:6 by jmairboeck, 9 months ago

The driver apparently works with -O1 when compiled with gcc 13.2. However, the original "General Protection Exception" KDL still occurs identically.

comment:7 by waddlesplash, 9 months ago

Please use dis -b 6 and paste the output here.

comment:8 by waddlesplash, 9 months ago

(At the KDL prompt, I mean, so we can see the faulting instruction and disassembly.)

comment:9 by The_Ringmaster, 9 months ago

kdebug> dis -b 6

[*READ/WRITE FAULT (?), pc: 0xffffffff800b9bf2 *] kdebug>

comment:10 by jmairboeck, 9 months ago

My output says exactly the same (using gcc 13.2).

comment:11 by jmairboeck, 9 months ago

I just noticed that the de driver, which is the part that is used in Hyper-V apparently, has been deprecated and was removed from FreeBSD 13. See https://github.com/freebsd/fcp/blob/master/fcp-0101.md

What does this mean for Haiku?

Note that dc (which is also contained in the same Haiku driver) is not affected by this.

comment:12 by waddlesplash, 9 months ago

For now, FreeBSD's APIs haven't changed too much, and so it's not hard to keep around. Hopefully that won't change in the future.

Can you poke around in KDL and try and find out what (if anything) is at the fault memory address, i.e. what area it's in?

comment:13 by jmairboeck, 9 months ago

I tried compiling just the tulip_txprobe function with O1 or O0 (using #pragma GCC optimize). This "fixes" the General Protection Exception, but I get the SMEP violation and page faults instead in both cases.
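For reference, a per-function override of this kind can be sketched roughly as follows. The function names are hypothetical stand-ins, not code from if_de.c; the only parts taken from GCC itself are the documented `#pragma GCC push_options`, `#pragma GCC optimize` and `#pragma GCC pop_options` pragmas.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a driver routine; NOT the real tulip_txprobe. */
#pragma GCC push_options
#pragma GCC optimize ("O0")
uint32_t
checksum_unoptimized(const uint8_t *data, uint32_t length)
{
	uint32_t sum = 0;

	assert(data != NULL || length == 0);
	for (uint32_t i = 0; i < length; i++)
		sum += data[i];
	return sum;
}
#pragma GCC pop_options

/* After pop_options, the options that were in effect before
 * push_options apply again. */
uint32_t
checksum_default(const uint8_t *data, uint32_t length)
{
	uint32_t sum = 0;

	assert(data != NULL || length == 0);
	for (uint32_t i = 0; i < length; i++)
		sum += data[i];
	return sum;
}
```

Both functions compute the same result; only the optimization level used to compile them differs, which makes it possible to compare the generated code for a single function.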

Trying to find the area of the involved addresses (with area contains <address>) just gives the same READ/WRITE FAULT message as before.

comment:14 by jmairboeck, 9 months ago

Applying -O1 or -O0 to the whole file if_de.c doesn't help either. The outcome is the same as above.

comment:15 by waddlesplash, 9 months ago

It's possible that something is trying to jump to an invalid address, though I've no idea how that could be the case here.

I'll set aside some time to look through the disassembly under O1 vs O2 and see if anything jumps out.

comment:16 by waddlesplash, 9 months ago

> Applying -O1 or -O0 to the whole file if_de.c doesn't help either.

Where did you apply it? Before or after the includes block? If after, please try before.

If that makes no difference, please also apply it to glue_de.c. If applying it to both still causes the SMEP violation, but the same flag specified on the command line fixes the problem, then I would suspect the flag is not getting applied correctly.
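One way to check whether an optimization flag actually reaches a given translation unit is GCC's predefined `__OPTIMIZE__` macro, which GCC documents as being defined in all optimizing compilations. The helper below is a hypothetical sketch, not part of the driver; a `#warning` on the same condition would report the result at build time instead.

```c
/* Returns 1 if this translation unit was compiled with optimization
 * enabled (for example -O1 or -O2) and 0 if it was compiled with -O0.
 * Hypothetical debugging helper; not part of the dec21xxx sources. */
int
built_with_optimization(void)
{
#ifdef __OPTIMIZE__
	return 1;
#else
	return 0;
#endif
}
```

Compiling the same file once with -O0 and once with -O2 on the command line should give different return values, which confirms that the build system passes the flag through.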

comment:17 by waddlesplash, 9 months ago

So, inspecting the disassembly, the primary difference between -O2 and -O1 is that the -O2 version uses SSE2 registers and instructions, probably most notably pshufd. Thus it may be interesting to try -mno-sse2 and maybe even also -mno-sse on the whole file.
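To confirm that -mno-sse2 or -mno-sse took effect for a file, GCC's predefined `__SSE__` and `__SSE2__` macros can be tested; they are defined only when the corresponding instruction set is enabled for the compilation. The helper name and bit layout below are assumptions made for illustration.

```c
/* Reports which SSE levels the compiler may use in this translation unit.
 * Bit 0: SSE enabled, bit 1: SSE2 enabled.
 * Hypothetical debugging helper; the bit layout is an arbitrary choice. */
unsigned
enabled_sse_levels(void)
{
	unsigned levels = 0;

#ifdef __SSE__
	levels |= 1u << 0;
#endif
#ifdef __SSE2__
	levels |= 1u << 1;
#endif
	return levels;
}
```

If the function still reports SSE2 after adding -mno-sse2, the flag is not reaching that file.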

comment:18 by waddlesplash, 9 months ago

Cc: korli added

CC korli: both this and #18541 have XSAVEC enabled, perhaps that or something else related to FPU state is involved?

(It appears my VMware setup, which has an ipro1000 device, also has XSAVEC enabled but there are no problems there.)

comment:19 by jmairboeck, 9 months ago (in reply to comment:16)

Replying to waddlesplash:

> > Applying -O1 or -O0 to the whole file if_de.c doesn't help either.
>
> Where did you apply it? Before or after the includes block? If after, please try before.
>
> If that makes no difference, please also apply it to glue_de.c. If applying it to both still causes the SMEP violation, but the same flag specified on the command line fixes the problem, then I would suspect the flag is not getting applied correctly.

I added CCFLAGS on [ FGristFiles if_de.o ] = -O0 ; just before the SubDirCcFlags line. I checked with jam -dx and the flag seemed to be applied correctly. I'll try adding glue_de too.

comment:20 by jmairboeck, 9 months ago

Now this is weird: I tried adding -O0 to SubDirCcFlags again (adding it just for the glue file didn't work), and now that doesn't work any more either. This did work before (see above). I still get the SMEP violation.

comment:21 by jmairboeck, 9 months ago

Now I booted it again and it works! It seems like the SMEP violation doesn't always occur (which makes debugging it quite a bit harder ...).

comment:22 by waddlesplash, 8 months ago

Blocking: 18593 added

comment:23 by waddlesplash, 8 months ago

Please retest after hrev57286.

comment:24 by The_Ringmaster, 8 months ago

Still an issue for me; same KDL output upon boot.

comment:25 by jmairboeck, 8 months ago

A freshly downloaded image of hrev57287 did boot successfully in Hyper-V (app_server started and it showed the FirstBootPrompt). However, I couldn't test anything else yet because capturing the mouse in Hyper-V doesn't work over Remote Desktop apparently.

comment:26 by jmairboeck, 8 months ago

I tested this again today and it works now. The network is still very slow, however.

comment:27 by waddlesplash, 8 months ago

Resolution: duplicate
Status: new → closed

Well, that's a separate problem.

The_Ringmaster: I find it odd that you're still getting a KDL ... are you sure you upgraded past the fix revision? If so, and it still happens, then you can open a new ticket I suppose, or we can reopen your old one (if it's really the same.)