Opened 17 years ago

Closed 16 years ago

#1641 closed bug (fixed)

KDL: rtl8139

Reported by: kaoutsis Owned by: axeld
Priority: normal Milestone: R1
Component: Drivers/Network Version: R1/pre-alpha1
Keywords: Cc: scottmc, idefix, zharik@…, HubertNG@…
Blocked By: Blocking: #1661, #1890, #2452
Platform: x86

Description

While surfing with Opera, after 5 minutes I ran into this stack trace: rtl8139-new-bug.txt (attached)

Attachments (6)

rtl8139-new-bug.txt (1.7 KB ) - added by kaoutsis 17 years ago.
stack trace for the rtl8139
8139-kdl.txt (32.1 KB ) - added by scottmc 16 years ago.
This was with hrev25673, on an AMD Geode based PC board.
8139-wget-kdl.txt (6.0 KB ) - added by scottmc 16 years ago.
another 8139 kdl caught via serial debug
8139-flooding-port-21-with-zeros-kdl.txt (2.7 KB ) - added by stefan 16 years ago.
backtrace when crashing on flooding port 21 with zeros
DSC00252.JPG (65.9 KB ) - added by Hubert 16 years ago.
hrev27420
DSC00353.JPG (43.1 KB ) - added by Hubert 16 years ago.
rev. 28810 hybrid gcc4


Change History (31)

by kaoutsis, 17 years ago

Attachment: rtl8139-new-bug.txt added

stack trace for the rtl8139

comment:1 by kaliber, 17 years ago

Cc: kaliber added

I have the same problem with wget and Firefox.

comment:2 by koki, 17 years ago

Cc: koki added; kaliber removed

by scottmc, 16 years ago

Attachment: 8139-kdl.txt added

This was with hrev25673, on an AMD Geode based PC board.

comment:3 by scottmc, 16 years ago

Cc: scottmc added

comment:4 by phoudoin, 16 years ago

I got the same KDL with my rtl8139 just last night, with Firefox, sorry, Bon Echo. I'm running hrev25860.

Something is wrong in the BSD network driver compatibility layer, I guess.

comment:5 by idefix, 16 years ago

Cc: idefix added

comment:6 by axeld, 16 years ago

Blocking: 2452 added

(In #2452) Installing npipefs should do no harm, as a) BeOS file systems aren't modules, so they aren't picked up at all, and b) there is no pipefs anymore, as pipes are now implemented differently.

Anyway, this is indeed a dup of #1641.

comment:7 by diver, 16 years ago

Component: - General → Drivers/Network

Same problem here; for me it can crash at boot, when Firefox starts, or a few moments later.
As the rtl8139 is such a common NIC these days, I would recommend fixing it before the alpha.

by scottmc, 16 years ago

Attachment: 8139-wget-kdl.txt added

another 8139 kdl caught via serial debug

comment:8 by stefan, 16 years ago

My system (hrev26909 image) also crashes when another computer on the network sends a lot of data to Haiku's FTP service: cat /dev/zero | nc 192.168.1.199 21 (192.168.1.199 is the Haiku machine). Neither ping -f nor pings with large packet sizes cause any problem.

by stefan, 16 years ago

Attachment: 8139-flooding-port-21-with-zeros-kdl.txt added

backtrace when crashing on flooding port 21 with zeros

comment:9 by mmlr, 16 years ago

See also duplicate #2596 for another backtrace.

comment:10 by anevilyak, 16 years ago

Can you try with hrev27401 and see if the behavior's any better?

comment:11 by Hubert, 16 years ago

I checked hrev27420 and it's the same.

by Hubert, 16 years ago

Attachment: DSC00252.JPG added

comment:12 by siarzhuk, 16 years ago

Cc: zharik@… added

comment:13 by siarzhuk, 16 years ago

To Axel: I frequently observe the same crash ("m_free + 0x0017") on my system with an rtl8139. It is easily reproducible by starting Firefox. ;-)

It looks like the very first access to m_free's parameter fails:

struct mbuf *
m_free(struct mbuf *m)
{
  struct mbuf *next = m->m_next;

  if (m->m_flags & M_EXT)
    mb_free_ext(m);
  else
    object_cache_free(sMBufCache, m);

  return next;
}

I checked this against the disassembly. The failing instructions are:

0x0000df84 push %ebp; m_free code start here
...
0x0000df98 mov 0x8(%ebp), %eax
0x0000df9b mov (%eax), %esi ; <--- KDL!
0x0000df9d testb $0x1, 0x10(%eax)
...

Maybe you have some suggestions before I dig into debugging this problem? :-) It looks like m->m_next becomes invalid at some point; m_freem does not call m_free with a NULL pointer, so the faulting pointer must be non-NULL but invalid.

comment:14 by siarzhuk, 16 years ago

Sorry. :-( Here are the corrected code blocks:

struct mbuf *
m_free(struct mbuf *m)
{
  struct mbuf *next = m->m_next;

  if (m->m_flags & M_EXT)
    mb_free_ext(m);
  else
    object_cache_free(sMBufCache, m);

  return next;
}

disasm:

0x0000df84 push %ebp; m_free code start here 
... 
0x0000df98 mov 0x8(%ebp), %eax 
0x0000df9b mov (%eax), %esi ; <--- KDL! 
0x0000df9d testb $0x1, 0x10(%eax) 
... 

comment:15 by axeld, 16 years ago

Blocking: 1661 added

(In #1661) I'd say it's a duplicate of #1641.

comment:16 by axeld, 16 years ago

Blocking: 1890 added

(In #1890) Duplicate of #1641.

comment:17 by axeld, 16 years ago

You could add ktrace_printf() output to the m_* functions, as well as to compat_read(), and then use the KDL command "traced" to see what exactly happened. Don't forget to a) enable tracing in tracing_config.h, and b) enlarge the tracing buffer.

comment:18 by siarzhuk, 16 years ago

While experimenting with "traced", I observed four kinds of network-related KDLs on my system with the rtl8139:

1) Page fault in m_free, called from compat_read. It looks like the one traced in attachment 8139-kdl.txt.

2) Page fault in m_free, called from m_defrag. It is the one in attachment 8139-wget-kdl.txt mentioned above.

3) Page fault in memcpy_generic, called from devfs_read in the "/dev/net/rtl8139 reader" thread.

4) Page fault in CompareC24ConnectionHashDefinitionRCt4pair2ZPC8sockaddrZPC8sockaddrP11TCPEndpoint. It has already been filed as ticket #2706.

First of all, I investigated "case 1" because I observed it very frequently on my system. The problem occurs as follows: in the interrupt handler, rl_rxeof creates an mbuf for the received data by calling m_devget. Right after this, in the same invocation of the interrupt handler, rl_rxeof creates another mbuf for the next received packet by calling m_devget again. After the interrupt handler has finished, compat_read copies the received data and attempts to free the mbuf created by the first m_devget call. This attempt fails because the m_next field of that mbuf is invalid ("traced" shows that it almost always has the value 0x00000d36).

"Case 2" was observed rarely and looks related to the same problem as "case 1", but during rl_txeof handling.

I thought "case 3" and "case 4" were not related to the mbuf problem.

While browsing Trac tickets for anything related to these problems, I found ticket #2758, which describes a problem in m_devget.

I tried the fix from Adek336 mentioned in that ticket, in compat/sys/mbuf.h:

 #define MLEN            ((int)(MSIZE - sizeof(struct m_hdr)))
-#define MHLEN           ((int)(MSIZE - sizeof(struct pkthdr)))
+#define MHLEN           ((int)(MLEN - sizeof(struct pkthdr)))

Now I can no longer observe "cases 1, 2 and 3" after about 1 hour of stress testing.

Unfortunately the "case 4" (ticket #2706) is still reproducible on my system.

comment:19 by axeld, 16 years ago

Resolution: fixed
Status: new → closed

Looks like this had the same cause as #2758, and should therefore be fixed by hrev27771. Please reopen if not everyone is lucky yet :-)

comment:20 by Hubert, 16 years ago

Cc: HubertNG@… added
Platform: All → x86
Resolution: fixed
Status: closed → reopened

I reproduced this bug in hybrid gcc4 28810 on FF 2.0.0.12

by Hubert, 16 years ago

Attachment: DSC00353.JPG added

rev. 28810 hybrid gcc4

in reply to:  20 ; comment:21 by siarzhuk, 16 years ago

Replying to Hubert:

I reproduced this bug in hybrid gcc4 28810 on FF 2.0.0.12

IMHO your stack crawl looks related to #2706, not to this one. I think #2706 (#2279?) is a better place for it. This one is related to mbuf handling and really was fixed. :-)

in reply to:  21 ; comment:22 by Hubert, 16 years ago

Replying to siarzhuk: I don't know; this is also a "0 reader" bug, not only a "vm_page_fault: unhandled page fault in kernel space at..." one.

in reply to:  22 comment:23 by Hubert, 16 years ago

Replying to siarzhuk:

Oops, sorry, you are right, this is the "0 consumer" error. My big mistake, sorry once more...

comment:24 by Hubert, 16 years ago

I moved it to #2706. Please close this bug again.

comment:25 by mmlr, 16 years ago

Resolution: fixed
Status: reopened → closed