Opened 11 years ago

Closed 10 years ago

#1641 closed bug (fixed)

KDL: rtl8139

Reported by: kaoutsis
Owned by: axeld
Priority: normal
Milestone: R1
Component: Drivers/Network
Version: R1/pre-alpha1
Keywords:
Cc: scottmc, idefix, zharik@…, HubertNG@…
Blocked By:
Blocking: #1661, #1890, #2452
Has a Patch: no
Platform: x86

Description

While surfing with Opera, after about 5 minutes the system dropped into KDL with this stack trace: rtl8139-new-bug.txt (attached)

Attachments (6)

rtl8139-new-bug.txt (1.7 KB) - added by kaoutsis 11 years ago.
stack trace for the rtl8139
8139-kdl.txt (32.1 KB) - added by scottmc 11 years ago.
This was with hrev25673, on an AMD Geode based PC board.
8139-wget-kdl.txt (6.0 KB) - added by scottmc 11 years ago.
another 8139 kdl caught via serial debug
8139-flooding-port-21-with-zeros-kdl.txt (2.7 KB) - added by stefan 11 years ago.
backtrace when crashing on flooding port 21 with zeros
DSC00252.JPG (65.9 KB) - added by Hubert 11 years ago.
hrev27420
DSC00353.JPG (43.1 KB) - added by Hubert 10 years ago.
rev. 28810 hybrid gcc4


Change History (31)

Changed 11 years ago by kaoutsis

Attachment: rtl8139-new-bug.txt added

stack trace for the rtl8139

comment:1 Changed 11 years ago by kaliber

Cc: kaliber added

I have the same problem with wget and firefox.

comment:2 Changed 11 years ago by koki

Cc: koki added; kaliber removed

Changed 11 years ago by scottmc

Attachment: 8139-kdl.txt added

This was with hrev25673, on an AMD Geode based PC board.

comment:3 Changed 11 years ago by scottmc

Cc: scottmc added

comment:4 Changed 11 years ago by phoudoin

I got the same KDL with my rtl8139 just last night, with Firefox, sorry, Bon Echo. I'm running hrev25860.

Something is wrong in the BSD network driver compatibility layer, I guess.

comment:5 Changed 11 years ago by idefix

Cc: idefix added

comment:6 Changed 11 years ago by axeld

Blocking: 2452 added

(In #2452) Installing npipefs should do no harm, as a) BeOS file systems aren't modules, so they aren't picked up at all, and b) there is no pipefs anymore, as pipes are now implemented differently.

Anyway, this is indeed a dup of #1641.

comment:7 Changed 11 years ago by diver

Component: - General → Drivers/Network

Same problem here; for me it can crash at boot, at Firefox start, or a few moments later.
As the rtl8139 is such a common NIC these days, I would recommend fixing it before the alpha.

Changed 11 years ago by scottmc

Attachment: 8139-wget-kdl.txt added

another 8139 kdl caught via serial debug

comment:8 Changed 11 years ago by stefan

My system (hrev26909 image) also crashes when I send a lot of data over the network from another computer to Haiku's FTP service: cat /dev/zero | nc 192.168.1.199 21 (192.168.1.199 is the Haiku machine). ping -f and pings with big packet sizes are no problem.

Changed 11 years ago by stefan

Attachment: 8139-flooding-port-21-with-zeros-kdl.txt added

backtrace when crashing on flooding port 21 with zeros

comment:9 Changed 11 years ago by mmlr

See also duplicate #2596 for another backtrace.

comment:10 Changed 11 years ago by anevilyak

Can you try with hrev27401 and see if the behavior's any better?

comment:11 Changed 11 years ago by Hubert

I checked hrev27420 and the behavior is the same.

Changed 11 years ago by Hubert

Attachment: DSC00252.JPG added

comment:12 Changed 11 years ago by siarzhuk

Cc: zharik@… added

comment:13 Changed 11 years ago by siarzhuk

To Axel: I observe the same crash ("m_free + 0x0017") frequently on my system with rtl8139. It is easily reproducible by starting Firefox. ;-)

It looks like the very first access to the m_free parameter fails:

180 struct mbuf * 181 m_free(struct mbuf *m) 182 { 183 struct mbuf *next = m->m_next; 184 185 if (m->m_flags & M_EXT) 186 mb_free_ext(m); 187 else 188 object_cache_free(sMBufCache, m); 189 190 return next; 191 }

I have checked this against the disassembly log. The instructions that fail are:

0x0000df84 push %ebp; m_free code start here ... 0x0000df98 mov 0x8(%ebp), %eax 0x0000df9b mov (%eax), %esi ; <--- KDL! 0x0000df9d testb $0x1, 0x10(%eax) ...

Maybe you have some suggestions before I try to dig into debugging this problem? :-) It looks like m->m_next becomes invalid at some point, and m_freem cannot be calling m_free with a null pointer.

comment:14 Changed 11 years ago by siarzhuk

Sorry. :-( corrected code blocks:

struct mbuf *
m_free(struct mbuf *m)
{
  struct mbuf *next = m->m_next;

  if (m->m_flags & M_EXT)
    mb_free_ext(m);
  else
    object_cache_free(sMBufCache, m);

  return next;
}

disasm:

0x0000df84 push %ebp; m_free code start here 
... 
0x0000df98 mov 0x8(%ebp), %eax 
0x0000df9b mov (%eax), %esi ; <--- KDL! 
0x0000df9d testb $0x1, 0x10(%eax) 
... 
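
For readers following the disassembly: the first mov loads the argument m into %eax, and the faulting mov reads the word at offset 0 of m, which matches the m->m_next access on the first line of m_free. The sketch below is a user-space toy, not the compat layer code: toy_mbuf, toy_m_free, toy_m_freem and toy_chain are hypothetical names, and malloc/free stand in for the object cache. It only illustrates why a chain walker that stops at NULL is still defenseless against a non-null garbage pointer.

```c
#include <stdlib.h>

/* Toy stand-in for struct mbuf; only the fields the sketch needs. */
struct toy_mbuf {
	struct toy_mbuf *m_next;
	int m_flags;
};

/* Frees one buffer and returns its successor, like m_free above. */
static struct toy_mbuf *
toy_m_free(struct toy_mbuf *m)
{
	/* First access through m: this is where a garbage pointer faults. */
	struct toy_mbuf *next = m->m_next;

	free(m);
	return next;
}

/* Walks a chain until NULL and returns how many buffers were freed. */
static int
toy_m_freem(struct toy_mbuf *m)
{
	int freed = 0;

	while (m != NULL) {
		m = toy_m_free(m);
		freed++;
	}
	return freed;
}

/* Builds a NULL-terminated chain of `length` zeroed buffers. */
static struct toy_mbuf *
toy_chain(int length)
{
	struct toy_mbuf *head = NULL;

	while (length-- > 0) {
		struct toy_mbuf *m = calloc(1, sizeof(*m));
		if (m == NULL)
			break;
		m->m_next = head;
		head = m;
	}
	return head;
}
```

With a well-formed chain, toy_m_freem(toy_chain(3)) frees three buffers and stops. If any m_next held a stray value instead of NULL or a valid buffer, the loop would pass it to toy_m_free, and the very first load would fault, as in the trace.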

comment:15 Changed 11 years ago by axeld

Blocking: 1661 added

(In #1661) I'd say it's a duplicate of #1641.

comment:16 Changed 11 years ago by axeld

Blocking: 1890 added

(In #1890) Duplicate of #1641.

comment:17 Changed 11 years ago by axeld

You could add ktrace_printf() output to the m_* functions, as well as to compat_read(), and then see via the KDL command "traced" what exactly happened. Don't forget to a) enable tracing in tracing_config.h, and b) enlarge the tracing buffer.
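
As a rough illustration of why the buffer size matters, here is a user-space sketch of printf-style tracing into a fixed ring buffer. It is not the kernel tracing facility: toy_trace_printf, toy_trace_oldest, TRACE_ENTRIES and TRACE_ENTRY_LEN are hypothetical names, and the real ktrace_printf() and buffer configuration live in the kernel.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TRACE_ENTRIES	4	/* deliberately tiny; assumed value */
#define TRACE_ENTRY_LEN	64

static char sTraceBuffer[TRACE_ENTRIES][TRACE_ENTRY_LEN];
static unsigned sTraceCount;	/* total number of entries ever written */

/* Hypothetical stand-in for ktrace_printf(): formats into the next slot. */
static void
toy_trace_printf(const char *format, ...)
{
	va_list args;

	va_start(args, format);
	vsnprintf(sTraceBuffer[sTraceCount % TRACE_ENTRIES], TRACE_ENTRY_LEN,
		format, args);
	va_end(args);
	sTraceCount++;
}

/* Returns the oldest entry still in the buffer, or NULL if it is empty. */
static const char *
toy_trace_oldest(void)
{
	unsigned first = 0;

	if (sTraceCount == 0)
		return NULL;
	if (sTraceCount > TRACE_ENTRIES)
		first = sTraceCount - TRACE_ENTRIES;
	return sTraceBuffer[first % TRACE_ENTRIES];
}
```

In this toy, once more than TRACE_ENTRIES messages have been written, the earliest ones are overwritten. With a busy network driver, the allocation that produced a bad mbuf can be many entries before the crash, which is presumably why enlarging the buffer is suggested.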

comment:18 Changed 11 years ago by siarzhuk

During my "traced" games I observed 4 cases of network-related KDLs on my system with rtl8139:

1) page fault in the m_free call from compat_read. It looks like the one traced in attachment 8139-kdl.txt.

2) page fault in the m_free call from m_defrag. It is the one in attachment 8139-wget-kdl.txt, mentioned above.

3) page fault in memcpy_generic call from devfs_read in "/dev/net/rtl8139 reader" thread.

4) page fault in CompareC24ConnectionHashDefinitionRCt4pair2ZPC8sockaddrZPC8sockaddrP11TCPEndpoint. It is already submitted as ticket #2706.

First of all I investigated "case 1", because it was observed very frequently on my system. The problem occurs as follows: in the interrupt handler, rl_rxeof creates an mbuf for the received data by calling m_devget. Right after this, in the same invocation of the interrupt handler, rl_rxeof creates another mbuf with the next packet of received data by calling m_devget again. After the interrupt handler has finished, compat_read copies the received data and attempts to free the mbuf created by the first call of m_devget. This attempt fails because the m_next of this mbuf is invalid ("traced" says that it almost always has the value 0x00000d36).

"Case 2" was observed rarely and looks related to the same problem as "case 1", but during rl_txeof handling.

I think "case 3" and "case 4" are not related to the mbuf problem.

While browsing Trac tickets for something related to these problems, I found ticket #2758, which describes a problem in m_devget.

I tried the fix from Adek336 mentioned in that ticket, in compat/sys/mbuf.h:

 #define MLEN            ((int)(MSIZE - sizeof(struct m_hdr)))
-#define MHLEN           ((int)(MSIZE - sizeof(struct pkthdr)))
+#define MHLEN           ((int)(MLEN - sizeof(struct pkthdr)))

Now I cannot reproduce cases 1, 2, and 3 after about 1 hour of stress testing.

Unfortunately, "case 4" (ticket #2706) is still reproducible on my system.
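
To make the arithmetic behind the one-line change concrete: the old MHLEN subtracts only the packet header from MSIZE, so it overstates the payload room of a packet-header mbuf by exactly sizeof(struct m_hdr). The sketch below uses simplified stand-in structures and an assumed MSIZE of 256, not the real compat/sys/mbuf.h definitions; payload_overrun and neighbor_next_clobbered are hypothetical helpers. It shows one plausible way such an overrun could overwrite the next pointer of an adjacent buffer, without claiming this is exactly what happened in the driver.

```c
#include <stddef.h>
#include <string.h>

#define MSIZE	256	/* size of one mbuf slot; assumed value */

struct mbuf;

/* Simplified stand-ins; the real structures have more fields. */
struct m_hdr {
	struct mbuf *mh_next;
	char *mh_data;
	int mh_len;
	int mh_flags;
};

struct pkthdr {
	void *rcvif;
	int len;
};

#define MLEN		((int)(MSIZE - sizeof(struct m_hdr)))
#define MHLEN_OLD	((int)(MSIZE - sizeof(struct pkthdr)))	/* before the fix */
#define MHLEN_NEW	((int)(MLEN - sizeof(struct pkthdr)))	/* after the fix */

/* Bytes by which a payload of `mhlen` bytes overruns one packet-header mbuf. */
static int
payload_overrun(int mhlen)
{
	int used = (int)(sizeof(struct m_hdr) + sizeof(struct pkthdr)) + mhlen;

	return used > MSIZE ? used - MSIZE : 0;
}

/*
 * Lays out two adjacent slots in a byte arena, fills the first slot's
 * payload area with `mhlen` bytes of 0x36, and reports whether the second
 * slot's next pointer (initially NULL) was overwritten.
 */
static int
neighbor_next_clobbered(int mhlen)
{
	unsigned char arena[2 * MSIZE];
	struct mbuf *next;
	size_t payloadOffset = sizeof(struct m_hdr) + sizeof(struct pkthdr);

	memset(arena, 0, sizeof(arena));
	memset(arena + payloadOffset, 0x36, (size_t)mhlen);
	memcpy(&next, arena + MSIZE + offsetof(struct m_hdr, mh_next),
		sizeof(next));
	return next != NULL;
}
```

In this model, payload_overrun(MHLEN_OLD) equals sizeof(struct m_hdr) and the neighbor's next pointer is clobbered, while with MHLEN_NEW the payload ends exactly at the slot boundary and the neighbor is untouched.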

comment:19 Changed 11 years ago by axeld

Resolution: fixed
Status: new → closed

Looks like this had the same cause as #2758, and should therefore be fixed by hrev27771. Please reopen if not everyone is lucky yet :-)

comment:20 Changed 10 years ago by Hubert

Cc: HubertNG@… added
Platform: All → x86
Resolution: fixed
Status: closed → reopened

I reproduced this bug in hybrid gcc4 28810 on FF 2.0.0.12

Changed 10 years ago by Hubert

Attachment: DSC00353.JPG added

rev. 28810 hybrid gcc4

comment:21 in reply to:  20 ; Changed 10 years ago by siarzhuk

Replying to Hubert:

I reproduced this bug in hybrid gcc4 28810 on FF 2.0.0.12

IMHO: Your stack crawl looks like it is related to #2706, not to this one. I think #2706 (#2279?) is a better place for it. This one is related to mbuf handling and was really fixed. :-)

comment:22 in reply to:  21 ; Changed 10 years ago by Hubert

Replying to siarzhuk: I don't know, but this is bug "0 reader" too, not only "vm_page_fault: unhandled page fault in kernel space at..."

comment:23 in reply to:  22 Changed 10 years ago by Hubert

Replying to siarzhuk:

Oops, sorry, you are right, this is the "0 consumer" error. My big mistake, sorry once more...

comment:24 Changed 10 years ago by Hubert

I moved this to #2706. Please close this bug again.

comment:25 Changed 10 years ago by mmlr

Resolution: fixed
Status: reopened → closed