Ticket #2172 (closed bug: fixed)

Opened 22 months ago

Last modified 21 months ago

neon's single_read test hangs at 100% CPU

Reported by: andreasf Owned by: axeld
Priority: critical Milestone: R1/alpha1
Component: Network & Internet/TCP Version: R1/pre-alpha1
Keywords: Cc:
Blocked By: Platform: x86
Blocking:

Description

neon 0.25.x SVN branch, virtually unpatched, hangs at the single_read socket test (make check).

CPU stays at 100%, and not even kill -9 <pid> can abort the test.

neon is an HTTP/WebDAV library used by Subversion; I compiled it against the expat XML parser and (optionally) against OpenSSL (no influence).  http://tools.assembla.com/BePorts/wiki/net-misc/neon

Change History

  Changed 22 months ago by andreasf

In r25256 it fails one test earlier, peak_close, with the same symptoms.

No hints in serial output. The kernel debugger seems to indicate that neon's socket app has a single thread, in the running state.

kdebug> threads 3230                                                            
thread         id  state        sem/cv cpu pri  stack      team  name           
0x90d5c800   3230  running        -      0  10  0x90a9a000 3230  socket

follow-up: ↓ 3   Changed 22 months ago by korli

You could try "bt 3230" or "sc 3230" to see the current stacktrace.

in reply to: ↑ 2   Changed 22 months ago by andreasf

(the above should have read peek_close)

Replying to korli:

You could try "bt 3230" or "sc 3230" to see the current stacktrace.

Thanks for the hint.

Still at r25256, it occurred at addr_connect test (another two earlier) this time:

kdebug> threads 174                                                             
thread         id  state        sem/cv cpu pri  stack      team  name           
0x90db7800    174  running        -      1  10  0x90ac9000  174  socket         
kdebug> sc 174                                                                  
stack trace for thread 174 "socket"                                             
    kernel stack: 0x90ac9000 to 0x90acd000                                      
      user stack: 0x7efe7000 to 0x7ffe7000                                      
frame            caller     <image>:function + offset                           
800d053c (+2146630340) 90b183e0

and after continue, during next F12:

kdebug> threads 174                                                             
thread         id  state        sem/cv cpu pri  stack      team  name           
0x90db7800    174  running        -      1  10  0x90ac9000  174  socket         
kdebug> sc 174                                                                  
stack trace for thread 174 "socket"                                             
    kernel stack: 0x90ac9000 to 0x90acd000                                      
      user stack: 0x7efe7000 to 0x7ffe7000                                      
frame            caller     <image>:function + offset                           
00000082 -- read fault

I assume this lack of sensible output indicates some kind of stack corruption?

  Changed 22 months ago by andreasf

Better luck at r25274 with SMP disabled, back to peek_close:

kdebug> threads 177                                                             
thread         id  state        sem/cv cpu pri  stack      team  name           
0x90dbd800    177  running        -      0  10  0x94314000  177  socket         
kdebug> sc 177                                                                  
stack trace for thread 177 "socket"
    kernel stack: 0x94314000 to 0x94318000
      user stack: 0x7efe7000 to 0x7ffe7000
frame            caller     <image>:function + offset
94317928 (+  52) 8008ace7   <kernel>:invoke_debugger_command + 0x00cf
9431795c (+  64) 8008ba90   <kernel>:_ParseCommand__16ExpressionParserRi + 0x01f8
9431799c (+  48) 8008b482   <kernel>:EvaluateCommand__16ExpressionParserPCcRi + 0x01de
943179cc (+ 228) 8008cba4   <kernel>:evaluate_debug_command + 0x0088
94317ab0 (+  64) 80089822   <kernel>:kernel_debugger_loop__Fv + 0x017a
94317af0 (+  48) 8008a4d5   <kernel>:kernel_debugger + 0x010d
94317b20 (+ 192) 8008a3bd   <kernel>:panic + 0x0029
94317be0 (+  48) 807b7df1   </boot/beos/system/add-ons/kernel/bus_managers/ps2>:ps2_interrupt + 0x00cd
94317c10 (+  64) 80028918   <kernel>:int_io_interrupt_handler + 0x00e0
94317c50 (+  48) 80097522   <kernel>:hardware_interrupt + 0x007a
94317c80 (+  12) 8009ab6d   <kernel>:int_bottom + 0x001d (nearest)
iframe at 0x94317c8c (end = 0x94317ce4)
 eax 0x90dbd800     ebx 0x90f01a9c      ecx 0x2129d      edx 0x0
 esi 0x94317dcc     edi 0x94317dcc      ebp 0x94317cf4   esp 0x94317cc0
 eip 0x80029171  eflags 0x282      
 vector: 0x21, error code: 0x0
94317c8c (+ 104) 80029171   <kernel>:mutex_unlock + 0x000d
94317cf4 (+  48) 802f135a   </boot/beos/system/add-ons/kernel/network/protocols/tcp>:Wait__8WaitListRQ28BPrivatet10AutoLocker2Z5mutexZQ28BPrivate12MutexLockingxb + 0x002e
94317d24 (+  64) 802f2d13   </boot/beos/system/add-ons/kernel/network/protocols/tcp>:_WaitForEstablished__11TCPEndpointRQ28BPrivatet10AutoLocker2Z5mutexZQ28BPrivate12Mutex7
94317d64 (+ 112) 802f1c20   </boot/beos/system/add-ons/kernel/network/protocols/tcp>:Connect__11TCPEndpointPC8sockaddr + 0x0234
94317dd4 (+  48) 802f097b   </boot/beos/system/add-ons/kernel/network/protocols/tcp>:tcp_connect__FP12net_protocolPC8sockaddr + 0x0023
94317e04 (+  48) 802d189d   </boot/beos/system/add-ons/kernel/network/stack>:socket_connect__FP10net_socketPC8sockaddrUi + 0x0055
94317e34 (+  48) 802d76f6   </boot/beos/system/add-ons/kernel/network/stack>:stack_interface_connect__FP10net_socketPC8sockaddrUi + 0x002e
94317e64 (+  48) 8004ad85   <kernel>:common_connect__FiPC8sockaddrUib + 0x0049
94317e94 (+ 176) 8004c178   <kernel>:_user_connect + 0x0080
94317f44 (+ 100) 8009ace2   <kernel>:pre_syscall_debug_done + 0x0002 (nearest)
iframe at 0x94317fa8 (end = 0x94318000)
 eax 0x82           ebx 0x3c89f8        ecx 0x7ffe6dd0   edx 0xffff0104
 esi 0x7ffe6e4c     edi 0x20            ebp 0x7ffe6e1c   esp 0x94317fdc
 eip 0xffff0104  eflags 0x200213   
 vector: 0x63, error code: 0x0
94317fa8 (+   0) ffff0104
7ffe6e1c (+  80) 002185c1   <libneon.so.25>:ne_addr_destroy + 0x0085 (nearest)
7ffe6e6c (+  64) 00218688   <libneon.so.25>:ne_sock_connect + 0x006c
7ffe6eac (+  48) 00202742   <socket>:_term_after + 0x0122 (nearest)
7ffe6edc (+  48) 00202870   <socket>:_term_after + 0x0250 (nearest)
7ffe6f0c (+  48) 00202e84   <socket>:_term_after + 0x0864 (nearest)
7ffe6f3c (+  64) 00204e1c   <socket>:main + 0x03c8
7ffe6f7c (+  48) 0020253f   <socket>:_start + 0x005b
7ffe6fac (+  48) 0010078e   2245:runtime_loader_seg0ro@0x00100000 + 0x78e
7ffe6fdc (+   0) 7ffe6fec   2244:socket_main_stack@0x7efe7000 + 0xffffec

  Changed 22 months ago by andreasf

With SMP re-enabled, back to single_read:

stack trace for thread 177 "socket"
    kernel stack: 0x94306000 to 0x9430a000
      user stack: 0x7efe7000 to 0x7ffe7000
frame            caller     <image>:function + offset
94309be8 (+  12) 8009ab6d   <kernel>:int_bottom + 0x001d (nearest)
iframe at 0x94309bf4 (end = 0x94309c4c)
 eax 0x200          ebx 0x20c03         ecx 0x1          edx 0x246
 esi 0x0            edi 0x284f          ebp 0x94309c44   esp 0x94309c28
 eip 0x800971a4  eflags 0x246      
 vector: 0xfd, error code: 0x0
94309bf4 (+  80) 800971a4   <kernel>:arch_int_restore_interrupts + 0x0020
94309c44 (+  32) 80028aa2   <kernel>:restore_interrupts + 0x0012
94309c64 (+  96) 80030824   <kernel>:release_sem_etc + 0x049c
94309cc4 (+  48) 800291b1   <kernel>:mutex_unlock + 0x004d
94309cf4 (+  48) 802f535a   </boot/beos/system/add-ons/kernel/network/protocols/tcp>:Wait__8WaitListRQ28BPrivatet10AutoLocker2Z5mutexZQ28BPrivate12MutexLockingxb + 0x002e
94309d24 (+  64) 802f6d13   </boot/beos/system/add-ons/kernel/network/protocols/tcp>:_WaitForEstablished__11TCPEndpointRQ28BPrivatet10AutoLocker2Z5mutexZQ28BPrivate12Mutex7
94309d64 (+ 112) 802f5c20   </boot/beos/system/add-ons/kernel/network/protocols/tcp>:Connect__11TCPEndpointPC8sockaddr + 0x0234
94309dd4 (+  48) 802f497b   </boot/beos/system/add-ons/kernel/network/protocols/tcp>:tcp_connect__FP12net_protocolPC8sockaddr + 0x0023
94309e04 (+  48) 802d589d   </boot/beos/system/add-ons/kernel/network/stack>:socket_connect__FP10net_socketPC8sockaddrUi + 0x0055
94309e34 (+  48) 802db6f6   </boot/beos/system/add-ons/kernel/network/stack>:stack_interface_connect__FP10net_socketPC8sockaddrUi + 0x002e
94309e64 (+  48) 8004ad85   <kernel>:common_connect__FiPC8sockaddrUib + 0x0049
94309e94 (+ 176) 8004c178   <kernel>:_user_connect + 0x0080
94309f44 (+ 100) 8009ace2   <kernel>:pre_syscall_debug_done + 0x0002 (nearest)
iframe at 0x94309fa8 (end = 0x9430a000)
 eax 0x82           ebx 0x3c89f8        ecx 0x5          edx 0x1800e360
 esi 0x7ffe6e3c     edi 0x20            ebp 0x7ffe6e0c   esp 0x94309fdc
 eip 0xffff0102  eflags 0x200213   
 vector: 0x63, error code: 0x0
94309fa8 (+   0) ffff0102
7ffe6e0c (+  80) 002185c1   <libneon.so.25>:ne_addr_destroy + 0x0085 (nearest)
7ffe6e5c (+  64) 00218688   <libneon.so.25>:ne_sock_connect + 0x006c
7ffe6e9c (+  48) 00202742   <socket>:_term_after + 0x0122 (nearest)
7ffe6ecc (+  48) 00202870   <socket>:_term_after + 0x0250 (nearest)
7ffe6efc (+  64) 00203103   <socket>:_term_after + 0x0ae3 (nearest)
7ffe6f3c (+  64) 00204e1c   <socket>:main + 0x03c8
7ffe6f7c (+  48) 0020253f   <socket>:_start + 0x005b
7ffe6fac (+  48) 0010078e   2220:runtime_loader_seg0ro@0x00100000 + 0x78e
7ffe6fdc (+   0) 7ffe6fec   2219:socket_main_stack@0x7efe7000 + 0xffffec

  Changed 22 months ago by andreasf

At r25278 (with Axel's network fixes) hangs in addr_connect test, without meaningful backtrace.

  Changed 22 months ago by axeld

  • priority changed from normal to critical
  • status changed from new to assigned
  • component changed from Network & Internet to Network & Internet/TCP
  • milestone changed from R1 to R1/alpha1

This might be the same problem that happens with the OpenSSH forwarding test - I will look into that in the next days.

  Changed 22 months ago by axeld

How does it look now, after r25294? At least the OpenSSH forwarding test now works here.

  Changed 22 months ago by andreasf

At r25294 in peek_close:

kdebug> sc 51767
stack trace for thread 51767 "socket"
    kernel stack: 0x95449000 to 0x9544d000
      user stack: 0x7efe7000 to 0x7ffe7000
frame            caller     <image>:function + offset
9544cbf4 (+  32) 8002f112   <kernel>:context_switch__FP6threadT0 + 0x0026
9544cc14 (+  64) 8002f39c   <kernel>:scheduler_reschedule + 0x0268
9544cc54 (+  32) 8003c48c   <kernel>:thread_block_with_timeout_locked + 0x00d0
9544cc74 (+  80) 800306a0   <kernel>:switch_sem_etc + 0x0384
9544ccc4 (+  64) 800302ee   <kernel>:acquire_sem_etc + 0x0026
9544cd04 (+  64) 802f7392   </boot/beos/system/add-ons/kernel/network/protocols/tcp>:Wait__8WaitListRQ28BPrivatet10AutoLocker2Z5mutexZQ28BPrivate12MutexLockingx + 0x006a
9544cd44 (+  48) 802f8e2a   </boot/beos/system/add-ons/kernel/network/protocols/tcp>:_WaitForEstablished__11TCPEndpointRQ28BPrivatet10AutoLocker2Z5mutexZQ28BPrivate12Mutexe
9544cd74 (+  96) 802f7a35   </boot/beos/system/add-ons/kernel/network/protocols/tcp>:Connect__11TCPEndpointPC8sockaddr + 0x0071
9544cdd4 (+  48) 802f697b   </boot/beos/system/add-ons/kernel/network/protocols/tcp>:tcp_connect__FP12net_protocolPC8sockaddr + 0x0023
9544ce04 (+  48) 802d7d41   </boot/beos/system/add-ons/kernel/network/stack>:socket_connect__FP10net_socketPC8sockaddrUi + 0x0055
9544ce34 (+  48) 802ddc0e   </boot/beos/system/add-ons/kernel/network/stack>:stack_interface_connect__FP10net_socketPC8sockaddrUi + 0x002e
9544ce64 (+  48) 8004b3d5   <kernel>:common_connect__FiPC8sockaddrUib + 0x0049
9544ce94 (+ 176) 8004c7c8   <kernel>:_user_connect + 0x0080
9544cf44 (+ 100) 8009b682   <kernel>:pre_syscall_debug_done + 0x0002 (nearest)
iframe at 0x9544cfa8 (end = 0x9544d000)
 eax 0x82           ebx 0x3c89f8        ecx 0x5          edx 0x1800fbe0
 esi 0x7ffe6e30     edi 0x20            ebp 0x7ffe6dfc   esp 0x9544cfdc
 eip 0xffff0102  eflags 0x200217   
 vector: 0x63, error code: 0x0
9544cfa8 (+   0) ffff0102
7ffe6dfc (+ 112) 00218582   <libneon.so.25>:ne_addr_destroy + 0x00c6 (nearest)
7ffe6e6c (+  64) 0021864f   <libneon.so.25>:ne_sock_connect + 0x0073
7ffe6eac (+  48) 00202712   <socket>:_term_after + 0x0122 (nearest)
7ffe6edc (+  48) 00202840   <socket>:_term_after + 0x0250 (nearest)
7ffe6f0c (+  48) 00203014   <socket>:_term_after + 0x0a24 (nearest)
7ffe6f3c (+  64) 00204fac   <socket>:main + 0x03c8
7ffe6f7c (+  48) 0020250f   <socket>:_start + 0x005b
7ffe6fac (+  48) 0010078e   1044226:runtime_loader_seg0ro@0x00100000 + 0x78e
7ffe6fdc (+   0) 7ffe6fec   1044225:socket_main_stack@0x7efe7000 + 0xffffec

Then after continue something new: FAIL (count not connect to server: Operation would block

I'll recheck whether #2195 had any influence on this, and whether my F12 had any influence on it or whether I was just too impatient by now.

  Changed 22 months ago by andreasf

Same revision, after restart, without Pe started, immediately passes peek_close but runs into a ca. two minute timeout in read_and_peek test, again leading to:

FAIL (could not connect to server: Operation would block)

Next try, one test earlier, small_reads, and in the same run in line_overlow and line_mingle (which both passed the run before). Still looks a little random, but CPU is no longer 100% (~0%) and it times out at all.

I noticed for some test earlier that client and server were being set up on the same (single) thread, could that be a problem? Seems like a non-standard scenario.

  Changed 21 months ago by bonefish

  • status changed from assigned to closed
  • resolution set to fixed

Fixed in r25526.

Note: See TracTickets for help on using tickets.