Opened 11 years ago

Closed 9 years ago

#2598 closed bug (fixed)

sshd occasionally segfaults on connect (possibly when forking)

Reported by: anevilyak Owned by: bonefish
Priority: normal Milestone: R1
Component: System/Kernel Version: R1/pre-alpha1
Keywords: Cc:
Blocked By: Blocking:
Has a Patch: no Platform: All

Description

Once in a while when connecting via ssh, the sshd segfaults. However, neither gdb nor debug_server seem to be able to make heads or tails of it, gdb displays several hundred stack entries all of which are ?? followed by

Previous frame inner to this frame (corrupt stack?)

Whereas debug_server lists the following in syslog:

KERN: vm_soft_fault: va 0xb000 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xb7d6, ip 0xb7d6, write 0, user 1, thread 0xe3
KERN: vm_page_fault: thread "sshd" (227) in team "sshd" (227) tried to read address 0xb7d6, ip 0xb7d6 ("???" +0xb7d6)
KERN: debug_server: TeamDebugHandler::Init(): Failed to get executable path of team 227: No such file or directory
KERN: debug_server: Thread 227 entered the debugger: Segment violation
KERN: stack trace, current PC 0xb7d6  :
KERN:   (0x7ffeea84)  0x7ffeeaa0  (sshd_main_stack + 0xfffaa0)
KERN:   (0x7ffeed4c)  0x20e4f1  (sshd_seg0ro + 0xe4f1)
KERN:   (0x7ffeee2c)  0x20fbe2  (sshd_seg0ro + 0xfbe2)
KERN:   (0x7ffeef7c)  0x20c9af  (sshd_seg0ro + 0xc9af)
KERN:   (0x7ffeefac)  0x1008ea  (runtime_loader_seg0ro + 0x8ea)

Attachments (1)

sshd_debug_log (17.9 KB) - added by anevilyak 11 years ago.
Log from entire sshd session.

Download all attachments as: .zip

Change History (41)

comment:1 Changed 11 years ago by anevilyak

Owner: changed from dlmcpaul to bonefish

Wrong owner.

comment:2 Changed 11 years ago by bonefish

Unfortunately the ssh executable seems to be stripped, so the stack trace is not telling that much. "sshd_seg0ro + 0xe4f1" is the return address of a select() call. Since select() basically only calls the syscall, nothing can really go wrong. The IP definitely shouldn't point to the stack. My best guess is that a signal occurred during the syscall and a signal handler was (or should have been) invoked, and something went wrong.

You don't have by any chance syscall and signal tracing enabled? That would at least help to verify the theory.

The description reads like this bug is not really new. Do you have an estimate when you first encountered it?

comment:3 Changed 11 years ago by anevilyak

Honestly, for at least 2 months now, it's just that the last time (when I mentioned it to you in email since I didn't think the completely bogus trace from gdb would be helpful) I didn't know about the trace the debug_server puts out. If it helps it does seem to occur more frequently if I do something that involves sending/receiving a large amount of data across a previous session, log out, and then log back in, for instance doing an svn checkout from a session. It never occurs in an already active session, only during login. I'm relatively certain I always leave syscall/signal tracing enabled so I'll check the trace output on those next time.

comment:4 Changed 11 years ago by anevilyak

I'm curious by the way, what's the "KERN: debug_server: TeamDebugHandler::Init(): Failed to get executable path of team 227: No such file or directory" error all about? Is that because of stripping also or is it something else?

comment:5 Changed 11 years ago by anevilyak

Caught it again, trace output as follows:

kdebug>traced 0 30 -1 filter team 195
4014894. [   195] 16916220581: syscall post: _kern_select() -> 0x8000000a
4014895. [   195] 16916220582: signal handle:  0x10
4014896. [   195] 16916220582: signal exec handler: signal: 5, handler: 0x0020cc88
4014984. [   195] 16916248011: signal send: target: 64907, signal: 12 (CONT), flags: 0x20000
printed 4 entries within range 1 to 4100900 (4100900 of 4100900 total, 135147852 ever)
kdebug>

comment:6 Changed 11 years ago by anevilyak

64907 btw is the "team 195 debug task" thread.

comment:7 in reply to:  4 Changed 11 years ago by bonefish

Status: newassigned

Replying to anevilyak:

I'm curious by the way, what's the "KERN: debug_server: TeamDebugHandler::Init(): Failed to get executable path of team 227: No such file or directory" error all about? Is that because of stripping also or is it something else?

No, I think it's just another bug. When running sshd listimage doesn't show any images (through which the debug server normally gets the executable name). Maybe something we miss to do on fork(). Needs looking into.

Replying to anevilyak:

64907 btw is the "team 195 debug task" thread.

Yeah, that happens when the debugger is installed for the team.

Anyway, it looks as suspected: The thread gets a SIGCHLD (probably after fork()) while doing a select(), the signal handler is invoked, and the thread crashes. 28 ms pass, though, which seems quite a bit of time, but maybe the system was a bit busy.

Looking into the ssh sources (particularly how the signal handler is installed and what it does) might yield some new leads. Will do that.

comment:8 Changed 11 years ago by anevilyak

It shouldn't have been busy, the system was idle at the time, though it's possible the page writer was still writing things to disk as it had just finished a jam -aq.

comment:9 Changed 11 years ago by bonefish

Component: Applications/Command Line ToolsSystem/Kernel

The SIGCHLD handlers in the OpenSSH code look rather unsuspicious. So this is most likely a kernel issue. Not sure what could cause it, though.

comment:10 Changed 11 years ago by anevilyak

I can reproduce it pretty reliably here... is there any extra information/tracing I can enable that would help? (for reference, it's happened to me quite consistently if I did the following steps:

  • log into ssh remotely.
  • . /system/boot/SetupEnvironment
  • svn+ssh checkout of haiku sources
  • configure --use-gcc-pipe --include-3rdparty
  • jam -q
  • log out / log back in -> crash.

Unfortunately somewhat time consuming steps but...I haven't yet managed to find a way to come up with a smaller test case. A checkout alone didn't do it anyways.

comment:11 Changed 11 years ago by anevilyak

I just realized something: why is it receiving a SIGCHLD at that point in the first place? I thought that signal is generally used to indicate that a child exited, whereas if this crash happens on login one would expect a child to have just been created if I understand the way the sshd works correctly.

comment:12 Changed 11 years ago by bonefish

In Haiku SIGCHLD is also sent when a child is stopped (in other OSs also when a child is continued), though that probably doesn't happen in this case.

The ssh connection initiation is somewhat complex, which, I believe, has to do with how the privilege separation works. According to the team tracing, the daemon forks and re-executes. This new process forks off a child, which exits a bit later (probably after authentication), and finally forks off another child executing the shell. So the exiting child seems to be normal.

ATM I've no clue how things can go wrong. The signal handler should execute several syscalls, which should show up in the tracing, but don't. So one has to assume the signal handler is somehow set up incorrectly. An incorrect stack wouldn't explain why no syscall shows up, though, since as long as at least the eip is set correctly, the correct code should be executed. I'll probably have to add more debugging means to understand this better.

comment:13 Changed 11 years ago by anevilyak

Understood. Is there any way you could build a non-stripped set of OpenSSH binaries, as that might at least yield some hints as to what calls were being made at the time? Any other debug tests you'd suggest I could run in the meantime? maybe turning on the extra debug logging options in sshd? Please let me know what I can do to help track this one down :)

comment:14 Changed 11 years ago by anevilyak

Hi Ingo,

I tried this test again but using sshd -ddd, and got slightly different but still interesting behavior. This time it didn't actually segfault, but sshd did somehow completely exit after I logged out the session: last few lines of debugging output from sshd:

debug2: channel 0: rcvd adjust 32778
debug1: server_input_channel_req: channel 0 request window-change reply 0
debug1: session_by_channel: session 0 channel 0
debug1: session_input_channel_req: session 0 req window-change
debug1: server_input_channel_req: channel 0 request window-change reply 0
debug1: session_by_channel: session 0 channel 0
debug1: session_input_channel_req: session 0 req window-change
debug1: Received SIGCHLD.
debug1: session_by_pid: pid 231
debug1: session_exit_message: session 0 channel 0 pid 231
debug2: channel 0: request exit-status confirm 0
debug1: session_exit_message: release channel 0
debug2: channel 0: write failed
debug2: channel 0: close_write
debug2: channel 0: output open -> closed
debug1: session_pty_cleanup: session 0 release /dev/tt/p1
debug2: channel 0: read<=0 rfd 7 len 0
debug2: channel 0: read failed
debug2: channel 0: close_read
debug2: channel 0: input open -> drain
debug2: channel 0: ibuf empty
debug2: channel 0: send eof
debug2: channel 0: input drain -> closed
debug2: channel 0: send close
debug2: notify_done: reading
debug3: channel 0: will not send data after close
debug2: channel 0: rcvd close
debug3: channel 0: will not send data after close
debug2: channel 0: is dead
debug2: channel 0: gc: notify user
debug1: session_by_channel: session 0 channel 0
debug1: session_close_by_channel: channel 0 child 0
debug1: session_close: session 0 pid 0
debug2: channel 0: gc: user detached
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: server-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 server-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)

debug3: channel 0: close_fds r -1 w -1 e -1 c -1
Connection closed by 192.168.1.126
debug1: do_cleanup
Closing connection to 192.168.1.126

Changed 11 years ago by anevilyak

Attachment: sshd_debug_log added

Log from entire sshd session.

comment:15 Changed 11 years ago by anevilyak

Possibly related, possibly a different bug, but in revisions of the last week or so I've been seeing some new behavior: on large data transfers (read: > 1GB or so), sshd keeps spontaneously terminating itself with the following error message:

Disconnecting: Corrupted MAC on input.

I haven't seen any changes to the network stack that would influence something like this, and I didn't think the I/O scheduler framework was in any way involved so this one confuses me.

comment:16 Changed 11 years ago by anevilyak

Confirmed, that appears to have been caused by 26980/26981. Will file a new bug.

comment:17 Changed 11 years ago by anevilyak

Since the segfault problem still remains though...have the patches for OpenSSH to build/work with Haiku been submitted upstream? If so I could pull the source and run the testsuites and such myself to see if I can help track this down.

comment:18 in reply to:  17 Changed 11 years ago by bonefish

Replying to anevilyak:

Since the segfault problem still remains though...have the patches for OpenSSH to build/work with Haiku been submitted upstream? If so I could pull the source and run the testsuites and such myself to see if I can help track this down.

Nope, the patches are available in our repository in buildtools/patches/openssh/, though.

BTW, weird, I definitely wrote a reply to your comment 6 days ago. I wonder what happened. Anyway, I was only saying that I don't think the problems are related -- but it seems you found the causing revision already -- and that the SSH sources don't really look helpful at the place where the error message is emitted.

comment:19 Changed 11 years ago by marcusoverhagen

What does the kernel print when it enters KDL (first 20 lines should do)

Dumping the stack content, perhaps +/- 100 byte around esp could give some info, too.

comment:20 in reply to:  19 Changed 11 years ago by anevilyak

Replying to marcusoverhagen:

What does the kernel print when it enters KDL (first 20 lines should do)

Dumping the stack content, perhaps +/- 100 byte around esp could give some info, too.

Do you mean by jumping into KDL when sshd segfaults and then asking for the team stack trace for sshd or what? In any case, I won't be able to check for a few days, I'm in the middle of moving at the moment so all my computers are packed.

Regards,

Rene

comment:21 Changed 11 years ago by anevilyak

OK so, a non-stripped binary behaves exactly the same way (165 line backtrace, most of which are ??'s). However, I do notice that every single time it crashes, the trace starts in select() in one of the earlier frames. I seem to remember mmu_man hitting some problems in select() in netsurf also though. The stack trace is oddly filled with addresses like 0xcccccccc and 0xffffffff which to me look like malloc debugging-related fillers.

comment:22 Changed 11 years ago by anevilyak

Also, syscall/signal tracing looks identical to that posted in a previous comment, post select, sigchld and all.

comment:23 Changed 11 years ago by bonefish

I believe the gdb stack trace is bad due to a missing feature -- the handling of signal trampoline code is currently missing, so it probably restores the stack frame's registers incorrectly, thus totally screwing things up later. The debug server stack trace should still be fine. The stack trace from the kernel debugger won't be able to resolve static symbols, but it will also include all register values at the point of the crash.

Unlike I wrote earlier, the stack address in the return addresses of the stack trace is totally fine. When a signal handler is invoked, the kernel copies a bit of code onto the stack that will be executed when the signal handler returns. The stack trace suggests that the stack frame of the signal handler function still exists, i.e. it hasn't returned yet.

You could do the following to investigate this further:

  • Add a few ktrace_printf()s in the signal handler function. Hopefully helps to understand where it crashes.
  • Get a stack trace from KDL. Maybe knowing the registers helps somehow.
  • Disassemble the code at the position of the signal handler. The (signal) kernel tracing gives you the address. in_context <thread ID> dis <address> will do the trick. This should help to rule out that someone scrambled the memory.
  • Dump a bit of the stack. In doubt up to the address of the code on the stack (0x7ffeeaa0 in the above stack trace), e.g. in_context <thread ID> dw <address> 32.

Good luck!

comment:24 Changed 11 years ago by anevilyak

Thanks for the tips, will see what I can find out with that. Unfortunately I lost the stack trace for this last run in syslog because mine is flooded with debug messages from the page writer (perhaps unintentionally left enabled)?

comment:25 Changed 11 years ago by anevilyak

OK, got the crash again, trace from syslog as follows (note, this time it happened in mid data transfer, not on forking a new child connection though):

USER: Disconnecting: Bad packet length 905182364.
KERN: vm_soft_fault: va 0xb2000 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xb2ee0, ip 0x40f159, write 0, user 1, thread 0x14dec
KERN: vm_page_fault: thread "sshd" (85484) in team "sshd" (85484) tried to read address 0xb2ee0, ip 0x40f159 ("libroot.so_seg0ro" +0x33159)
KERN: stack trace:
KERN:   0x0040fbaaKERN:  (libroot.so_seg0ro + 0x33baa)
KERN:   0x004112e1 (libroot.so_seg0ro + 0x352e1)
KERN:   0x0023b2ca (sshd_seg0ro + 0x3b2ca)
KERN:   0x0020e457 (sshd_seg0ro + 0xe457)
KERN:   0x0020fc63 (sshd_seg0ro + 0xfc63)
KERN:   0x0020ca1f (sshd_seg0ro + 0xca1f)
KERN:   0x001008eaKERN:  (runtime_loader_seg0ro + 0x8ea)
KERN:   0x7ffeefec (sshd_main_stack + 0xffffec)
KERN: vm_soft_fault: va 0x0 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x0, ip 0x800cde34, write 0, user 0, thread 0x14dec
KERN: debug_server: Thread 85484 entered the debugger: Segment violation
KERN: stack trace, current PC 0x40f159  freeBlock__Q28BPrivate9hoardHeapRPQ28BPrivate5blockRPQ28BPrivate10superblockiPQ28BPrivate11processHeap + 0x879:
KERN:   (0x7ffeec9c)  0x40fbaa  free__Q28BPrivate11processHeapPv + 0x1ee
KERN:   (0x7ffeecec)  0x4112e1  free + 0xb1
KERN:   (0x7ffeed1c)  0x23b2ca  xfree + 0x36
KERN:   (0x7ffeed4c)  0x20e457  server_accept_loop + 0x143
KERN:   (0x7ffeee1c)  0x20fc63  main + 0x1137
KERN:   (0x7ffeef7c)  0x20ca1f  _start + 0x5b
KERN:   (0x7ffeefac)  0x1008ea  runtime_loader + 0x126

comment:26 Changed 11 years ago by anevilyak

Kernel stack trace:

stack trace for thread 85484 "sshd"
    kernel stack: 0x801b5000 to 0x801b9000
      user stack: 0x7efef000 to 0x7ffef000
frame            caller     <image>:function + offset
 0 801b87ac (+  48) 80045ef1   <kernel_x86>:context_switch__FP6threadT0 + 0x0039
 1 801b87dc (+  64) 80046170   <kernel_x86>:scheduler_reschedule + 0x0248
 2 801b881c (+  64) 80047504   <kernel_x86>:switch_sem_etc + 0x036c
 3 801b885c (+  64) 8004716a   <kernel_x86>:acquire_sem_etc + 0x0026
 4 801b889c (+  80) 800446d6   <kernel_x86>:read_port_etc + 0x0152
 5 801b88ec (+1424) 80061f4a   <kernel_x86>:thread_hit_debug_event_internal__F22debug_debugger_messagePCvlbRb + 0x02fa
 6 801b8e7c (+  64) 800621bf   <kernel_x86>:thread_hit_debug_event__F22debug_debugger_messagePCvlb + 0x002f
 7 801b8ebc (+  80) 800623ab   <kernel_x86>:user_debug_exception_occurred + 0x005b
 8 801b8f0c (+  80) 800bb374   <kernel_x86>:vm_page_fault + 0x0314
 9 801b8f5c (+  64) 800cab15   <kernel_x86>:page_fault_exception + 0x00d9
10 801b8f9c (+  12) 800ce1cf   <kernel_x86>:int_bottom_user + 0x005a
user iframe at 0x801b8fa8 (end = 0x801b9000)
 eax 0xb2ee0        ebx 0x48d4fc        ecx 0x180015a0   edx 0x180015a0
 esi 0x0            edi 0x1802f928      ebp 0x7ffeec9c   esp 0x801b8fdc
 eip 0x40f159    eflags 0x210297   user esp 0x7ffeec64
 vector: 0xe, error code: 0x4
11 801b8fa8 (+   0) 0040f159   </boot/beos/system/lib/libroot.so@0x003dc000>:unknown + 0x33159
12 7ffeec9c (+  80) 0040fbaa   </boot/beos/system/lib/libroot.so@0x003dc000>:unknown + 0x33baa
13 7ffeecec (+  48) 004112e1   </boot/beos/system/lib/libroot.so@0x003dc000>:unknown + 0x352e1
14 7ffeed1c (+  48) 0023b2ca   </boot/home/config/bin/sshd@0x00200000>:unknown + 0x3b2ca
15 7ffeed4c (+ 208) 0020e457   </boot/home/config/bin/sshd@0x00200000>:unknown + 0xe457
16 7ffeee1c (+ 352) 0020fc63   </boot/home/config/bin/sshd@0x00200000>:unknown + 0xfc63
17 7ffeef7c (+  48) 0020ca1f   </boot/home/config/bin/sshd@0x00200000>:unknown + 0xca1f
18 7ffeefac (+  48) 001008ea   </boot/beos/system/runtime_loader@0x00100000>:unknown + 0x08ea
19 7ffeefdc (+   0) 7ffeefec   2599177:sshd_main_stack@0x7efef000 + 0xffffec

traced 0 30 -1 filter team 85484 turns up nothing in this case, oddly enough, so I'm not sure where to start disassembling at.

comment:27 Changed 11 years ago by anevilyak

In case it's at all useful, disassembly at the program counter given in debug_server's trace:

kdebug> in_context 85484 dis 0x40f159
0x0040f159:           8b34b0    mov (%eax,%esi,4), %esi
0x0040f15c:             85f6    test %esi, %esi
0x0040f15e:             7522    jnz 0x40f182
0x0040f160:     8d838433feff    lea 0xfffe3384(%ebx), %eax
0x0040f166:               50    push %eax
0x0040f167:       68a6010000    push $0x1a6
0x0040f16c:     8d834430feff    lea 0xfffe3044(%ebx), %eax
0x0040f172:               50    push %eax
0x0040f173:     8d83b333feff    lea 0xfffe33b3(%ebx), %eax
0x0040f179:               50    push %eax

comment:28 Changed 11 years ago by anevilyak

I just noticed also that in this particular case the backtrace was accessible from gdb:

(gdb) bt
#0  0x0040f159 in BPrivate::hoardHeap::freeBlock ()
   from /boot/beos/system/lib/libroot.so
#1  0x0040fbaa in BPrivate::processHeap::free ()
   from /boot/beos/system/lib/libroot.so
#2  0x004112e1 in free () from /boot/beos/system/lib/libroot.so
#3  0x0023b2ca in xfree (ptr=0x18031918) at xmalloc.c:81
#4  0x0020e457 in server_accept_loop (sock_in=0x7ffeef24, sock_out=0x7ffeef28, 
    newsock=0x7ffeef20, config_s=0x7ffeef2c) at sshd.c:1058
#5  0x0020fc63 in main (ac=1, av=0x1800e390) at sshd.c:1651

comment:29 Changed 11 years ago by anevilyak

Got it with a new connection this time, same stack trace as shown in the previous case involving hoard though. This time I actually have tracing info though, as follows:

traced 0 30 -1 filter team 287
3774754. [   287] 2703350606: syscall post: _kern_select() -> 0x8000000a
3774755. [   287] 2703350607: signal handle:  0x10
3774756. [   287] 2703350607: signal exec handler: signal: 5, handler: 0x0020ccf8
3774757. [   287] 2703350630: syscall pre:  _kern_wait_for_child(0xffffffff, 0x1, 0x7ffeea2c, 0x7ffeea30)
3774758. [   287] 2703350636: syscall post: _kern_wait_for_child() -> 0x120
3774759. [   287] 2703350637: syscall pre:  _kern_wait_for_child(0xffffffff, 0x1, 0x7ffeea2c, 0x7ffeea30)
3774760. [   287] 2703350637: syscall post: _kern_wait_for_child() -> 0x80007002
3774761. [   287] 2703350645: syscall pre:  _kern_sigaction(0x5, 0x0, 0x7ffeea14)
3774762. [   287] 2703350646: syscall post: _kern_sigaction() -> 0x0
3774763. [   287] 2703350647: syscall pre:  _kern_restore_signal_frame()
3774764. [   287] 2703350649: syscall post: _kern_select() -> 0x8000000a
3774973. [   287] 2703423602: signal send: target: 23132, signal: 12 (CONT), flags: 0x20000
printed 12 entries within range 1 to 3958033 (3958033 of 3958033 total, 26863044 ever)

Assuming I understood the address you want disassembly from correctly out of that, disassembly output as follows:

kdebug> in_context 287 dis 0x0020ccf8
0x0020ccf8:               55    push %ebp
0x0020ccf9:             89e5    mov %esp, %ebp
0x0020ccfb:           83ec1c    sub $0x1c, %esp
0x0020ccfe:               57    push %edi
0x0020ccff:               56    push %esi
0x0020cd00:               53    push %ebx
0x0020cd01:       e800000000    call 0x20cd06
0x0020cd06:               5b    pop %ebx
0x0020cd07:     81c3c6370500    add $0x537c6, %ebx
0x0020cd0d:       e832f7ffff    call 0x20c444
kdebug>

last address on the stack in this case was 0x7ffeec9c, dw dump:

kdebug> in_context 287 dw 0x7ffeec9c 32
[0x7ffeec9c]  ìì..ª.@.d+...ì..   7ffeecec 0040fbaa 18002b64 7ffeece4
[0x7ffeecac]  èì..........3..   7ffeece8 00000000 18000000 80060633
[0x7ffeecbc]  =@
                [0x7ffeeccc]  .....ÔH.....s...   00000001 0048d4fc 00000001 00000073
[0x7ffeecdc]  ............(...   ffff0104 ffffffff 18031910 1802f928
[0x7ffeecec]  .í..á.A.........   7ffeed1c 004112e1 18000000 18031918
[0x7ffeecfc]  ÌÌÌÌ=.A.Ì.&.....   cccccccc 0041123d 002604cc 18031918
[0x7ffeed0c]  ................   00000000 00000000 ffffffff ffffffff

Leaving box in kernel debugger for the time being in case further information is helpful. Note that at this point I can pretty much reproduce the crash 100% of the time by simply ssh'ing into the box via putty from my windows laptop and running jam -aq on the haiku tree. Either late in the build process, or by logging out and back in after the build completes, it will crash every time.

comment:30 Changed 11 years ago by bonefish

Unfortunately this looks all completely different from the earlier cases. The program does no longer crash with an invalid PC address while in the signal handler, but somewhere in the memory allocator. The signal handler has already finished according to the kernel tracing. The "dw" output looks a bit scrambled; not that I knew what to look for in this new situation.

If the crashes continue to happen in the memory allocator, you could compile libroot with debug info, copy the sources over (at least the ones from the allocator (src/system/libroot/posix/malloc)), and have a closer look with gdb.

It would also be interesting why the behavior has changed now. Did you do the previous test before hrev27925? That's at least something that changed in the allocator. If not, setting kHeapIncrement to B_PAGE_SIZE (in .../malloc/arch_specific.cpp) might bring back the old behavior. Depending on whether a gdb session turns something up with the current behavior, it might be easier to analyze the old one, since it crashes meaner and probably earlier.

comment:31 in reply to:  30 Changed 11 years ago by anevilyak

Replying to bonefish:

It would also be interesting why the behavior has changed now. Did you do the previous test before hrev27925? That's at least something that changed in the allocator. If not, setting kHeapIncrement to B_PAGE_SIZE (in .../malloc/arch_specific.cpp) might bring back the old behavior. Depending on whether a gdb session turns something up with the current behavior, it might be easier to analyze the old one, since it crashes meaner and probably earlier.

That's actually possible, I seem to remember I still got the old behavior only a few days ago. Crashing in flight like it is now is new for me though, it used to only crash on new connection, never in mid data transfer. Though I think mmlr said he saw problems like that on his C2Q sometimes. Will try changing the heap increment tonight and see what happens.

comment:32 Changed 11 years ago by anevilyak

Hm...changing the heap increment to B_PAGE_SIZE seems to have made no difference at all, still consistently crashing in hoard now. Will try to roll back commits a bit and see where the difference lies.

comment:33 Changed 11 years ago by anevilyak

Finally hit this crash again with a current build. Oddly I've confirmed one thing: it only happens with a stripped build, the one with symbols results in the allocator crash every single time.

In any case, backtrace of sshd from the kernel:

stack trace for thread 264 "sshd"
    kernel stack: 0xa1f3a000 to 0xa1f3e000
      user stack: 0x7efef000 to 0x7ffef000
frame               caller     <image>:function + offset
 0 a1f3d7ac (+  48) 80046020   <kernel_x86> context_switch(thread*: 0x9fd1d800, thread*: 0x9fd2e000) + 0x003c
 1 a1f3d7dc (+  64) 800462cc   <kernel_x86>:scheduler_reschedule + 0x0264
 2 a1f3d81c (+  64) 80047660   <kernel_x86>:switch_sem_etc + 0x036c
 3 a1f3d85c (+  64) 800472c6   <kernel_x86>:acquire_sem_etc + 0x0026
 4 a1f3d89c (+  80) 80044802   <kernel_x86>:read_port_etc + 0x0152
 5 a1f3d8ec (+1424) 800621be   <kernel_x86> debug_debugger_message::thread_hit_debug_event_internal(0x8, int32: -1577853208, true, 0x8006d801) + 0x02fa
 6 a1f3de7c (+  64) 80062433   <kernel_x86> debug_debugger_message::thread_hit_debug_event(0x8, int32: -1577853208, true) + 0x002f
 7 a1f3debc (+  80) 8006261f   <kernel_x86>:user_debug_exception_occurred + 0x005b
 8 a1f3df0c (+  80) 800bbad0   <kernel_x86>:vm_page_fault + 0x0314
 9 a1f3df5c (+  64) 800cb655   <kernel_x86>:page_fault_exception + 0x00d9
10 a1f3df9c (+  12) 800cecbf   <kernel_x86>:int_bottom_user + 0x005a
user iframe at 0xa1f3dfa8 (end = 0xa1f3e000)
 eax 0x0            ebx 0x25fe04        ecx 0x7ffeed10   edx 0x2
 esi 0x7ffeea80     edi 0x20cc88        ebp 0x7ffeea84   esp 0xa1f3dfdc
 eip 0xb7d6      eflags 0x210213   user esp 0x7ffeea48
 vector: 0xe, error code: 0x4
11 a1f3dfa8 (+   0) 0000b7d6
12 7ffeea84 (+ 712) 7ffeeaa0   6533:sshd_main_stack@0x7efef000 + 0xfffaa0
13 7ffeed4c (+ 224) 0020e4f1   </boot/home/config/bin/sshd@0x00200000>:unknown + 0xe4f1
14 7ffeee2c (+ 336) 0020fbe2   </boot/home/config/bin/sshd@0x00200000>:unknown + 0xfbe2
15 7ffeef7c (+  48) 0020c9af   </boot/home/config/bin/sshd@0x00200000>:unknown + 0xc9af
16 7ffeefac (+  48) 001008ea   </boot/beos/system/runtime_loader@0x00100000>:unknown + 0x08ea
17 7ffeefdc (+   0) 7ffeefec   6533:sshd_main_stack@0x7efef000 + 0xffffec

For some reason, I have no trace entries whatsoever for that team though, so the only other useful thing I can do is dump memory at the stack address:

kdebug> in_context 264 dw 0x7ffeeaa0 32
[0x7ffeeaa0]  ....9.........T$   b80cc483 00000039 000000b9 24548d00
[0x7ffeeab0]  .Íc....... ....   c363cd04 ffff0104 00200217 8000000a
[0x7ffeeac0]  .í......Üßó¡Lí..   7ffeed10 00000000 a1f3dfdc 7ffeed4c
[0x7ffeead0]  .í.......@......   7ffeed10 ffff037f ffff4000 ffffffff
[0x7ffeeae0]  íVK...}.=.p#...   004b56ed 077d001b 70143d84 ffff0023
[0x7ffeeaf0]  Ú...............   0d0d0bda 00000000 0000ffff 00000000
[0x7ffeeb00]  ................   ffff0000 00000000 00000000 0000ffff
[0x7ffeeb10]  ................   00000000 ffff0000 00000000 00000000
kdebug>

Hope this helps in some way.

comment:34 Changed 11 years ago by anevilyak

Interestingly, something recently seems to have broken sshd entirely here: It now no longer appears to sense a socket connection at all for some reason, at least sshd -ddd doesn't indicate having picked up an inbound connection at all. Could this possibly be due to the recent package updates? I can try to binary search and find a responsible revision if it'd help.

comment:35 Changed 10 years ago by anevilyak

And now it works again, though I have no idea which change might have unbroken it. Not sure if the segfault still happens or not, will let you know.

comment:36 Changed 9 years ago by bonefish

Status: in-progressassigned

Not working on this one.

comment:37 Changed 9 years ago by anevilyak

Out of curiosity, given that this only seemed to happen (for me) after days of uptime, is it at all possible that it was a victim of that NewOS-originated VM bug you recently fixed? I can try and set up another extended uptime test to see if it's still reproducible.

comment:38 in reply to:  37 ; Changed 9 years ago by bonefish

Replying to anevilyak:

Out of curiosity, given that this only seemed to happen (for me) after days of uptime, is it at all possible that it was a victim of that NewOS-originated VM bug you recently fixed? I can try and set up another extended uptime test to see if it's still reproducible.

If you mean hrev36523 -- that would be very unlikely (probably even impossible) in single-threaded teams. Since a few things have been fixed/rewritten in the meantime, it is possible that this bug has been fixed, too.

comment:39 in reply to:  38 Changed 9 years ago by anevilyak

Replying to bonefish:

If you mean hrev36523 -- that would be very unlikely (probably even impossible) in single-threaded teams. Since a few things have been fixed/rewritten in the meantime, it is possible that this bug has been fixed, too.

Will report back in a few days with results then.

comment:40 Changed 9 years ago by anevilyak

Resolution: fixed
Status: assignedclosed

No longer reproducible, closing.

Note: See TracTickets for help on using tickets.