Opened 16 years ago
Closed 15 years ago
#2598 closed bug (fixed)
sshd occasionally segfaults on connect (possibly when forking)
Reported by: | anevilyak | Owned by: | bonefish
---|---|---|---
Priority: | normal | Milestone: | R1
Component: | System/Kernel | Version: | R1/pre-alpha1
Keywords: | | Cc: |
Blocked By: | | Blocking: |
Platform: | All | |
Description
Once in a while when connecting via ssh, the sshd segfaults. However, neither gdb nor debug_server seems able to make heads or tails of it: gdb displays several hundred stack entries, all of which are ??, followed by
Previous frame inner to this frame (corrupt stack?)
debug_server, meanwhile, logs the following to syslog:
KERN: vm_soft_fault: va 0xb000 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xb7d6, ip 0xb7d6, write 0, user 1, thread 0xe3
KERN: vm_page_fault: thread "sshd" (227) in team "sshd" (227) tried to read address 0xb7d6, ip 0xb7d6 ("???" +0xb7d6)
KERN: debug_server: TeamDebugHandler::Init(): Failed to get executable path of team 227: No such file or directory
KERN: debug_server: Thread 227 entered the debugger: Segment violation
KERN: stack trace, current PC 0xb7d6:
KERN: (0x7ffeea84) 0x7ffeeaa0 (sshd_main_stack + 0xfffaa0)
KERN: (0x7ffeed4c) 0x20e4f1 (sshd_seg0ro + 0xe4f1)
KERN: (0x7ffeee2c) 0x20fbe2 (sshd_seg0ro + 0xfbe2)
KERN: (0x7ffeef7c) 0x20c9af (sshd_seg0ro + 0xc9af)
KERN: (0x7ffeefac) 0x1008ea (runtime_loader_seg0ro + 0x8ea)
Attachments (1)
Change History (41)
comment:1 by , 16 years ago
Owner: | changed from to
---|---
comment:2 by , 16 years ago
Unfortunately the ssh executable seems to be stripped, so the stack trace isn't telling us that much. "sshd_seg0ro + 0xe4f1" is the return address of a select() call. Since select() basically only calls the syscall, nothing can really go wrong there. The IP definitely shouldn't point into the stack. My best guess is that a signal occurred during the syscall, a signal handler was (or should have been) invoked, and something went wrong along the way.
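For reference, the suspected pattern boils down to something like the following minimal, self-contained sketch (my illustration of the theory, not sshd's actual code; on Haiku the interrupted select() shows up in the syscall tracing as B_INTERRUPTED, 0x8000000a):

    #include <errno.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/select.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_sigchld = 0;

    static void on_sigchld(int sig)
    {
        (void)sig;
        got_sigchld = 1;    /* runs on the interrupted thread's user stack */
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_sigchld;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = 0;    /* no SA_RESTART: let select() fail with EINTR */
        sigaction(SIGCHLD, &sa, NULL);

        if (fork() == 0) {
            sleep(1);       /* give the parent time to enter select() */
            _exit(0);       /* child exits -> SIGCHLD interrupts the parent */
        }

        fd_set fds;
        FD_ZERO(&fds);
        FD_SET(STDIN_FILENO, &fds);

        /* select() blocks in the syscall; SIGCHLD interrupts it, the
         * handler is invoked, and select() returns -1 with EINTR. */
        if (select(STDIN_FILENO + 1, &fds, NULL, NULL, NULL) < 0
                && errno == EINTR)
            printf("select() interrupted, handler ran: %d\n", (int)got_sigchld);

        while (waitpid(-1, NULL, WNOHANG) > 0)
            ;               /* reap the child */
        return 0;
    }

If the kernel sets up the handler invocation, or the return from it, incorrectly, the crash would happen exactly in that window between the two select() trace entries.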
You don't happen to have syscall and signal tracing enabled, by any chance? That would at least help to verify the theory.
The description reads like this bug is not really new. Do you have an estimate of when you first encountered it?
comment:3 by , 16 years ago
Honestly, it's been happening for at least two months now; it's just that the last time (when I mentioned it to you in email, since I didn't think the completely bogus trace from gdb would be helpful) I didn't know about the trace the debug_server puts out. If it helps, it does seem to occur more frequently if I do something that involves sending/receiving a large amount of data across a previous session, log out, and then log back in, for instance after doing an svn checkout from a session. It never occurs in an already active session, only during login. I'm relatively certain I always leave syscall/signal tracing enabled, so I'll check the trace output next time.
follow-up: 7 comment:4 by , 16 years ago
I'm curious, by the way: what's the "KERN: debug_server: TeamDebugHandler::Init(): Failed to get executable path of team 227: No such file or directory" error all about? Is that also because of the stripping, or is it something else?
comment:5 by , 16 years ago
Caught it again, trace output as follows:
kdebug> traced 0 30 -1 filter team 195
4014894. [ 195] 16916220581: syscall post: _kern_select() -> 0x8000000a
4014895. [ 195] 16916220582: signal handle: 0x10
4014896. [ 195] 16916220582: signal exec handler: signal: 5, handler: 0x0020cc88
4014984. [ 195] 16916248011: signal send: target: 64907, signal: 12 (CONT), flags: 0x20000
printed 4 entries within range 1 to 4100900 (4100900 of 4100900 total, 135147852 ever)
kdebug>
comment:7 by , 16 years ago
Status: | new → assigned
---|---
Replying to anevilyak:
> I'm curious, by the way: what's the "KERN: debug_server: TeamDebugHandler::Init(): Failed to get executable path of team 227: No such file or directory" error all about? Is that also because of the stripping, or is it something else?
No, I think that's just another bug. When sshd is running, listimage doesn't show any images for it (through which the debug server normally gets the executable name). Maybe something we fail to do on fork(). Needs looking into.
Replying to anevilyak:
> 64907 btw is the "team 195 debug task" thread.
Yeah, that happens when the debugger is installed for the team.
Anyway, it looks as suspected: the thread gets a SIGCHLD (probably after a fork()) while doing a select(), the signal handler is invoked, and the thread crashes. 28 ms pass, though, which seems like quite a bit of time; but maybe the system was a bit busy.
Looking into the ssh sources (particularly how the signal handler is installed and what it does) might yield some new leads. Will do that.
comment:8 by , 16 years ago
It shouldn't have been busy; the system was idle at the time, though it's possible the page writer was still writing things to disk, as it had just finished a jam -aq.
comment:9 by , 16 years ago
Component: | Applications/Command Line Tools → System/Kernel
---|---
The SIGCHLD handlers in the OpenSSH code look rather unsuspicious. So this is most likely a kernel issue. Not sure what could cause it, though.
comment:10 by , 16 years ago
I can reproduce it pretty reliably here... is there any extra information/tracing I can enable that would help? For reference, it happens quite consistently if I do the following steps:
- log into ssh remotely.
- . /system/boot/SetupEnvironment
- svn+ssh checkout of haiku sources
- configure --use-gcc-pipe --include-3rdparty
- jam -q
- log out / log back in -> crash.
Unfortunately these are somewhat time-consuming steps, but I haven't yet managed to come up with a smaller test case. A checkout alone didn't do it, anyway.
comment:11 by , 16 years ago
I just realized something: why is it receiving a SIGCHLD at that point in the first place? I thought that signal is generally used to indicate that a child exited, whereas if this crash happens on login, one would expect a child to have just been created, if I understand the way sshd works correctly.
comment:12 by , 16 years ago
In Haiku SIGCHLD is also sent when a child is stopped (in other OSes also when a child is continued), though that probably doesn't happen in this case.
The ssh connection initiation is somewhat complex, which, I believe, has to do with how the privilege separation works. According to the team tracing, the daemon forks and re-executes. This new process forks off a child, which exits a bit later (probably after authentication), and finally forks off another child executing the shell. So the exiting child seems to be normal.
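My reading of that sequence, as a runnable toy program (the structure and names are mine, not OpenSSH's actual code; the re-exec step is elided):

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        pid_t conn = fork();            /* per-connection process */
        if (conn == 0) {
            pid_t auth = fork();        /* privsep child... */
            if (auth == 0)
                _exit(0);               /* ...exits once authentication is done */
            waitpid(auth, NULL, 0);     /* the parent gets a SIGCHLD here */

            pid_t shell = fork();       /* finally, the user's shell */
            if (shell == 0) {
                execl("/bin/sh", "sh", "-c", "true", (char *)NULL);
                _exit(127);
            }
            waitpid(shell, NULL, 0);
            _exit(0);
        }
        waitpid(conn, NULL, 0);
        printf("teardown: all children reaped\n");
        return 0;
    }

The SIGCHLD the daemon receives while sitting in select() would correspond to one of these children exiting.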
ATM I've no clue how things can go wrong. The signal handler should execute several syscalls, which should show up in the tracing, but they don't. So one has to assume the signal handler is somehow set up incorrectly. An incorrect stack wouldn't explain why no syscall shows up, though, since as long as at least the eip is set correctly, the correct code should be executed. I'll probably have to add more debugging facilities to understand this better.
comment:13 by , 16 years ago
Understood. Is there any way you could build a non-stripped set of OpenSSH binaries? That might at least yield some hints as to what calls were being made at the time. Any other debug tests you'd suggest I could run in the meantime? Maybe turning on the extra debug logging options in sshd? Please let me know what I can do to help track this one down :)
comment:14 by , 16 years ago
Hi Ingo,
I tried this test again, but using sshd -ddd, and got slightly different but still interesting behavior. This time it didn't actually segfault, but sshd did somehow completely exit after I logged out of the session. The last few lines of debugging output from sshd:
debug2: channel 0: rcvd adjust 32778
debug1: server_input_channel_req: channel 0 request window-change reply 0
debug1: session_by_channel: session 0 channel 0
debug1: session_input_channel_req: session 0 req window-change
debug1: server_input_channel_req: channel 0 request window-change reply 0
debug1: session_by_channel: session 0 channel 0
debug1: session_input_channel_req: session 0 req window-change
debug1: Received SIGCHLD.
debug1: session_by_pid: pid 231
debug1: session_exit_message: session 0 channel 0 pid 231
debug2: channel 0: request exit-status confirm 0
debug1: session_exit_message: release channel 0
debug2: channel 0: write failed
debug2: channel 0: close_write
debug2: channel 0: output open -> closed
debug1: session_pty_cleanup: session 0 release /dev/tt/p1
debug2: channel 0: read<=0 rfd 7 len 0
debug2: channel 0: read failed
debug2: channel 0: close_read
debug2: channel 0: input open -> drain
debug2: channel 0: ibuf empty
debug2: channel 0: send eof
debug2: channel 0: input drain -> closed
debug2: channel 0: send close
debug2: notify_done: reading
debug3: channel 0: will not send data after close
debug2: channel 0: rcvd close
debug3: channel 0: will not send data after close
debug2: channel 0: is dead
debug2: channel 0: gc: notify user
debug1: session_by_channel: session 0 channel 0
debug1: session_close_by_channel: channel 0 child 0
debug1: session_close: session 0 pid 0
debug2: channel 0: gc: user detached
debug2: channel 0: is dead
debug2: channel 0: garbage collecting
debug1: channel 0: free: server-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 server-session (t4 r0 i3/0 o3/0 fd -1/-1 cfd -1)
debug3: channel 0: close_fds r -1 w -1 e -1 c -1
Connection closed by 192.168.1.126
debug1: do_cleanup
Closing connection to 192.168.1.126
comment:15 by , 16 years ago
Possibly related, possibly a different bug, but in revisions of the last week or so I've been seeing some new behavior: on large data transfers (read: > 1GB or so), sshd keeps spontaneously terminating itself with the following error message:
Disconnecting: Corrupted MAC on input.
I haven't seen any changes to the network stack that would influence something like this, and I didn't think the I/O scheduler framework was in any way involved, so this one confuses me.
comment:16 by , 16 years ago
Confirmed, that appears to have been caused by 26980/26981. Will file a new bug.
follow-up: 18 comment:17 by , 16 years ago
Since the segfault problem still remains, though... have the patches for OpenSSH to build/work with Haiku been submitted upstream? If so, I could pull the source and run the test suites and such myself to see if I can help track this down.
comment:18 by , 16 years ago
Replying to anevilyak:
> Since the segfault problem still remains, though... have the patches for OpenSSH to build/work with Haiku been submitted upstream? If so, I could pull the source and run the test suites and such myself to see if I can help track this down.
Nope, the patches are available in our repository in buildtools/patches/openssh/, though.
BTW, weird: I definitely wrote a reply to your comment 6 days ago. I wonder what happened. Anyway, I was only saying that I don't think the problems are related -- but it seems you found the causing revision already -- and that the SSH sources don't really look helpful at the place where the error message is emitted.
follow-up: 20 comment:19 by , 16 years ago
What does the kernel print when it enters KDL? (The first 20 lines should do.)
Dumping the stack content, perhaps +/- 100 bytes around esp, could give some info, too.
comment:20 by , 16 years ago
Replying to marcusoverhagen:
> What does the kernel print when it enters KDL? (The first 20 lines should do.)
> Dumping the stack content, perhaps +/- 100 bytes around esp, could give some info, too.
Do you mean jumping into KDL when sshd segfaults and then asking for the stack trace of the sshd thread, or something else? In any case, I won't be able to check for a few days; I'm in the middle of moving at the moment, so all my computers are packed.
Regards,
Rene
comment:21 by , 16 years ago
OK, so: a non-stripped binary behaves exactly the same way (a 165-line backtrace, most of which are ??'s). However, I do notice that every single time it crashes, the trace starts in select() in one of the earlier frames. I seem to remember mmu_man hitting some problems in select() in netsurf as well, though. The stack trace is oddly filled with addresses like 0xcccccccc and 0xffffffff, which to me look like malloc-debugging-related fill patterns.
comment:22 by , 16 years ago
Also, the syscall/signal tracing looks identical to that posted in a previous comment: select() return, SIGCHLD, and all.
comment:23 by , 16 years ago
I believe the gdb stack trace is bad due to a missing feature: the handling of signal trampoline code is currently not implemented, so gdb probably restores the stack frame's registers incorrectly, thus totally screwing things up later. The debug_server stack trace should still be fine. The stack trace from the kernel debugger won't be able to resolve static symbols, but it will also include all register values at the point of the crash.
Contrary to what I wrote earlier, the stack address in the return addresses of the stack trace is totally fine. When a signal handler is invoked, the kernel copies a bit of code onto the stack that will be executed when the signal handler returns. The stack trace suggests that the stack frame of the signal handler function still exists, i.e. it hasn't returned yet.
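This is easy to see from userland, by the way. A minimal sketch (assuming a GCC-style compiler; on systems that map the trampoline somewhere other than the user stack, the printed address simply won't fall inside the stack range):

    #include <signal.h>
    #include <stdio.h>

    static void handler(int sig)
    {
        (void)sig;
        /* On systems that copy the signal trampoline onto the user stack,
         * the handler's return address is a stack address -- exactly what
         * the debug_server backtrace above shows in its first frame. */
        int local;
        printf("handler returns to %p, a stack variable lives at %p\n",
               __builtin_return_address(0), (void *)&local);
    }

    int main(void)
    {
        signal(SIGUSR1, handler);
        raise(SIGUSR1);
        return 0;
    }

(printf() from a signal handler isn't async-signal-safe in general; it's tolerable here only because the signal is raised synchronously.)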
You could do the following to investigate this further:
- Add a few ktrace_printf()s in the signal handler function (a sketch follows this list). Hopefully this helps to understand where it crashes.
- Get a stack trace from KDL. Maybe knowing the registers helps somehow.
- Disassemble the code at the position of the signal handler. The (signal) kernel tracing gives you the address.

    in_context <thread ID> dis <address>

  will do the trick. This should help to rule out that someone scrambled the memory.
- Dump a bit of the stack, when in doubt up to the address of the code on the stack (0x7ffeeaa0 in the above stack trace), e.g.

    in_context <thread ID> dw <address> 32
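For the first item, what I have in mind is roughly this (the handler name and body are placeholders; only the ktrace_printf() calls are the point, and the declaration is spelled out in case no header is handy):

    /* hypothetical instrumentation of sshd's SIGCHLD handler */
    extern void ktrace_printf(const char *format, ...);

    static void sigchld_handler(int sig)
    {
        ktrace_printf("sigchld_handler: entered, sig %d", sig);
        /* ...original handler body: waitpid() loop, errno save/restore... */
        ktrace_printf("sigchld_handler: leaving");
    }

The output goes into the kernel tracing buffer, so it should show up interleaved with the syscall/signal entries in "traced".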
Good luck!
comment:24 by , 16 years ago
Thanks for the tips, will see what I can find out with that. Unfortunately I lost the stack trace for this last run in syslog, because mine is flooded with debug messages from the page writer (perhaps unintentionally left enabled?).
comment:25 by , 16 years ago
OK, got the crash again; trace from syslog as follows (note: this time it happened in mid data transfer, not while forking a new child connection):
USER: Disconnecting: Bad packet length 905182364.
KERN: vm_soft_fault: va 0xb2000 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0xb2ee0, ip 0x40f159, write 0, user 1, thread 0x14dec
KERN: vm_page_fault: thread "sshd" (85484) in team "sshd" (85484) tried to read address 0xb2ee0, ip 0x40f159 ("libroot.so_seg0ro" +0x33159)
KERN: stack trace:
KERN: 0x0040fbaa (libroot.so_seg0ro + 0x33baa)
KERN: 0x004112e1 (libroot.so_seg0ro + 0x352e1)
KERN: 0x0023b2ca (sshd_seg0ro + 0x3b2ca)
KERN: 0x0020e457 (sshd_seg0ro + 0xe457)
KERN: 0x0020fc63 (sshd_seg0ro + 0xfc63)
KERN: 0x0020ca1f (sshd_seg0ro + 0xca1f)
KERN: 0x001008ea (runtime_loader_seg0ro + 0x8ea)
KERN: 0x7ffeefec (sshd_main_stack + 0xffffec)
KERN: vm_soft_fault: va 0x0 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x0, ip 0x800cde34, write 0, user 0, thread 0x14dec
KERN: debug_server: Thread 85484 entered the debugger: Segment violation
KERN: stack trace, current PC 0x40f159 freeBlock__Q28BPrivate9hoardHeapRPQ28BPrivate5blockRPQ28BPrivate10superblockiPQ28BPrivate11processHeap + 0x879:
KERN: (0x7ffeec9c) 0x40fbaa free__Q28BPrivate11processHeapPv + 0x1ee
KERN: (0x7ffeecec) 0x4112e1 free + 0xb1
KERN: (0x7ffeed1c) 0x23b2ca xfree + 0x36
KERN: (0x7ffeed4c) 0x20e457 server_accept_loop + 0x143
KERN: (0x7ffeee1c) 0x20fc63 main + 0x1137
KERN: (0x7ffeef7c) 0x20ca1f _start + 0x5b
KERN: (0x7ffeefac) 0x1008ea runtime_loader + 0x126
comment:26 by , 16 years ago
Kernel stack trace:
stack trace for thread 85484 "sshd"
    kernel stack: 0x801b5000 to 0x801b9000
      user stack: 0x7efef000 to 0x7ffef000
frame            caller     <image>:function + offset
 0 801b87ac (+  48) 80045ef1   <kernel_x86>:context_switch__FP6threadT0 + 0x0039
 1 801b87dc (+  64) 80046170   <kernel_x86>:scheduler_reschedule + 0x0248
 2 801b881c (+  64) 80047504   <kernel_x86>:switch_sem_etc + 0x036c
 3 801b885c (+  64) 8004716a   <kernel_x86>:acquire_sem_etc + 0x0026
 4 801b889c (+  80) 800446d6   <kernel_x86>:read_port_etc + 0x0152
 5 801b88ec (+1424) 80061f4a   <kernel_x86>:thread_hit_debug_event_internal__F22debug_debugger_messagePCvlbRb + 0x02fa
 6 801b8e7c (+  64) 800621bf   <kernel_x86>:thread_hit_debug_event__F22debug_debugger_messagePCvlb + 0x002f
 7 801b8ebc (+  80) 800623ab   <kernel_x86>:user_debug_exception_occurred + 0x005b
 8 801b8f0c (+  80) 800bb374   <kernel_x86>:vm_page_fault + 0x0314
 9 801b8f5c (+  64) 800cab15   <kernel_x86>:page_fault_exception + 0x00d9
10 801b8f9c (+  12) 800ce1cf   <kernel_x86>:int_bottom_user + 0x005a
user iframe at 0x801b8fa8 (end = 0x801b9000)
 eax 0xb2ee0     ebx 0x48d4fc    ecx 0x180015a0  edx 0x180015a0
 esi 0x0         edi 0x1802f928  ebp 0x7ffeec9c  esp 0x801b8fdc
 eip 0x40f159    eflags 0x210297
 user esp 0x7ffeec64
 vector: 0xe, error code: 0x4
11 801b8fa8 (+   0) 0040f159   </boot/beos/system/lib/libroot.so@0x003dc000>:unknown + 0x33159
12 7ffeec9c (+  80) 0040fbaa   </boot/beos/system/lib/libroot.so@0x003dc000>:unknown + 0x33baa
13 7ffeecec (+  48) 004112e1   </boot/beos/system/lib/libroot.so@0x003dc000>:unknown + 0x352e1
14 7ffeed1c (+  48) 0023b2ca   </boot/home/config/bin/sshd@0x00200000>:unknown + 0x3b2ca
15 7ffeed4c (+ 208) 0020e457   </boot/home/config/bin/sshd@0x00200000>:unknown + 0xe457
16 7ffeee1c (+ 352) 0020fc63   </boot/home/config/bin/sshd@0x00200000>:unknown + 0xfc63
17 7ffeef7c (+  48) 0020ca1f   </boot/home/config/bin/sshd@0x00200000>:unknown + 0xca1f
18 7ffeefac (+  48) 001008ea   </boot/beos/system/runtime_loader@0x00100000>:unknown + 0x08ea
19 7ffeefdc (+   0) 7ffeefec   2599177:sshd_main_stack@0x7efef000 + 0xffffec
traced 0 30 -1 filter team 85484 turns up nothing in this case, oddly enough, so I'm not sure where to start disassembling.
comment:27 by , 16 years ago
In case it's at all useful, disassembly at the program counter given in debug_server's trace:
kdebug> in_context 85484 dis 0x40f159
0x0040f159: 8b34b0         mov (%eax,%esi,4), %esi
0x0040f15c: 85f6           test %esi, %esi
0x0040f15e: 7522           jnz 0x40f182
0x0040f160: 8d838433feff   lea 0xfffe3384(%ebx), %eax
0x0040f166: 50             push %eax
0x0040f167: 68a6010000     push $0x1a6
0x0040f16c: 8d834430feff   lea 0xfffe3044(%ebx), %eax
0x0040f172: 50             push %eax
0x0040f173: 8d83b333feff   lea 0xfffe33b3(%ebx), %eax
0x0040f179: 50             push %eax
comment:28 by , 16 years ago
I also just noticed that in this particular case the backtrace was accessible from gdb:
(gdb) bt
#0  0x0040f159 in BPrivate::hoardHeap::freeBlock () from /boot/beos/system/lib/libroot.so
#1  0x0040fbaa in BPrivate::processHeap::free () from /boot/beos/system/lib/libroot.so
#2  0x004112e1 in free () from /boot/beos/system/lib/libroot.so
#3  0x0023b2ca in xfree (ptr=0x18031918) at xmalloc.c:81
#4  0x0020e457 in server_accept_loop (sock_in=0x7ffeef24, sock_out=0x7ffeef28, newsock=0x7ffeef20, config_s=0x7ffeef2c) at sshd.c:1058
#5  0x0020fc63 in main (ac=1, av=0x1800e390) at sshd.c:1651
comment:29 by , 16 years ago
Got it with a new connection this time, with the same stack trace as in the previous case involving hoard. This time I actually have tracing info, as follows:
traced 0 30 -1 filter team 287
3774754. [ 287] 2703350606: syscall post: _kern_select() -> 0x8000000a
3774755. [ 287] 2703350607: signal handle: 0x10
3774756. [ 287] 2703350607: signal exec handler: signal: 5, handler: 0x0020ccf8
3774757. [ 287] 2703350630: syscall pre: _kern_wait_for_child(0xffffffff, 0x1, 0x7ffeea2c, 0x7ffeea30)
3774758. [ 287] 2703350636: syscall post: _kern_wait_for_child() -> 0x120
3774759. [ 287] 2703350637: syscall pre: _kern_wait_for_child(0xffffffff, 0x1, 0x7ffeea2c, 0x7ffeea30)
3774760. [ 287] 2703350637: syscall post: _kern_wait_for_child() -> 0x80007002
3774761. [ 287] 2703350645: syscall pre: _kern_sigaction(0x5, 0x0, 0x7ffeea14)
3774762. [ 287] 2703350646: syscall post: _kern_sigaction() -> 0x0
3774763. [ 287] 2703350647: syscall pre: _kern_restore_signal_frame()
3774764. [ 287] 2703350649: syscall post: _kern_select() -> 0x8000000a
3774973. [ 287] 2703423602: signal send: target: 23132, signal: 12 (CONT), flags: 0x20000
printed 12 entries within range 1 to 3958033 (3958033 of 3958033 total, 26863044 ever)
Assuming I correctly understood which address you want disassembly from, the output is as follows:
kdebug> in_context 287 dis 0x0020ccf8
0x0020ccf8: 55             push %ebp
0x0020ccf9: 89e5           mov %esp, %ebp
0x0020ccfb: 83ec1c         sub $0x1c, %esp
0x0020ccfe: 57             push %edi
0x0020ccff: 56             push %esi
0x0020cd00: 53             push %ebx
0x0020cd01: e800000000     call 0x20cd06
0x0020cd06: 5b             pop %ebx
0x0020cd07: 81c3c6370500   add $0x537c6, %ebx
0x0020cd0d: e832f7ffff     call 0x20c444
kdebug>
The last address on the stack in this case was 0x7ffeec9c; dw dump:
kdebug> in_context 287 dw 0x7ffeec9c 32
[0x7ffeec9c]  ìì..ª.@.d+...ì..  7ffeecec 0040fbaa 18002b64 7ffeece4
[0x7ffeecac]  èì..........3..   7ffeece8 00000000 18000000 80060633
[0x7ffeecbc]  =@
[0x7ffeeccc]  .....ÔH.....s...  00000001 0048d4fc 00000001 00000073
[0x7ffeecdc]  ............(...  ffff0104 ffffffff 18031910 1802f928
[0x7ffeecec]  .í..á.A.........  7ffeed1c 004112e1 18000000 18031918
[0x7ffeecfc]  ÌÌÌÌ=.A.Ì.&.....  cccccccc 0041123d 002604cc 18031918
[0x7ffeed0c]  ................  00000000 00000000 ffffffff ffffffff
Leaving the box in the kernel debugger for the time being, in case further information is helpful. Note that at this point I can pretty much reproduce the crash 100% of the time by simply ssh'ing into the box via PuTTY from my Windows laptop and running jam -aq on the Haiku tree. Either late in the build process or on logging out and back in after the build completes, it will crash every time.
follow-up: 31 comment:30 by , 16 years ago
Unfortunately this all looks completely different from the earlier cases. The program no longer crashes with an invalid PC address while in the signal handler, but somewhere in the memory allocator. The signal handler has already finished, according to the kernel tracing. The "dw" output looks a bit scrambled; not that I'd know what to look for in this new situation.
If the crashes continue to happen in the memory allocator, you could compile libroot with debug info, copy the sources over (at least the ones from the allocator (src/system/libroot/posix/malloc)), and have a closer look with gdb.
It would also be interesting to know why the behavior has changed now. Did you do the previous test before hrev27925? That's at least something that changed in the allocator. If not, setting kHeapIncrement to B_PAGE_SIZE (in .../malloc/arch_specific.cpp) might bring back the old behavior. Depending on whether a gdb session turns something up with the current behavior, it might be easier to analyze the old one, since it crashes harder and probably earlier.
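That is, something along these lines in arch_specific.cpp (the exact declaration in the tree may differ; this is just the intended change):

    // hypothetical: shrink the heap growth step back to a single page
    static const size_t kHeapIncrement = B_PAGE_SIZE;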
comment:31 by , 16 years ago
Replying to bonefish:
> It would also be interesting to know why the behavior has changed now. Did you do the previous test before hrev27925? That's at least something that changed in the allocator. If not, setting kHeapIncrement to B_PAGE_SIZE (in .../malloc/arch_specific.cpp) might bring back the old behavior. Depending on whether a gdb session turns something up with the current behavior, it might be easier to analyze the old one, since it crashes harder and probably earlier.
That's actually possible; I seem to remember I was still getting the old behavior only a few days ago. Crashing in flight like it does now is new to me, though; it used to crash only on a new connection, never in mid data transfer. Though I think mmlr said he sometimes saw problems like that on his C2Q. Will try changing the heap increment tonight and see what happens.
comment:32 by , 16 years ago
Hm... changing the heap increment to B_PAGE_SIZE seems to have made no difference at all; it's still consistently crashing in hoard. Will try rolling back commits a bit and see where the difference lies.
comment:33 by , 16 years ago
Finally hit this crash again with a current build. Oddly, I've confirmed one thing: it only happens with a stripped build; the one with symbols results in the allocator crash every single time.
In any case, backtrace of sshd from the kernel:
stack trace for thread 264 "sshd"
    kernel stack: 0xa1f3a000 to 0xa1f3e000
      user stack: 0x7efef000 to 0x7ffef000
frame            caller     <image>:function + offset
 0 a1f3d7ac (+  48) 80046020   <kernel_x86> context_switch(thread*: 0x9fd1d800, thread*: 0x9fd2e000) + 0x003c
 1 a1f3d7dc (+  64) 800462cc   <kernel_x86>:scheduler_reschedule + 0x0264
 2 a1f3d81c (+  64) 80047660   <kernel_x86>:switch_sem_etc + 0x036c
 3 a1f3d85c (+  64) 800472c6   <kernel_x86>:acquire_sem_etc + 0x0026
 4 a1f3d89c (+  80) 80044802   <kernel_x86>:read_port_etc + 0x0152
 5 a1f3d8ec (+1424) 800621be   <kernel_x86> debug_debugger_message::thread_hit_debug_event_internal(0x8, int32: -1577853208, true, 0x8006d801) + 0x02fa
 6 a1f3de7c (+  64) 80062433   <kernel_x86> debug_debugger_message::thread_hit_debug_event(0x8, int32: -1577853208, true) + 0x002f
 7 a1f3debc (+  80) 8006261f   <kernel_x86>:user_debug_exception_occurred + 0x005b
 8 a1f3df0c (+  80) 800bbad0   <kernel_x86>:vm_page_fault + 0x0314
 9 a1f3df5c (+  64) 800cb655   <kernel_x86>:page_fault_exception + 0x00d9
10 a1f3df9c (+  12) 800cecbf   <kernel_x86>:int_bottom_user + 0x005a
user iframe at 0xa1f3dfa8 (end = 0xa1f3e000)
 eax 0x0         ebx 0x25fe04    ecx 0x7ffeed10  edx 0x2
 esi 0x7ffeea80  edi 0x20cc88    ebp 0x7ffeea84  esp 0xa1f3dfdc
 eip 0xb7d6      eflags 0x210213
 user esp 0x7ffeea48
 vector: 0xe, error code: 0x4
11 a1f3dfa8 (+   0) 0000b7d6
12 7ffeea84 (+ 712) 7ffeeaa0   6533:sshd_main_stack@0x7efef000 + 0xfffaa0
13 7ffeed4c (+ 224) 0020e4f1   </boot/home/config/bin/sshd@0x00200000>:unknown + 0xe4f1
14 7ffeee2c (+ 336) 0020fbe2   </boot/home/config/bin/sshd@0x00200000>:unknown + 0xfbe2
15 7ffeef7c (+  48) 0020c9af   </boot/home/config/bin/sshd@0x00200000>:unknown + 0xc9af
16 7ffeefac (+  48) 001008ea   </boot/beos/system/runtime_loader@0x00100000>:unknown + 0x08ea
17 7ffeefdc (+   0) 7ffeefec   6533:sshd_main_stack@0x7efef000 + 0xffffec
For some reason I have no trace entries whatsoever for that team, though, so the only other useful thing I can do is dump memory at the stack address:
kdebug> in_context 264 dw 0x7ffeeaa0 32
[0x7ffeeaa0]  ....9.........T$  b80cc483 00000039 000000b9 24548d00
[0x7ffeeab0]  .Íc....... ....   c363cd04 ffff0104 00200217 8000000a
[0x7ffeeac0]  .í......Üßó¡Lí..  7ffeed10 00000000 a1f3dfdc 7ffeed4c
[0x7ffeead0]  .í.......@......  7ffeed10 ffff037f ffff4000 ffffffff
[0x7ffeeae0]  íVK...}.=.p#...   004b56ed 077d001b 70143d84 ffff0023
[0x7ffeeaf0]  Ú...............  0d0d0bda 00000000 0000ffff 00000000
[0x7ffeeb00]  ................  ffff0000 00000000 00000000 0000ffff
[0x7ffeeb10]  ................  00000000 ffff0000 00000000 00000000
kdebug>
Hope this helps in some way.
comment:34 by , 16 years ago
Interestingly, something recently seems to have broken sshd entirely here: it no longer appears to notice a socket connection at all; at least sshd -ddd doesn't indicate having picked up an inbound connection. Could this possibly be due to the recent package updates? I can try to binary search for the responsible revision if it'd help.
comment:35 by , 16 years ago
And now it works again, though I have no idea which change might have unbroken it. Not sure if the segfault still happens or not; will let you know.
follow-up: 38 comment:37 by , 15 years ago
Out of curiosity, given that this only seemed to happen (for me) after days of uptime, is it at all possible that it was a victim of that NewOS-originated VM bug you recently fixed? I can try and set up another extended uptime test to see if it's still reproducible.
follow-up: 39 comment:38 by , 15 years ago
Replying to anevilyak:
> Out of curiosity, given that this only seemed to happen (for me) after days of uptime, is it at all possible that it was a victim of that NewOS-originated VM bug you recently fixed? I can try and set up another extended uptime test to see if it's still reproducible.
If you mean hrev36523 -- that would be very unlikely (probably even impossible) in single-threaded teams. Since a few things have been fixed/rewritten in the meantime, it is possible that this bug has been fixed, too.
comment:39 by , 15 years ago
Wrong owner.
comment:40 by , 15 years ago
Resolution: | → fixed
---|---
Status: | assigned → closed
No longer reproducible, closing.