Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#12317 closed bug (invalid)

Debugger (silent?) crash, incomplete report

Reported by: ttcoder
Owned by: anevilyak
Priority: normal
Milestone: Unscheduled
Component: Applications/Debugger
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking:
Has a Patch: no
Platform: All

Description

We've been getting forwarded .report files for CC that lack stack crawls. I did not know what to make of it until this week, when, looking through their syslogs, I managed to find a reference to a Debugger problem, which /might/ (I hope :-) be related and explain what is going on: maybe the .report is incomplete because Debugger has silently crashed behind the scenes while generating it?

That particular machine seems to have several problems, including corrupted BFS index, unrelated/unexplained crashes in servers and daemons, so it could be that Debugger is another bystander to another problem, but filing a ticket just in case (there is another ticket about "shortened" .reports too IIRC)

KERN: vm_soft_fault: va 0x50007000 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x50007a47, ip 0xe512d6, write 0, user 1, thread 0xd70
KERN: vm_page_fault: thread "worker" (3440) in team "Debugger" (3439) tried to read address 0x50007a47, ip 0xe512d6 ("libroot.so_seg0ro" +0xaf2d6)
KERN: debug_server: Thread 3440 entered the debugger: Segment violation
KERN: stack trace, current PC 0xe512d6  malloc__Q28BPrivate10threadHeapUl + 0x4ca:
KERN:   (0x719c30c8)  0xe51bd4  malloc + 0x184
KERN:   (0x719c30f8)  0xcb5ec3  _Allocate__7BStringl + 0x27
KERN:   (0x719c3128)  0xcb5ff9  _Clone__7BStringPCcl + 0x21
KERN:   (0x719c3158)  0xcb5fb4  _Init__7BStringPCcl + 0x28
KERN:   (0x719c3188)  0xcb0f12  __7BStringPCc + 0x4a
KERN:   (0x719c31b8)  0xfeca13  GetSymbolInfos__17DebuggerInterfacellRt11BObjectList1Z10SymbolInfo + 0xe7
KERN:   (0x719c3608)  0xfe2ac1  FinishInit__14ImageDebugInfoP17DebuggerInterface + 0x4d
KERN:   (0x719c36a8)  0xfe5f28  LoadImageDebugInfo__13TeamDebugInfoRC9ImageInfoP13LocatableFileR26ImageDebugInfoLoadingStateRP14ImageDebugInfo + 0x1c4
KERN:   (0x719c36e8)  0xffa823  Do__21LoadImageDebugInfoJob + 0x8f
KERN:   (0x719c3768)  0x10830ed  _ProcessJobs__6Worker + 0x1f9
KERN:   (0x719c37a8)  0x1082df9  _WorkerLoop__6Worker + 0x21
KERN:   (0x719c37e8)  0x1082dcf  _WorkerLoopEntry__6WorkerPv + 0x1f
KERN:   (0x719c3818)  0xdd380b  thread_entry + 0x23

Attachments (2)

CommandCenter-201390-debug-07-08-2015-00-19-17.report (22.3 KB) - added by ttcoder 4 years ago.
No backtraces, missing(?) thread in thread list
class_NetworktimeprefletAsAddon_+main_+intentionalcrash.cpp (4.2 KB) - added by ttcoder 4 years ago.
Directly compilable and usable (unlike the original in 12319)


Change History (11)

Changed 4 years ago by ttcoder

No backtraces, missing(?) thread in thread list

comment:1 Changed 4 years ago by ttcoder

A good example of what we're getting from that client and a couple others: the active-threads list seems to be missing the _BMediaRoster_ thread, and none of the other threads have a backtrace/cause of crash. On that particular report global memory usage is unusually huge at 1.5 GB (with CC's heap itself at a perfectly normal 12 MB) but on others it's down to 300 MB and still gets those truncated reports.

comment:2 Changed 4 years ago by diver

Component: - General → Applications/Debugger
Owner: changed from nobody to anevilyak

comment:3 Changed 4 years ago by anevilyak

The overall system symptoms described here sound suspiciously similar to ticket #10279, especially in light of this also being an AMD system as in that ticket. That having been said, this report actually does not appear to be truncated per se. The report generator operates in a strictly sequential fashion, ergo if it had crashed while retrieving the stack trace, the report would have stopped at the thread list, while this one appears to have made it all the way to the end of listing all the semaphores (the last step in report generation). The crash excerpt from syslog as such appears to be unrelated to this particular report, but for what it's worth appears to be heap corruption-related, as it's simply in the process of retrieving the list of symbols for an image there. Without further information there's not really much to be able to analyze there.

With regards to this report specifically: stack traces are only dumped for any thread that is in the team and listed in some form of stopped state. These include unrecoverable exception (the most typical reason for a crash, this covers page fault, divide by zero, etc.), debug assertion (an explicit debugger() call or assert() failure), or simply having been manually debugged while running in the debugger itself. In this particular report, none of the threads appear to be in such a state, hence the absence of a backtrace. Since I'm unfamiliar with CommandCenter however, I'm unable to say if a particular thread is missing in action, but since you say _MediaRoster_ is expected to be there, it would be interesting to know if syslog contains anything of interest with regards to CommandCenter specifically (excerpts like the one posted in this ticket are printed for the crashing team even before entering the debugger itself). I'm afraid I'm at a loss as to why the thread would be gone entirely though, but I don't really see how Debugger itself could be at fault, since the crash state is trapped by the kernel before handing it off to the debug_server to decide what to do with it (debug_server is the one responsible for presenting the usual Terminate, Debug, Save Report dialog and taking the requested action), so this may indicate an issue somewhere along the way there.
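
The rule described above (only threads in some stopped state get a backtrace) can be illustrated with a minimal sketch. The names `StopReason`, `ThreadEntry` and `threadsWithBacktrace()` are hypothetical and chosen for illustration only; they are not Debugger's actual types or functions.

```cpp
#include <string>
#include <vector>

// Hypothetical model of the thread states described above;
// these are not Debugger's real types.
enum class StopReason {
    Running,                 // not stopped: no stack trace is dumped
    UnrecoverableException,  // page fault, divide by zero, etc.
    DebugAssertion,          // explicit debugger() call or assert() failure
    ManuallyStopped          // stopped by hand from within the debugger
};

struct ThreadEntry {
    int id;
    std::string name;
    StopReason reason;
};

// Returns only the threads for which a report would include a backtrace,
// i.e. those listed in some form of stopped state.
std::vector<ThreadEntry> threadsWithBacktrace(
    const std::vector<ThreadEntry>& threads)
{
    std::vector<ThreadEntry> stopped;
    for (const ThreadEntry& thread : threads) {
        if (thread.reason != StopReason::Running)
            stopped.push_back(thread);
    }
    return stopped;
}
```

Under this model, a report whose thread list contains only running threads yields an empty result, which matches a report that reaches the end without any backtrace.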

Furthermore, it may also be worth enabling syslog timestamps in order to be able to more accurately correlate syslog information with the time at which a report was generated, since I'm presuming that the customer isn't necessarily supplying you with this data in real time.

comment:4 Changed 4 years ago by ttcoder

@anevilyak Wondering -- what happens if another, still ongoing, thread of the 'crashed' team brutally calls kill_thread() on the segfaulted thread?

Thinking specifically of ticket:12319, whose kill_thread() call occurs in the 'main' thread, separate from the one that crashes, and occurs after a 2-second delay, presumably even if the team has tripped into Debugger?

When I have time I'll run a quick experiment: make the thread crash reliably by adding a char * ptr = NULL; puts(ptr); statement in the above-mentioned thread code, then wait 2 seconds before clicking "debugger" or "save report", to see if that's enough to reproduce the problems. I'll check:

  • if the thread has indeed disappeared from the list
  • if there's hints of a Debugger segfault in syslog (though that issue seems to deserve a separate ticket)

comment:5 Changed 4 years ago by anevilyak

I'm not entirely certain right now, but I believe a thread that's currently trapped in the kernel debug facilities can't be killed; Axel or Ingo might be able to answer that with more certainty. With regards to the other ticket though, I see no mention of thread kills there, so I'm unclear as to how that one specifically applies. Did you mean a different ticket number?

comment:6 Changed 4 years ago by anevilyak

In any case, with regards to the debugger segfault that you copy/pasted, that one seems highly unlikely to be in any way related, or, as I previously said, your debug report wouldn't have made it to the end at all (and/or you'd have a separate debug report/crash dialog for the debugger itself). Either way, the backtrace on that one is vague enough that there's really no useful information as to what caused it to crash, and given the other generally reported symptoms of the system in question, sounds more like it was a victim of an unrelated problem.

comment:7 Changed 4 years ago by ttcoder

Confirming...

  • I mixed up two separate issues in this ticket: the Debugger segfault is not reproducible here with the steps outlined above.
  • the "disappearing thread" IS reproducible however. Good news!

To undo my mess we could close this ticket and open a clean one for each separate issue; well actually I'm not too interested in the Debugger crash right now as I only ever saw that once, but I'd like to dig on the disappearing thread aspect :-)

Posting the source below (note: it's a modification of the file in #12319, not the exact same file).

Steps to replicate the behavior:

  • compile: gcc class_NetworktimeprefletAsAddon.cpp -lbe
  • run
  • *quickly* click the "debug" button on the alert that pops up when the team crashes: if you look carefully, you will see the Debugger's listview update live, going from 3 lines to 2 lines, removing the line with the thread that is killed
  • alternatively, do not click the "debug" button for 2 seconds; when you then click it, the listview listing threads only lists 2 of them: the main thread and the attached debugger thread.

Discussion (to continue in a new/clean ticket maybe):

  • wouldn't it be better if the kernel "held tight" on the crashed thread, instead of letting it go ?
  • if it isn't, how could Debugger give a hint about what happened ? Seeing a list of threads with no indication of any of them being stopped gives the (false) impression that the Debugger report is spurious and the team did not really crash. We probably don't want that.

EDIT: clearly I was using the wrong tool for the job; it is best to rely on the syslog to get a "snapshot" of the situation at the time of the crash. As a general example, we sometimes have DJs coming in the morning and clicking "create report" for a crash that occurred during the night, and obviously the situation can have changed during the several hours that elapsed. Note to self: the syslog backtrace can be prettified with /bin/c++filt
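
As a rough sketch of how the mangled names could be pulled out of a syslog excerpt before handing them to c++filt, the helper below parses one stack-frame line of the form shown in the ticket description. `SyslogFrame` and `parseSyslogFrame()` are hypothetical names, and the line layout is assumed from the excerpt above rather than from any documented format. Whether c++filt can demangle a given symbol depends on the toolchain that produced it.

```cpp
#include <regex>
#include <string>

// One frame of a syslog stack trace, e.g.
// "KERN:   (0x719c30c8)  0xe51bd4  malloc + 0x184"
struct SyslogFrame {
    std::string framePointer;  // e.g. "0x719c30c8"
    std::string address;       // e.g. "0xe51bd4"
    std::string symbol;        // mangled name, e.g. "_Allocate__7BStringl"
    std::string offset;        // e.g. "0x184"
};

// Hypothetical helper: fills 'frame' and returns true if 'line' is a
// stack-frame line in the layout shown above, returns false otherwise.
bool parseSyslogFrame(const std::string& line, SyslogFrame& frame)
{
    static const std::regex pattern(
        R"(^KERN:\s+\((0x[0-9a-fA-F]+)\)\s+(0x[0-9a-fA-F]+)\s+(\S+) \+ (0x[0-9a-fA-F]+)\s*$)");

    std::smatch match;
    if (!std::regex_match(line, match, pattern))
        return false;

    frame.framePointer = match[1].str();
    frame.address = match[2].str();
    frame.symbol = match[3].str();
    frame.offset = match[4].str();
    return true;
}
```

The `symbol` field of each parsed frame is the part that would be passed to c++filt; lines that are not stack frames (such as the vm_page_fault messages) are simply rejected.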

Last edited 4 years ago by ttcoder

Changed 4 years ago by ttcoder

Directly compilable and usable (unlike the original in 12319)

comment:8 in reply to:  7 Changed 4 years ago by anevilyak

Resolution: invalid
Status: new → closed

Replying to ttcoder:

Discussion (to continue in a new/clean ticket maybe):

  • wouldn't it be better if the kernel "held tight" on the crashed thread, instead of letting it go ?

That presents its own problems, since in the case where something goes wrong, that leaves you with a permanently unkillable thread without a reboot.

  • if it isn't, how could Debugger give a hint about what happened ? Seeing a list of threads with no indication of any of them being stopped gives the (false) impression that the Debugger report is spurious and the team did not really crash. We probably don't want that.

All the Debugger knows is the kernel told it the thread exited, so there's no way for it to differentiate that. That having been said, to be perfectly blunt, the attached code sample is pretty much a case study in how not to do things.

  • kill_thread() should never be getting used as a matter of course by application code, it exists as a last resort, and only that, since among other things, any resources the thread may have allocated would have been leaked in such a case, i.e. any memory/objects/sockets allocated by ntp_update_time(), and in fact, could potentially even cause issues/corruption for future invocations within the same team. kill_thread() exists solely for the use of tools like kill or ProcessController. Furthermore, since Haiku is currently single user, any mistake can potentially result in it killing threads in other teams, leading to other unpredictable behavior.
  • The way UpdateSystemTime() currently works is pretty horribly broken, as it will result in the aforementioned condition getting triggered if there's any transient network lag or temporary loss in connectivity, and as such, if this is being run periodically will cause the hosting application to leak resources over time, eventually leading to other code randomly failing.
  • Last but not least, loading an image as an add-on that isn't designed to be, such as an application image, is asking for trouble. That will cause global constructors/destructors to be called, which, if there are global and/or static variables involved, could potentially mess with the internal state of the calling application. Since all the source code in question is available, if you really want to use it this way, it would make more sense to simply pull that code into your own function and call it directly rather than the series of messy kludges used here. Alternatively, submitting a patch that pulls the ntp functionality into a daemon/standalone app that could be invoked by others would also be nice.
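
As one possible alternative to killing a thread on timeout, the sketch below uses cooperative cancellation in portable C++: the worker polls a flag and returns on its own, so its destructors run and nothing it allocated is leaked. This is only an illustration of the general idea, not the code from the attachment; `slowNetworkUpdate()` and `updateWithTimeout()` are hypothetical names, and the sleep loop merely stands in for real network I/O.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <future>
#include <thread>

// Hypothetical stand-in for a slow network operation such as an NTP query.
// It checks the cancel flag between short steps so that it can exit on its
// own and release whatever it holds.
bool slowNetworkUpdate(const std::atomic<bool>& cancelRequested, int steps)
{
    for (int i = 0; i < steps; i++) {
        if (cancelRequested.load())
            return false;  // gave up cleanly
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    return true;  // finished normally
}

// Runs the update with a deadline. On timeout it asks the worker to stop
// and waits for it to return, instead of killing its thread.
bool updateWithTimeout(int steps, std::chrono::milliseconds timeout)
{
    std::atomic<bool> cancelRequested(false);
    std::future<bool> result = std::async(std::launch::async,
        slowNetworkUpdate, std::cref(cancelRequested), steps);

    if (result.wait_for(timeout) == std::future_status::timeout)
        cancelRequested.store(true);

    // Blocks until the worker has returned; false if it was cancelled.
    return result.get();
}
```

A call that is genuinely blocked inside a socket operation cannot poll a flag, so real code would also need socket timeouts or non-blocking I/O for this approach to work.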

All in all, you're pretty much causing your own problems here in many ways, and I really don't see any good reason to go out of the way to accommodate such a situation.

comment:9 in reply to:  5 Changed 4 years ago by bonefish

Replying to anevilyak:

I'm not entirely certain right now, but I believe a thread that's currently trapped in the kernel debug facilities can't be killed, Axel or Ingo might be able to answer that with more certainty.

I'm too lazy to look at the code or test it ATM. If I had to implement the debug facilities, I would allow killing debugged threads. Given that I did implement those, I'd assume killing a debugged thread is possible. It should be easy enough to test it by debugging a team and killing the debugged thread e.g. with ProcessController.
