Opened 10 years ago

Last modified 4 weeks ago

#6673 reopened bug

[registrar] can be crashed by running multiple instances of mimeset

Reported by: samui Owned by: bonefish
Priority: critical Milestone: R1
Component: Servers/registrar Version: R1/Development
Keywords: Cc:
Blocked By: Blocking: #11063
Platform: All

Description

reproducable bug. click on a video a gazillion times and see what happens.

sometimes you get and out of memory error, sometimes a bad port id error. both of these errors randomly crash deskbar and or tracker. the mouse freezes.

Sometimes you are dropped to kdl with the enclosed info.

256mb of ram, no virtual memory enabled, build 38817

Attachments (12)

test2LauchManyVideosTrackerCrash.png (54.1 KB ) - added by samui 10 years ago.
outOf256mbMem.png (79.0 KB ) - added by samui 10 years ago.
create_app_meta_mime.jpg (1.6 MB ) - added by diver 9 years ago.
MIMEManager.jpg (1.5 MB ) - added by diver 9 years ago.
kdl_mimeset.png (29.7 KB ) - added by diver 9 years ago.
syslog (797.1 KB ) - added by diver 9 years ago.
listport (55.9 KB ) - added by diver 9 years ago.
debug_session (46.4 KB ) - added by diver 9 years ago.
listport_form_debug_session (56.0 KB ) - added by diver 9 years ago.
registrar-189-debug-03-06-2014-13-39-03.report (11.6 KB ) - added by diver 6 years ago.
registrar-521-debug-24-08-2020-08-52-28.report (17.6 KB ) - added by madmax 4 weeks ago.
registrar-520-debug-24-08-2020-11-49-12.report (12.2 KB ) - added by madmax 4 weeks ago.

Change History (39)

by samui, 10 years ago

Attachment: outOf256mbMem.png added

comment:1 by stippi, 10 years ago

When you say "click on a video a gazillion times", you mean double click it in Tracker, right? Do you leave each instance open, or do you close it each time before you open it again? Since one of my changes to the media node based playback in MediaPlayer, the video producer node and the video consumer node share the same video buffers (to avoid having to copy a video frame when it is transfered from the producer to the consumer node). Upon destruction of the connection, the media_server failes to deallocate these buffers. I have not investigated this issue closer for the time being. Just a hint in case I don't get around to fixing that, for whoever wants to work on it later.

comment:2 by stargatefan, 10 years ago

I was wondering if maybe that was the issue with gnash. Buffer deallocation.

comment:3 by diver, 9 years ago

Component: Audio & VideoServers/registrar
Owner: changed from nobody to bonefish
Version: R1/alpha2R1/Development

I can reproduce this bug easily on real hardware by opening /boot/common/boot/post_install, selecting mime_update.sh and holding down the return key for about 30 seconds. Most of the time registar crashes in create_app_meta_mime thread or in main_mime thread (MIMEManager::MessageReceived). Sometimes it KDLs (see attached images). I tried to find a shorter way to reproduce this bug, but behavior has changed:

for i in `seq 1 10000`; do mimeset -apps -f /system/apps/*; done

After running this command (in vbox, so take with a grain of salt) simultaneously in several tabs registrar just dies silently and launching new apps throws Bad port ID error (sounds a lot like #381).

Last edited 5 years ago by diver (previous) (diff)

by diver, 9 years ago

Attachment: create_app_meta_mime.jpg added

by diver, 9 years ago

Attachment: MIMEManager.jpg added

by diver, 9 years ago

Attachment: kdl_mimeset.png added

comment:4 by diver, 9 years ago

After getting Bad port ID error Tracker crashed. I tried to find it in syslog after reboot and found something, but I'm not sure if it's really the tracker crash traces, but anyway:

KERN: 16665: DEBUGGER: The clipboard must be locked before proceeding.
KERN: debug_server: Thread 16665 entered the debugger: Debugger call: `The clipboard must be locked before proceeding.'
KERN: stack trace, current PC 0xffff0114  :
KERN:   (0x782068bc)  0x2c5821  _ZNK10BClipboard13_AssertLockedEv + 0x39
KERN:   (0x782068dc)  0x2c5849  _ZNK10BClipboard4DataEv + 0x1d
KERN:   (0x782068fc)  0x518b64  _Z23FSClipboardFindNodeModePN8BPrivate5ModelEbb + 0x6c
KERN:   (0x782069dc)  0x57d3e0  _ZN8BPrivate9BPoseView11CreatePosesEPPNS_5ModelEPNS_8PoseInfoElPPNS_5BPoseEbPlP5BRectb + 0x814
KERN:   (0x78206b3c)  0x5891a6  _ZN8BPrivate9BPoseView15MessageReceivedEP8BMessage + 0x386
KERN:   (0x78206c3c)  0x2caa38  _ZN7BLooper15DispatchMessageEP8BMessageP8BHandler + 0x50
KERN:   (0x78206c5c)  0x38e404  _ZN7BWindow15DispatchMessageEP8BMessageP8BHandler + 0x19da
KERN:   (0x78206f1c)  0x389d09  _ZN7BWindow11task_looperEv + 0x22b
KERN:   (0x78206f9c)  0x2cc29a  _ZN7BLooper7_task0_EPv + 0x30
KERN:   (0x78206fbc)  0x66518d  thread_entry + 0x2d

comment:5 by diver, 9 years ago

Summary: Deskbar freezes under heavy load with mediaplayer[registrar] can be crashed by running multiple instances of mimeset

comment:6 by diver, 9 years ago

Could it be that MediaPlayer registers mime types on every start? Then it could explain observed behavior from the description with regard to registrar.

comment:7 by axeld, 9 years ago

Priority: normalcritical

comment:8 by bonefish, 9 years ago

The KDL is tracked in #5474.

Regarding the Tracker crash, that's due to BPoseView::CreatePoses() not checking the return value of be_clipboard->Lock(), just assuming it worked. Since BClipboard::Lock() does synchronous messenging, which might require a port to be created on the fly, running out of ports can cause BClipboard::Lock() to fail.

Why the system is running out of ports is an interesting question, though. The registrar seems to limit the MIME update threads to 12, denying any further requests after the limit is reached. So, at least in theory, there shouldn't be any resources piling up in the registrar due to MIME update requests. The output of a listport (or ports in KDL -- long, so skip and fetch from syslog) in such a situation would at least help to identify who is leaking/hogging what kind of ports.

Why the registrar crashes is another interesting question. One stack trace is an hoard assert, so the heap had been corrupted. The other is an outright crash in MIMEManager::MessageReceived(), though I haven't spotted any suspicious code (the syslog would have been nice). Might also be a side effect of a heap corruption.

What looks a bit suspicious is RegistrarThreadManager. It uses atomic operations to change fThreadCount, but doesn't do any locking to protect the thread list (fThreads). I haven't investigated this further, but so far I've seen the RegistrarThreadManager only being used in the MIMEManagers thread. So the atomic operations might simply be superfluous.

in reply to:  8 ; comment:9 by diver, 9 years ago

Replying to bonefish:

The output of a listport (or ports in KDL -- long, so skip and fetch from syslog) in such a situation would at least help to identify who is leaking/hogging what kind of ports.

My keyboard doesn't work in KDL, I do have auto_stack_trace installed, though. Is there a way to modify it to auto execute listport/ports. Also, how should I enter KDL? Alt-Sysreq-D just prints a small backtrace for keyboard thread and hangs.

in reply to:  9 comment:10 by bonefish, 9 years ago

Replying to diver:

Replying to bonefish:

My keyboard doesn't work in KDL, I do have auto_stack_trace installed, though. Is there a way to modify it to auto execute listport/ports. Also, how should I enter KDL? Alt-Sysreq-D just prints a small backtrace for keyboard thread and hangs.

listport is a command line program. I mentioned the KDL alternative only in case listport cannot be executed/doesn't work correctly in this situation.

With a USB keyboard you have to at least once enter KDL via Alt-SysReq-D, since otherwise it won't work. If it doesn't work anyway (e.g. because it's OHCI) you're out of luck. I'm not aware of any method to easily execute other KDL commands automatically. I think auto_stack_trace is mostly superfluous nowadays, as any panic prints a stack trace anyway and when voluntarily entering KDL one usually can (and wants to) do other/more things. auto_stack_trace cannot be configured. If you build your images yourself, you could hack it to execute another command instead of sc, though.

comment:11 by diver, 9 years ago

I tried to reproduce it several times in hrev42467, but I don't get these crashes anymore. I got new one #7864, though.

comment:12 by diver, 9 years ago

Tried to reproduce it again (by starting mime_update.sh many times) and got fullscreen GDB once again. Unfortunately, my keyboard didn't work there. I tried to reproduce it one more time after reboot but didn't manage to crash it, instead after some time running process started by mime_update.sh stopped quitting. CPU usage was always at max, there was a lot of free memory, though.

by diver, 9 years ago

Attachment: syslog added

comment:13 by diver, 9 years ago

Managed to crash it again, this time with ssh access:

KERN: 60: DEBUGGER: getNumAvailable() == 0
KERN: debug_server: Thread 60 entered the debugger: Debugger call: `getNumAvailable() == 0'
KERN: stack trace, current PC 0xffff0114  :
KERN:   (0x700c1b7c)  0x5f1062  __assert_fail + 0x46
KERN:   (0x700c1b9c)  0x665504  _ZN8BPrivate9hoardHeap23findAvailableSuperblockEiRPNS_5blockEPNS_11processHeapE + 0x34c
KERN:   (0x700c1bec)  0x663431  _ZN8BPrivate10threadHeap6mallocEm + 0xa3
KERN:   (0x700c1c3c)  0x663949  malloc + 0x1b5
KERN:   (0x700c1c5c)  0x5ed4bf  _ZN8BPrivate8KMessage14_AllocateSpaceElbbPPvPl + 0x95
KERN:   (0x700c1c9c)  0x5ed64d  _ZN8BPrivate8KMessage9_AddFieldEPKcmlPNS_13KMessageFieldE + 0x3b
KERN:   (0x700c1cec)  0x5ee261  _ZN8BPrivate8KMessage7AddDataEPKcmPKvlb + 0xa5
KERN:   (0x700c1d4c)  0x22567d  _ZN21AuthenticationManager14_RequestThreadEv + 0x11c3
KERN:   (0x700c1f9c)  0x2272db  _ZN21AuthenticationManager19_RequestThreadEntryEPv + 0x1b
KERN:   (0x700c1fbc)  0x5ea957  thread_entry + 0x1d 
KERN: ps2: possibly a hot plugin of input/keyboard/at/0
KERN: ps2: devfs_publish_device input/keyboard/at/0, status = 0x00000000
KERN: ps2: keyboard found

After that I connected ps2 keyboard which didn't work anyway.

by diver, 9 years ago

Attachment: listport added

comment:14 by diver, 9 years ago

Running kernel_debugger from ssh made my ps2 keyboard work in KDL, see debug_session attachment.

by diver, 9 years ago

Attachment: debug_session added

by diver, 9 years ago

Attachment: listport_form_debug_session added

comment:15 by diver, 9 years ago

This behavior seems to be related to gcc4 as I'm unable to reproduce these issues under gcc2.

comment:16 by diver, 8 years ago

registrar still silently crashes in hrev44065 gcc4 while running:

for i in `seq 1 100000`; do mimeset -apps -f /system/apps/*; done

in 6 Terminals simultaneously. debug_server doesn't seem to catch it. Every open terminal starts printing:

FATAL: be_roster is not valid. Is the registrar running?

registrar is easily restartable using on_exit /system/servers/registrar KDL command.

Last edited 8 years ago by diver (previous) (diff)

comment:17 by diver, 6 years ago

hrev47261 gcc2hybrid.

Out of curiosity I tried the above command with a freshly unpacked archive of BeOS apps archive just to see how things are now two years later. Registrar crashed, but this time this was with gcc2 image and I managed to save debug report.

comment:18 by diver, 6 years ago

Blocking: 11063 added

(In #11063) Seems to be a dupe of #6673.

comment:19 by CodeforEvolution, 4 weeks ago

As of hrev54154+119 on a x86_gcc2 Haiku system, I can't reproduce the registrar crash anymore after stress testing the registrar with up to 30 instances of "for i in seq 1 100000; do mimeset -apps -f /system/apps/*; done" running concurrently.

The only error I notice comes after 8-10 instances of mimeset running concurrently, in which case the Terminal will start printing "mimeset: no more threads", which is less troubling than a crash. After closing the Terminal (which automatically kills all mimeset jobs), everything continues to work as expected: Apps launch, mimetyping works, and no errors are even reported to the syslog. I can even try the stress test again!

Could someone double check on a newer version of the x86_64 version of Haiku? If all is well, I think this commit a while back may have fixed what was going on: https://git.haiku-os.org/haiku/commit/?id=462bfeede0cd123afe2e79d465876289e925ca53

comment:20 by CodeforEvolution, 4 weeks ago

Upon further analysis of the registrar codebase, the “no more threads” error is an expected behavior/safety mechanism to make sure the registrar isn’t doing too much work at once (the current limit is hard-wired into the registrar at 12 concurrent tasks).

Unless anyone else has any problems, I believe this problem was fixed in hrev49570. Diver’s last test was in hrev47261, so I believe the test results may not longer be a problem.

Thus, I believe this critical ticket can be closed! :)

in reply to:  20 comment:21 by nielx, 4 weeks ago

Resolution: fixed
Status: newclosed

Replying to CodeforEvolution:

Upon further analysis of the registrar codebase, the “no more threads” error is an expected behavior/safety mechanism to make sure the registrar isn’t doing too much work at once (the current limit is hard-wired into the registrar at 12 concurrent tasks).

Unless anyone else has any problems, I believe this problem was fixed in hrev49570. Diver’s last test was in hrev47261, so I believe the test results may not longer be a problem.

Thus, I believe this critical ticket can be closed! :)

Thanks for testing! Indeed, let's close it.

comment:22 by madmax, 4 weeks ago

Sorry to comment just after the ticket has been closed.

I've attached a crash report for hrev54447 x86_64. I guess after so many years it's not the same bug, but it certainly is the same reproducer with the same effects.

Maybe I was just lucky, but I got the crash on real hardware when I was still starting those "mimeset bomb loops", and I couldn't reproduce it in three 15-minute runs in qemu-kvm with different smp settings.

comment:23 by nielx, 4 weeks ago

Resolution: fixed
Status: closedreopened

No worries, thanks for posting the stack traces, this will allow us to further investigate the issue.

comment:24 by CodeforEvolution, 4 weeks ago

Thank you for the stack trace! I think two threads must of collided...I see a "General protection fault" on one thread, and a "Call (_numAvailable <= _numBlocks)" debugger call on the other thread. They also were doing the exact same thing, using BIconUtils::GetVectorIcon().

I wonder if these two threads collided because they were working on the same file...

comment:25 by waddlesplash, 4 weeks ago

The second is a malloc assertion, it sounds like heap corruption.

comment:26 by madmax, 4 weeks ago

I tried again after updating in case hrev54529 helped, but still crashed. The stack trace is quite similar this time on getting memory instead of releasing, if I understand it correctly.

I can't debug this myself, but I can apply patches, compile and follow directions. If there's anything I can do to help track this down, just tell me.

comment:27 by CodeforEvolution, 4 weeks ago

That really is strange...as waddlesplash mentioned, that really does look like memory corruption of sorts...and once again, it was in BIconUtils::GetVectorIcon...

Anyway, thank you for these logs! These sort of crashes requires all the info we can get! :)

Note: See TracTickets for help on using tickets.