Opened 8 years ago

Closed 6 years ago

#8275 closed bug (fixed)

Deadlocking apps when opening documents

Reported by: humdinger Owned by: leavengood
Priority: normal Milestone: R1/beta1
Component: System/runtime_loader Version: R1/Development
Keywords: Cc:
Blocked By: Blocking: #8081, #8955
Has a Patch: no Platform: All

Description

This is hrev43630 (but I've seen this with older revisions as well)

When I doubleclick text files (and less frequently images), StyledEdit starts up, gets added to Deskbar but never shows the document. Deadlocked was mmlr's first diagnosis in IRC.

Here are the ProcessController-debugged backtraces of StyledEdit's two threads:

StyledEdit's main thread:

[tcsetpgrp failed in terminal_inferior: Invalid Argument]
[Switching to team /boot/system/apps/StyledEdit (2923) thread StyledEdit (2923)]
0xffff0114 in ?? ()
(gdb) bt
#0  0xffff0114 in ?? ()
#1  0x0084450d in acquire_sem_etc () from /boot/system/lib/libroot.so
#2  0x00431544 in BLocker::AcquireLock () from /boot/system/lib/libbe.so
#3  0x0043137e in BLocker::Lock () from /boot/system/lib/libbe.so
#4  0x00808518 in BPrivate::MutableLocaleRoster::LoadCatalog ()
   from /boot/system/lib/liblocale.so
#5  0x007f3c62 in BCatalog::SetCatalog () from /boot/system/lib/liblocale.so
#6  0x00804e7c in BLocaleRoster::_GetCatalog ()
   from /boot/system/lib/liblocale.so
#7  0x02195555 in BLocaleRoster::GetCatalog ()
   from /boot/system/add-ons/Translators/JPEG2000Translator
#8  0x0216fb31 in __static_initialization_and_destruction_0 ()
   from /boot/system/add-ons/Translators/JPEG2000Translator
#9  0x0216fb96 in global constructors keyed to SSlider::SSlider ()
   from /boot/system/add-ons/Translators/JPEG2000Translator
#10 0x02195648 in __do_global_ctors_aux ()
   from /boot/system/add-ons/Translators/JPEG2000Translator
#11 0x0216c325 in _init ()
   from /boot/system/add-ons/Translators/JPEG2000Translator
#12 0x0010090c in init_dependencies () from /boot/system/runtime_loader
#13 0x00101064 in load_library () from /boot/system/runtime_loader
#14 0x00104222 in export_load_add_on () from /boot/system/runtime_loader
#15 0x00842674 in load_add_on () from /boot/system/lib/libroot.so
#16 0x004f0af5 in BTranslatorRoster::Private::CreateTranslators ()
---Type <return> to continue, or q <return> to quit---
   from /boot/system/lib/libtranslation.so
#17 0x004f0295 in BTranslatorRoster::Private::AddPath ()
   from /boot/system/lib/libtranslation.so
#18 0x004efed6 in BTranslatorRoster::Private::AddDefaultPaths ()
   from /boot/system/lib/libtranslation.so
#19 0x004f3a38 in BTranslatorRoster::AddTranslators ()
   from /boot/system/lib/libtranslation.so
#20 0x004f3985 in BTranslatorRoster::Default ()
   from /boot/system/lib/libtranslation.so
#21 0x004ed911 in BTranslationUtils::GetStyledText ()
   from /boot/system/lib/libtranslation.so
#22 0x0020f0a1 in StyledEditView::GetStyledText ()
#23 0x00217d76 in StyledEditWindow::_LoadFile ()
#24 0x00212be4 in StyledEditWindow::OpenFile ()
#25 0x0020f8fd in StyledEditWindow::StyledEditWindow ()
#26 0x0020e6f1 in StyledEditApp::OpenDocument ()
#27 0x0020e991 in StyledEditApp::RefsReceived ()
#28 0x002ef0b9 in BApplication::DispatchMessage ()
   from /boot/system/lib/libbe.so
#29 0x002f9bf5 in BLooper::task_looper () from /boot/system/lib/libbe.so
#30 0x002edd7a in BApplication::Run () from /boot/system/lib/libbe.so
#31 0x0020ec3b in main ()

StyledEdit's open (file panel, right?) thread:

[tcsetpgrp failed in terminal_inferior: Invalid Argument]
[Switching to team /boot/system/apps/StyledEdit (3725) thread w>StyledEdit: Open (3729)]
0xffff0114 in ?? ()
(gdb) bt
#0  0xffff0114 in ?? ()
#1  0x0010b6c2 in mutex_lock () from /boot/system/runtime_loader
#2  0x001088ad in recursive_lock_lock () from /boot/system/runtime_loader
#3  0x0010159b in get_symbol () from /boot/system/runtime_loader
#4  0x008426f1 in get_image_symbol () from /boot/system/lib/libroot.so
#5  0x00805479 in BPrivate::CatalogAddOnInfo::MakeSureItsLoaded ()
   from /boot/system/lib/liblocale.so
#6  0x00808589 in BPrivate::MutableLocaleRoster::LoadCatalog ()
   from /boot/system/lib/liblocale.so
#7  0x007f3a38 in BCatalog::BCatalog () from /boot/system/lib/liblocale.so
#8  0x00804b81 in BLocaleRoster::GetLocalizedFileName ()
   from /boot/system/lib/liblocale.so
#9  0x00605065 in BPrivate::Model::CacheLocalizedName ()
   from /boot/system/lib/libtracker.so
#10 0x00604ea2 in BPrivate::Model::OpenNodeCommon ()
   from /boot/system/lib/libtracker.so
#11 0x00604ad9 in BPrivate::Model::OpenNode ()
   from /boot/system/lib/libtracker.so
#12 0x00622333 in BPrivate::BPoseView::CreatePoses ()
   from /boot/system/lib/libtracker.so
#13 0x0062374a in BPrivate::BPoseView::MessageReceived ()
   from /boot/system/lib/libtracker.so
#14 0x002f82cb in BLooper::DispatchMessage () from /boot/system/lib/libbe.so
#15 0x003ca241 in BWindow::DispatchMessage () from /boot/system/lib/libbe.so
---Type <return> to continue, or q <return> to quit---
#16 0x005d6291 in BPrivate::TFilePanel::DispatchMessage ()
   from /boot/system/lib/libtracker.so
#17 0x003ce5c8 in BWindow::task_looper () from /boot/system/lib/libbe.so
#18 0x002f97e7 in BLooper::_task0_ () from /boot/system/lib/libbe.so
#19 0x008448d3 in thread_entry () from /boot/system/lib/libroot.so
#20 0x78033fec in ?? ()

Change History (18)

comment:1 by diver, 8 years ago

#8081 might be related.

comment:2 by humdinger, 8 years ago

I was just about to open a new ticket when I saw that the backtraces are similar to the ones in this ticket. Seems like the JPEG is involved somehow; the Translator? Here's the text I was going to put in the new ticket:

I added a JPEG by drag&dropping it into the frame in the People app and saved. Now I cannot open that Person file anymore. It shows up in Deskbar, but doesn't open it's window...

Here are the backtraces of People's two threads from debugging them with ProcessController:

People thread:

[Switching to team /boot/system/apps/People (1697) thread People (1697)]
0xffff0114 in ?? ()
(gdb) bt
#0  0xffff0114 in ?? ()
#1  0x0074750d in acquire_sem_etc () from /boot/system/lib/libroot.so
#2  0x0042d544 in BLocker::AcquireLock () from /boot/system/lib/libbe.so
#3  0x0042d37e in BLocker::Lock () from /boot/system/lib/libbe.so
#4  0x0070b518 in BPrivate::MutableLocaleRoster::LoadCatalog ()
   from /boot/system/lib/liblocale.so
#5  0x006f6c62 in BCatalog::SetCatalog () from /boot/system/lib/liblocale.so
#6  0x00707e7c in BLocaleRoster::_GetCatalog ()
   from /boot/system/lib/liblocale.so
#7  0x02194555 in BLocaleRoster::GetCatalog ()
   from /boot/system/add-ons/Translators/JPEG2000Translator
#8  0x0216eb31 in __static_initialization_and_destruction_0 ()
   from /boot/system/add-ons/Translators/JPEG2000Translator
#9  0x0216eb96 in global constructors keyed to SSlider::SSlider ()
   from /boot/system/add-ons/Translators/JPEG2000Translator
#10 0x02194648 in __do_global_ctors_aux ()
   from /boot/system/add-ons/Translators/JPEG2000Translator
#11 0x0216b325 in _init ()
   from /boot/system/add-ons/Translators/JPEG2000Translator
#12 0x0010090c in init_dependencies () from /boot/system/runtime_loader
#13 0x00101064 in load_library () from /boot/system/runtime_loader
#14 0x00104222 in export_load_add_on () from /boot/system/runtime_loader
#15 0x00745674 in load_add_on () from /boot/system/lib/libroot.so
#16 0x006d8af5 in BTranslatorRoster::Private::CreateTranslators ()
   from /boot/system/lib/libtranslation.so
#17 0x006d8295 in BTranslatorRoster::Private::AddPath ()
   from /boot/system/lib/libtranslation.so
#18 0x006d7ed6 in BTranslatorRoster::Private::AddDefaultPaths ()
   from /boot/system/lib/libtranslation.so
#19 0x006dba38 in BTranslatorRoster::AddTranslators ()
   from /boot/system/lib/libtranslation.so
#20 0x006db985 in BTranslatorRoster::Default ()
   from /boot/system/lib/libtranslation.so
#21 0x00215a89 in PictureView::_LoadPicture ()
#22 0x002139cc in PictureView::Update ()
#23 0x002136fe in PictureView::PictureView ()
#24 0x0020ed39 in PersonView::PersonView ()
#25 0x002114fb in PersonWindow::PersonWindow ()
#26 0x0020e05c in TPeopleApp::_NewWindow ()
#27 0x0020dec4 in TPeopleApp::RefsReceived ()
#28 0x002eb0b9 in BApplication::DispatchMessage ()
   from /boot/system/lib/libbe.so
#29 0x002f5bf5 in BLooper::task_looper () from /boot/system/lib/libbe.so
#30 0x002e9d7a in BApplication::Run () from /boot/system/lib/libbe.so
#31 0x0020c78f in main ()
(gdb) 

w>People: open thread:

[Switching to team /boot/system/apps/People (1753) thread w>People: Open (1758)]
0xffff0114 in ?? ()
(gdb) bt
#0  0xffff0114 in ?? ()
#1  0x0010b6c2 in mutex_lock () from /boot/system/runtime_loader
#2  0x001088ad in recursive_lock_lock () from /boot/system/runtime_loader
#3  0x00100f1f in load_library () from /boot/system/runtime_loader
#4  0x00104222 in export_load_add_on () from /boot/system/runtime_loader
#5  0x00745674 in load_add_on () from /boot/system/lib/libroot.so
#6  0x0070845d in BPrivate::CatalogAddOnInfo::MakeSureItsLoaded ()
   from /boot/system/lib/liblocale.so
#7  0x0070b589 in BPrivate::MutableLocaleRoster::LoadCatalog ()
   from /boot/system/lib/liblocale.so
#8  0x006f6a38 in BCatalog::BCatalog () from /boot/system/lib/liblocale.so
#9  0x00707b81 in BLocaleRoster::GetLocalizedFileName ()
   from /boot/system/lib/liblocale.so
#10 0x005e9065 in BPrivate::Model::CacheLocalizedName ()
   from /boot/system/lib/libtracker.so
#11 0x005e8ea2 in BPrivate::Model::OpenNodeCommon ()
   from /boot/system/lib/libtracker.so
#12 0x005e8ad9 in BPrivate::Model::OpenNode ()
   from /boot/system/lib/libtracker.so
#13 0x00606333 in BPrivate::BPoseView::CreatePoses ()
   from /boot/system/lib/libtracker.so
#14 0x0060774a in BPrivate::BPoseView::MessageReceived ()
   from /boot/system/lib/libtracker.so
#15 0x002f42cb in BLooper::DispatchMessage () from /boot/system/lib/libbe.so
#16 0x003c6241 in BWindow::DispatchMessage () from /boot/system/lib/libbe.so
#17 0x005ba291 in BPrivate::TFilePanel::DispatchMessage ()
   from /boot/system/lib/libtracker.so
#18 0x003ca5c8 in BWindow::task_looper () from /boot/system/lib/libbe.so
#19 0x002f57e7 in BLooper::_task0_ () from /boot/system/lib/libbe.so
#20 0x007478d3 in thread_entry () from /boot/system/lib/libroot.so
#21 0x78033fec in ?? ()
(gdb) 

comment:3 by humdinger, 8 years ago

Found the culprit: Uncheck the option "Translate application and folder names in Deskbar and Tracker" in the Locale preferences and these issues disappear. Note, you don't even have to set a second language, it also happens when there's only the English locale around.

comment:4 by stippi, 8 years ago

Here is my analysis of the problem. I was observing the same behavior and searched Trac if this was already reported.

What happens is that StyledEdit has two threads threads running at the point of the dead-lock, the file panel thread and it's BApplication thread. During initializaton, the file panel thread wants to translate some system file name. As can be seen in the stack trace, MutableLocalRoster calls MakeSureItsLoaded() which in turn looks up an image symbol. At this point, it has already grabbed the default locale roster lock. Concurrently, the app is trying to load the text file. In StyledEdit, this works via translators. As can be seen in the stack trace, the BTranslationRoster is loading Translator add-ons. At some point, it also deals with image and add-on functions (load_add_on()). Presumably, it has at that point grabbed some lock, which the file panel thread also wants to grab in the get_image() function. So this is a classic dead-lock situation because of reversed locking order. One thread first grabs the locale roster lock, then tries to grab an image related lock. The other thread first grabs the image related lock, then tries to grab the locale roster lock. Each thread is waiting for the other one to release the respective lock.

I am just writing this down, since I don't know when I will have time to look into it more. But the solution seems to be to either not hold the locale roster lock when loading locale add-ons, or to make sure the locking order is always the same.

comment:5 by stippi, 8 years ago

One possible solution could be to export the runtime loader global locking functions in runtime_loader_private.h and make sure that the runtime loader lock is already locked before acquiring the default locale roster lock whenever it would deal with add-on and image functionality while holding it.

in reply to:  5 ; comment:6 by bonefish, 8 years ago

Component: Kits/Locale KitSystem/runtime_loader
Owner: changed from pulkomandy to bonefish

Replying to stippi:

One possible solution could be to export the runtime loader global locking functions in runtime_loader_private.h and make sure that the runtime loader lock is already locked before acquiring the default locale roster lock whenever it would deal with add-on and image functionality while holding it.

No, the problem really is the runtime loader locking. It uses a single mutex which it acquires whenever it does, well, anything really. While for the most part that is questionable just for performance reasons, when calling the shared object initialization or finalization functions this is actually a problem, since those may execute any kind of code which doesn't even have to know that is was called in that context.

The runtime loader locking has to be rethought. Maybe a two level locking would already do the trick: One lock to protect the image lists and one lock per image.

comment:7 by diver, 7 years ago

Blocking: 8081 added

(In #8081) Most likely as dupe of #8275.

comment:8 by diver, 7 years ago

Milestone: R1R1/alpha4

in reply to:  6 ; comment:9 by leavengood, 7 years ago

Replying to bonefish:

The runtime loader locking has to be rethought. Maybe a two level locking would already do the trick: One lock to protect the image lists and one lock per image.

There are even TODOs in the runtime_loader's elf.cpp (added by you about 3 years ago) about improving the locking :)

Is there any chance you or anyone else sufficiently skilled could fix this in the next few weeks? If not I could take a stab, but low-level stuff like this isn't my forte. I'm taking a look at alpha4 marked issues, though I suppose this could be ignored for alpha4. It is annoying but it happens pretty infrequently in my testing.

in reply to:  9 comment:10 by bonefish, 7 years ago

Replying to leavengood:

Replying to bonefish: There are even TODOs in the runtime_loader's elf.cpp (added by you about 3 years ago) about improving the locking :)

Yes, there's a TODO mentioning that locking should be improved. I may have added it, but possibly I just converted/reinterpreted a preexisting comment. IIRC we inherited the locking strategy from NewOS and there had already been a comment mentioning that it is suboptimal. Though, I believe that was only considering the performance aspect of the problem (making the runtime loader essentially single-threaded), not that this is actually a functional bug.

Is there any chance you or anyone else sufficiently skilled could fix this in the next few weeks?

I don't think I'll have time to work on that anytime soon. I can't speak for anyone else, but in open source it is usually safe to assume that if you don't do it yourself nobody will. :-)

If not I could take a stab, but low-level stuff like this isn't my forte.

While the runtime loader may be considered low-level, the abstract problem at hand isn't anything you couldn't just as well encounter in your multi-threaded application.

I'm taking a look at alpha4 marked issues, though I suppose this could be ignored for alpha4. It is annoying but it happens pretty infrequently in my testing.

Yeah, well, that might also be depending on the test machine. What happens infrequently on your machine might happen relatively often on others and not at all on yet others.

BTW, it should be pretty easy to implement a reliable test case, if that helps. In pseudo-code:

    static BLocker sLock;

    static struct Foo {
        Foo()
        {
            start new thread in thread_function();
            sleep 1s;
            sLock.Lock();
            sLock.Unlock();
        }
    } sFoo;

    thread_function()
    {
        sLock.Lock();
        call any runtime loader service, e.g. `get_image_symbol()`;
        sLock.Unlock();
    }

    main() {}

The Foo constructor tries to lock sLock while holding the runtime loader lock. thread_function() tries to lock the runtime loader lock while holding sLock.

In case you or someone else wants to tackle the issue, I'll try and give and overview of the situation:

  • There's a global list, sLoadedImages plus sLoadedImageCount in images.cpp, which contains objects (the loaded images, image_t). This list is constructed while loading the program, but it can change later on as add-ons or shared libraries are dynamically loaded and unloaded.
  • That's an image's lifetime:
    • The image is loaded, that is the ELF file header is processed, the segments are mapped into memory, an image object is created and added to the global list (sLoadedImages).
    • The image's dependencies are loaded. Like the first step, just for all shared libraries the first image depends on, recursively. There's a ref-counting in the image objects. A dependency is only loaded once. Its reference count is incremented for each other image that depends on it.
    • The image and the images it depends on are relocated (only the dependencies that haven't been relocated yet). This happens in topological order, though I think the order doesn't actually matter in this case.
    • The read-only segments of the images are remapped read-only (needed to be writable for relocation).
    • The image and its dependencies are initialized (only the not yet initialized ones). Initialization means, that the initialization function of the image is called, the one that calls static constructors and the like, i.e. this calls code in the image, outside the runtime loader. The initialization happens bottom up, dependencies before images depending on them.
    • The image object remains unchanged until the image is unloaded. Unloading an image only decrements its reference count. Unless it drops to 0, nothing else happens. Otherwise ...
    • The image is moved from sLoadedImages to sDisposableImages.
    • For the images it depends on the previous two steps and this one are executed.
    • For each image that ends up in sDisposableImages (i.e. has a ref count of 0) its deinitialization function is called (calling static destructors etc.), the image is unmapped, and the image object is destroyed.
  • After loading the program image and its dependencies the runtime loader is essentially a library, providing the following kinds of services to the program:
    • Dynamically loading/unloading an add-on/library.
    • Looking up a symbol by name in a given library or globally.
    • Reverse looking up a symbol by address.
    • Iterating through loaded images.
    • Iterating through the symbols of an image (actually looking one up by index).
    • Iterating through the direct dependencies of an image.
  • There's a single global runtime loader lock, which is held when the program is initially loaded and when any of the provided services is performed. In particular it is always held when calling an image initialization or deinitialization function.

The latter is the main issue. Since an image must be fully initialized before it can be used, just dropping the lock before calling the initialization function is not possible. An image's initialization function must not be called before the initialization function of all dependencies have been called (and returned). Analogously the deinitialization function of an image must not be called before the deinitialization function of all images depending on it have been called (and returned). Consequently:

  • The single global lock approach doesn't work. A global lock for the image list is necessary -- it should become an innermost lock -- but additionally one or more locks per image are needed.
  • The image lists compiled when loading an image (the image plus its dependencies) and when unloading it (sDisposableImages) probably won't work anymore. I think a more dynamic approach is necessary -- namely depth first search with dynamic locking. The already existing ref-counting will help a lot, since it can make sure that image objects won't die while playing with them without holding the global lock. Note that while it looks like the images could be individually fully loaded (mapped, relocated, and initialized) in a bottom up manner, it is necessary to keep the phases separate. Relocation must wait until all images are mapped and in the search list for symbols, so that undefined library symbols (defined in the app) and symbol preemption are supported correctly. Analogously initialization must wait until all images have been relocated (since the initialization function may use an undefined or preempted symbol).
  • IIRC (needs checking), after putting an image object in the global list, the image object is basically immutable until the image is about to be unloaded. That could be made use of, e.g. by using a lockless strategy (for the members that is), read-write locking, or "init once" locking primitives. There are exceptions, though:
    • The image_t::flags field contains bits for relocation and initialization status, which change when/after relocating/initializing the image. One or more other bits may be used temporarily for algorithms. This needs to be considered/changed accordingly.
    • After fork() the image IDs of the child process differ from those of the parent. The (still old) ID stored in the image structures is updated lazily (update_image_ids()).

In case this isn't clear already, this task is probably a bit of work (as in more than a few hours), though it will depend on how smoothly things go.

comment:11 by humdinger, 7 years ago

Hey, Ingo. Think you could turn your last comment into a Blog-O-Sphere post. Seems to be a lot of interesting info that would be lost and forgotten once Ryan has fixed this ticket. :)

in reply to:  11 comment:12 by bonefish, 7 years ago

Replying to humdinger:

Hey, Ingo. Think you could turn your last comment into a Blog-O-Sphere post. Seems to be a lot of interesting info that would be lost and forgotten once Ryan has fixed this ticket. :)

I don't find a blog post all that nice either. Sure, one can search the site and possibly find the post, but I'd much prefer a documentation of the OS internals in the wiki.

comment:13 by humdinger, 7 years ago

OK, the wiki it is then. :)

comment:14 by leavengood, 7 years ago

Maybe I'll take a stab at fixing this some day given Ingo's extensive description, but I don't think I'll get it done now for alpha4.

comment:15 by diver, 7 years ago

Blocking: 8955 added

(In #8955) Seems like a dupe of #8275.

comment:16 by scottmc, 7 years ago

Milestone: R1/alpha4R1/beta1
Owner: changed from bonefish to leavengood
Status: newassigned

sliding this one out to R1/beta, and giving it to leavengood.

comment:17 by humdinger, 6 years ago

On hrev46645 I can't reproduce anymore. I'll keep the translation of Deskbar/Tracker setting on for some time now (maybe others do to) and if the issue does not crop up anymore, maybe it's been fixed on the way...

comment:18 by humdinger, 6 years ago

Resolution: fixed
Status: assignedclosed

Closing ticket as apparently fixed. Re-open, if it's not...

Note: See TracTickets for help on using tickets.