Opened 18 months ago
Last modified 2 months ago
#18451 new bug
Assertion failure (mutex was not actually locked) in libroot hoard malloc
Reported by: | waddlesplash | Owned by: | nobody |
---|---|---|---|
Priority: | normal | Milestone: | R1/beta6 |
Component: | System/libroot.so | Version: | R1/beta4 |
Keywords: | Cc: | ||
Blocked By: | #13554 | Blocking: | #18472, #18477 |
Platform: | All |
Description
This triggers unreliably (1 in 5-10 or so) with certain applications, following the addition of those assertions (the new mutex code is not needed to trigger it.)
The stack trace after the assert is always just this:
0x7f6d30f69b60 0xce64b42d98 BPrivate::processHeap::free(void*) + 0x138 0x7f6d30f69b90 0xce64b43f84 free + 0x44 ...
Change History (18)
comment:1 by , 18 months ago
comment:2 by , 18 months ago
The most reliable ways to trigger this problem are with QtWebEngine and GTKWebKit. With QtWebEngine, CanvasMark is a reliable reproducer; with GTKWebKit, Haiku's own forums seem to trigger it.
comment:3 by , 18 months ago
Oho, it's possible to trigger it using WebPositive and trying to access the Haiku forums also, not just GTKWebKit.
follow-up: 5 comment:4 by , 18 months ago
The full stacktrace would be useful. This may just be some kind of heap corruption.
comment:5 by , 18 months ago
Replying to pulkomandy:
The full stacktrace would be useful. This may just be some kind of heap corruption.
Example here.
thread 3597: pthread func state: Call (mutex was not actually locked!) Frame IP Function Name ----------------------------------------------- 00000000 0xce64ab8da7 _kern_debugger + 0x7 Disassembly: _kern_debugger: 0x000000ce64ab8da0: 48c7c0e5000000 mov $0xe5, %rax 0x000000ce64ab8da7: 0f05 syscall <-- 0x7f6d30f69b60 0xce64b42d98 BPrivate::processHeap::free(void*) + 0x138 0x7f6d30f69b90 0xce64b43f84 free + 0x44 0x7f6d30f69bf0 0x14afeb6e206 _ZNSt6vectorISt4pairIt13scoped_refptrIN2cc4TaskEEESaIS5_EE17_M_realloc_insertIJS5_EEEvN9__gnu_cxx17(, ) + 0x116 0x7f6d30f69c60 0x14afeb703c9 cc::TaskGraphWorkQueue::GetNextTaskToRun(unsigned short) + 0x269 0x7f6d30f69d30 0x14b0059dd0b content::CategorizedWorkerPool::RunTaskInCategoryWithLockAcquired(cc::TaskCategory) + 0x2b 0x7f6d30f69d80 0x14b0059dfea content::CategorizedWorkerPool::Run(std::vector<cc::TaskCategory, std::allocator<cc::TaskCategory> > const&, base::ConditionVariable*) + 0x6a 0x7f6d30f69db0 0x14afe2e5b48 base::_GLOBAL__N_1::ThreadFunc(void*) + 0x48 0x7f6d30f69dd0 0xce64ac7105 pthread_thread_entry(void*, void*) + 0x15 00000000 0x7faeead5b258 commpage_thread_exit + 0
comment:6 by , 18 months ago
So we're looking at this lock, I guess:
https://cgit.haiku-os.org/haiku/tree/src/system/libroot/posix/malloc_hoard2/processheap.cpp#n203
Notice how the superblock is retrieved from the memory block being freed:
superblock *sb = b->getSuperblock();
This means a corrupt block (either because some data was overwritten, or because the software is trying to free memory that was not allocated by malloc) will result in a pointer not pointing to a superblock at all.
There is an assert to check if the superblock is valid, but the validation done is quite weak:
- Two values (numBlocks and sizeClass) must be greater than zero (effectively checking only the sign bit)
- numAvailable must be less than numBlocks
(block::isValid which is called earlier in the function is even worse, it is just a "return 1" and there isn't much more that can be checked)
HEAP_DEBUG is not set so we don't have the _magic field which could be used for a more reliable check. And the isValid function in the superblock does not even try to uses that anyways.
So, is it possible that we are simply looking at an invalid superblock, and the value being used is not at all a mutex, because we didn't actually get a pointer to a superblock?
comment:7 by , 17 months ago
Blocking: | 18472 added |
---|
comment:8 by , 17 months ago
Blocking: | 18477 added |
---|
comment:9 by , 17 months ago
If that were the case, we would have problems occur earlier, because the value wouldn't be an expected mutex value, I think?
Like I noted above in the first comment, using a pthread_mutex instead confirms that the "owner" field is modified by something else between when we lock and when we unlock.
comment:10 by , 17 months ago
because the value wouldn't be an expected mutex value
What do you call an "expected mutex value" and where would be the code checking for it?
There are not many checks inside __mutex_unlock
. Any value with B_USER_MUTEX_WAITING == 0
and B_USER_MUTEX_LOCKED == 1
will hit this assert. So that's just testing 2 bits, if the value is random, you have one chance out of 4 of hitting this.
comment:11 by , 17 months ago
As I noted above, I tested this with pthread mutexes instead, which have an "owner" field that is checked both in the lock and unlock routines. In unlock, it doesn't assert by default, but I upgraded that in my test builds.
The assertion in the lock routine does not fire (so, at least it looks like a valid mutex at that point), but it does in the unlock (so, it has been overwritten between when we lock it and when we unlock it.)
comment:12 by , 17 months ago
Milestone: | Unscheduled → R1/beta5 |
---|
I downgraded the assertion into a syslog print in hrev57121. Adding to b5 milestone so we don't forget about this, though.
comment:14 by , 10 months ago
Some more logging shows that it is indeed the "hoard superblock" lock and unlocking without the lock held comes from mutex_init_etc being called while the lock is held. The only initialization I've seen is in the superblock constructor, which is called with a specific buffer allocation. I guess there's a chance for that to happen if the superblock memory is returned to the pool inside the freeBlock
call in the only section that seems to be protected by that lock (the one in processHeap::free
).
comment:15 by , 6 months ago
Milestone: | R1/beta5 → R1/beta6 |
---|
comment:16 by , 6 months ago
Milestone: | R1/beta6 → R1/beta5 |
---|
Please don't remove things from the milestone without a bit of discussion.
It would be really nice to take a closer look at these things. They are not set as blockers so there isn't an urgent need to move them yet. We can do so when the release is shipped.
I appreciate the efforts to move things forwards, but doing so just by moving the bugs to another milestone isn't really making progress...
comment:17 by , 2 months ago
Blocked By: | 13554 added |
---|
Things I have tried so far:
~superblock()
(but is this ever called?) No results.