Opened 17 months ago

Last modified 7 weeks ago

#18451 new bug

Assertion failure (mutex was not actually locked) in libroot hoard malloc

Reported by: waddlesplash Owned by: nobody
Priority: normal Milestone: R1/beta6
Component: System/libroot.so Version: R1/beta4
Keywords: Cc:
Blocked By: #13554 Blocking: #18472, #18477
Platform: All

Description

This triggers unreliably (1 in 5-10 or so) with certain applications, following the addition of those assertions (the new mutex code is not needed to trigger it.)

The stack trace after the assert is always just this:

		0x7f6d30f69b60	0xce64b42d98	BPrivate::processHeap::free(void*) + 0x138 
		0x7f6d30f69b90	0xce64b43f84	free + 0x44 
...

Change History (18)

comment:1 by waddlesplash, 17 months ago

Things I have tried so far:

  1. using pthread_mutex instead of our own built-in mutex, and changing its error checks to assert()s also. This just makes the assertion failure be always:
    		state: Call (mutex->owner == find_thread(NULL))
    
  1. Using the "unused" area as a flags field, and trying to catch any unlock()s when a flag is set. No results.
  1. Adding assertions to ~superblock() (but is this ever called?) No results.

comment:2 by waddlesplash, 17 months ago

The most reliable ways to trigger this problem are with QtWebEngine and GTKWebKit. With QtWebEngine, CanvasMark is a reliable reproducer; with GTKWebKit, Haiku's own forums seem to trigger it.

comment:3 by waddlesplash, 17 months ago

Oho, it's possible to trigger it using WebPositive and trying to access the Haiku forums also, not just GTKWebKit.

comment:4 by pulkomandy, 17 months ago

The full stacktrace would be useful. This may just be some kind of heap corruption.

in reply to:  4 comment:5 by korli, 17 months ago

Replying to pulkomandy:

The full stacktrace would be useful. This may just be some kind of heap corruption.

Example here.

	thread 3597: pthread func 
		state: Call (mutex was not actually locked!)

		Frame		IP			Function Name
		-----------------------------------------------
		00000000	0xce64ab8da7	_kern_debugger + 0x7 
			Disassembly:
				_kern_debugger:
				0x000000ce64ab8da0:   48c7c0e5000000  mov $0xe5, %rax
				0x000000ce64ab8da7:             0f05  syscall  <--

		0x7f6d30f69b60	0xce64b42d98	BPrivate::processHeap::free(void*) + 0x138 
		0x7f6d30f69b90	0xce64b43f84	free + 0x44 
		0x7f6d30f69bf0	0x14afeb6e206	_ZNSt6vectorISt4pairIt13scoped_refptrIN2cc4TaskEEESaIS5_EE17_M_realloc_insertIJS5_EEEvN9__gnu_cxx17(, ) + 0x116 
		0x7f6d30f69c60	0x14afeb703c9	cc::TaskGraphWorkQueue::GetNextTaskToRun(unsigned short) + 0x269 
		0x7f6d30f69d30	0x14b0059dd0b	content::CategorizedWorkerPool::RunTaskInCategoryWithLockAcquired(cc::TaskCategory) + 0x2b 
		0x7f6d30f69d80	0x14b0059dfea	content::CategorizedWorkerPool::Run(std::vector<cc::TaskCategory, std::allocator<cc::TaskCategory> > const&, base::ConditionVariable*) + 0x6a 
		0x7f6d30f69db0	0x14afe2e5b48	base::_GLOBAL__N_1::ThreadFunc(void*) + 0x48 
		0x7f6d30f69dd0	0xce64ac7105	pthread_thread_entry(void*, void*) + 0x15 
		00000000	0x7faeead5b258	commpage_thread_exit + 0 

comment:6 by pulkomandy, 17 months ago

So we're looking at this lock, I guess:

https://cgit.haiku-os.org/haiku/tree/src/system/libroot/posix/malloc_hoard2/processheap.cpp#n203

Notice how the superblock is retrieved from the memory block being freed:

superblock *sb = b->getSuperblock();

This means a corrupt block (either because some data was overwritten, or because the software is trying to free memory that was not allocated by malloc) will result in a pointer not pointing to a superblock at all.

There is an assert to check if the superblock is valid, but the validation done is quite weak:

  • Two values (numBlocks and sizeClass) must be greater than zero (effectively checking only the sign bit)
  • numAvailable must be less than numBlocks

(block::isValid which is called earlier in the function is even worse, it is just a "return 1" and there isn't much more that can be checked)

HEAP_DEBUG is not set so we don't have the _magic field which could be used for a more reliable check. And the isValid function in the superblock does not even try to uses that anyways.

So, is it possible that we are simply looking at an invalid superblock, and the value being used is not at all a mutex, because we didn't actually get a pointer to a superblock?

comment:7 by waddlesplash, 17 months ago

Blocking: 18472 added

comment:8 by waddlesplash, 17 months ago

Blocking: 18477 added

comment:9 by waddlesplash, 17 months ago

If that were the case, we would have problems occur earlier, because the value wouldn't be an expected mutex value, I think?

Like I noted above in the first comment, using a pthread_mutex instead confirms that the "owner" field is modified by something else between when we lock and when we unlock.

comment:10 by pulkomandy, 17 months ago

because the value wouldn't be an expected mutex value

What do you call an "expected mutex value" and where would be the code checking for it?

There are not many checks inside __mutex_unlock. Any value with B_USER_MUTEX_WAITING == 0 and B_USER_MUTEX_LOCKED == 1 will hit this assert. So that's just testing 2 bits, if the value is random, you have one chance out of 4 of hitting this.

comment:11 by waddlesplash, 16 months ago

As I noted above, I tested this with pthread mutexes instead, which have an "owner" field that is checked both in the lock and unlock routines. In unlock, it doesn't assert by default, but I upgraded that in my test builds.

The assertion in the lock routine does not fire (so, at least it looks like a valid mutex at that point), but it does in the unlock (so, it has been overwritten between when we lock it and when we unlock it.)

comment:12 by waddlesplash, 16 months ago

Milestone: UnscheduledR1/beta5

I downgraded the assertion into a syslog print in hrev57121. Adding to b5 milestone so we don't forget about this, though.

comment:13 by waddlesplash, 9 months ago

Apparently the test cases in #6673 trigger this quite reliably.

comment:14 by madmax, 9 months ago

Some more logging shows that it is indeed the "hoard superblock" lock and unlocking without the lock held comes from mutex_init_etc being called while the lock is held. The only initialization I've seen is in the superblock constructor, which is called with a specific buffer allocation. I guess there's a chance for that to happen if the superblock memory is returned to the pool inside the freeBlock call in the only section that seems to be protected by that lock (the one in processHeap::free).

comment:15 by nephele, 5 months ago

Milestone: R1/beta5R1/beta6

comment:16 by pulkomandy, 5 months ago

Milestone: R1/beta6R1/beta5

Please don't remove things from the milestone without a bit of discussion.

It would be really nice to take a closer look at these things. They are not set as blockers so there isn't an urgent need to move them yet. We can do so when the release is shipped.

I appreciate the efforts to move things forwards, but doing so just by moving the bugs to another milestone isn't really making progress...

comment:17 by waddlesplash, 7 weeks ago

Blocked By: 13554 added

comment:18 by waddlesplash, 7 weeks ago

Milestone: R1/beta5R1/beta6

move remaining tickets to beta6

Note: See TracTickets for help on using tickets.