Opened 15 months ago

Closed 15 months ago

Last modified 6 months ago

#15211 closed bug (fixed)

KDL in MultiLocker::ReadLock()

Reported by: 3dEyes Owned by: waddlesplash
Priority: normal Milestone: R1/beta2
Component: System/Kernel Version: R1/Development
Keywords: Cc: mmlr, korli
Blocked By: Blocking: #15227, #15229
Platform: All

Description (last modified by 3dEyes)

hrev53313+1 I got a spontaneous KDL (twice). The issue may be related to changes in the hrev53305 commit.

KDL photo:

Attachments (5)

photo_2019-07-30_20-35-45.jpg (125.4 KB ) - added by 3dEyes 15 months ago.
photo_2019-08-01_21-41-59.jpg (195.2 KB ) - added by 3dEyes 15 months ago.
previous_syslog (423.2 KB ) - added by 3dEyes 15 months ago.
photo_2019-08-01_21-41-05.jpg (150.8 KB ) - added by 3dEyes 15 months ago.
syslog (130.6 KB ) - added by diver 15 months ago.

Download all attachments as: .zip

Change History (32)

by 3dEyes, 15 months ago

comment:1 by 3dEyes, 15 months ago

Description: modified (diff)

comment:2 by 3dEyes, 15 months ago

Description: modified (diff)

comment:3 by waddlesplash, 15 months ago

Please attach a syslog.

comment:4 by waddlesplash, 15 months ago

Cc: mmlr korli added
Component: - GeneralSystem/Kernel

_user_block_thread only hits one mutex -- the thread's. So I'm not sure how this is somehow turning into a GPE; the thread mutex should obviously not be destroyed or otherwise altered while the thread is alive. (And since this is in unlock, the mutex lock obviously did not cause a problem.)

The "isr_array (nearest)" seems suspicious. That's presumably here. Are we running past the end of it or something?

CC korli and mmlr: do either of you have an idea as to what might be occurring here? This is a little over my head, still.

comment:5 by korli, 15 months ago

0xcccccccc00000001 is in the non-canonical address hole on x86_64, accessing it triggers a GPE.

From https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf :

3.3.7.1 Canonical Addressing
In 64-bit mode, an address is considered to be in canonical form if address bits 63 through to the most-significant
implemented bit by the microarchitecture are set to either all ones or all zeros.

Intel 64 architecture defines a 64-bit linear address. Implementations can support less. The first implementation of
IA-32 processors with Intel 64 architecture supports a 48-bit linear address. This means a canonical address must
have bits 63 through 48 set to zeros or ones (depending on whether bit 47 is a zero or one).

Although implementations may not use all 64 bits of the linear address, they should check bits 63 through the
most-significant implemented bit to see if the address is in canonical form. If a linear-memory reference is not in
canonical form, the implementation should generate an exception. In most cases, a general-protection exception
(#GP) is generated. However, in the case of explicit or implied stack references, a stack fault (#SS) is generated. 
Last edited 15 months ago by korli (previous) (diff)

comment:6 by waddlesplash, 15 months ago

Ah. Well, that address looks like garbage, where did it come from? Is this memory corruption, maybe?

threedeyes, can you reproduce this?

comment:7 by 3dEyes, 15 months ago

It all started after the upgrade to the specified revision. The KDL is spontaneously in various situations. The first time I started the recipe build: hp wonderbrush (at the time of downloading the archive). The second time this happened was when I worked in Pe and switched the active window to the Kate editor. I did a memory test - everything is ok. I could not get syslog yet - I will try to save it if it happens again.

comment:8 by mmlr, 15 months ago

The address contains 0xcccccccc which would match the "uninitialized memory" pattern written by the normal allocator. Since the lock cannot really be uninitialized (as one would expect the corresponding lock to fail in that case), this might be a use-after-free case where the allocation has been handed out again already. Is it possible that the thread went away due to a missing reference or some such?

The isr_array (nearest) is a red herring as (nearest) always just means that the symbol lookup did not find an actual match. This can happen in some generated and/or optimized code and usually means that it's not actually where it suggest it might be.

by 3dEyes, 15 months ago

by 3dEyes, 15 months ago

Attachment: previous_syslog added

comment:9 by 3dEyes, 15 months ago

I got the same KDL again. Screenshot and syslog attached.

comment:10 by waddlesplash, 15 months ago

This is actually the opposite panic; it's occuring on unlock, not lock. And it's not a GPE, it seems to be one of the assert failures in the mutex code; but you've typed bt again (why? There's already a backtrace printed automatically...) and so the actual message is cut off.

This is much more interesting than the last one. The thread_block syscall only operates on the current thread, which there really is no way of destroying while it is running, and so the thread structure getting corrupted is about the only option. Since this is now occurring inside GetAndLock, this means the thread is still in the hash, but the mutex is indeed destroyed or corrupted. So then this is pretty much proof there is memory corruption.

threedeyes, can you please do a guarded heap build of the kernel, and additionally change the thread structs to be allocated from the heap rather than their object cache? That should hopefully catch whatever is corrupting memory here.

by 3dEyes, 15 months ago

comment:11 by 3dEyes, 15 months ago

but you've typed bt again (why? There's already a backtrace printed automatically...)

Ooops, sorry. Original photo attached.

can you please do a guarded heap build of the kernel, and additionally change the thread structs to be allocated from the heap rather than their object cache?

I try.

So then this is pretty much proof there is memory corruption.

I've run the memory test three times , it's okay.

comment:12 by waddlesplash, 15 months ago

Oh, I doubt it's a case of bad memory, it's probably a memory bug in the kernel itself.

comment:13 by mmlr, 15 months ago

The panic message says that the mutex is being released from the wrong thread. The thread ids 901 and 903 are reasonably close, so that it seems plausible. I would not conclude memory corruption from this. If it is indeed the case that the locks are being triggered from the wrong threads, all sorts of strange things may happen.

comment:14 by waddlesplash, 15 months ago

That also does not make sense here. How is this one use case triggering the problem?

comment:15 by tqh, 15 months ago

I don't think the current multilocker looks safe. The check if it is locked and the action afterward is not done in any kind of transaction. No lock, no mem barrier. So when the action is done it might have already changed state by another thread. If it is intentional that several threads see it as unlocked and then start locking it should be documented. I also see problems if several threads would call destruction.

I don't think these kind of changes should be done this lightly, and I would like to see some tests to guarantee that it has been proven to work before commited.

You should be able to test this with mock threads and sleeps right after each lock check return otherwise you will have subtle errors like this where lock thread will be updated as soon as sem is relaased as others hang there.

comment:16 by waddlesplash, 15 months ago

I'm confused. The new MultiLocker just uses rw_lock; is that what you are referring to that has problems here? My new MultiLocker implementation definitely doesn't need memory barriers or the like, the rw_lock should be more than enough.

Even if those things were somehow a problem here, this is crashing in the kernel. Under no circumstances should userland be able to trigger an assert failure or GPE or anything else like that in the kernel. So this is a kernel bug no matter what.

comment:17 by tqh, 15 months ago

So you ask should I grab the lock or do something else depending on the state. The state however changes at any point as any other thread may lock or unlock in the meantime.

If it is not locked then you should lock, but another thread asked the same and got the same answer. So you have two threads that think that they are the ones that should lock. If it was done as an transaction (ie protected by a lock) the other thread would see that it was locked already).

Look at Ingo's RWLocker for how he has a lock to protect and do the action in one transaction.

comment:18 by waddlesplash, 15 months ago

I still don't know what you are referring to here. The IsWriteLocked is not used to determine if the lock is "busy", it is used to determine if the current thread holds the lock. If some other thread holds the lock, it returns false.

comment:19 by tqh, 15 months ago

so it is an isWriteLockOwner function?

comment:20 by waddlesplash, 15 months ago

It is the equivalent of BLocker::IsLocked which has identical semantics.

comment:21 by tqh, 15 months ago

Yes, unfortunate naming. I looked at the debug functions as well, not the proper ones. using a rw lock looks perfect. But do we want to define DEBUG if it is not defined?

comment:22 by waddlesplash, 15 months ago

Blocking: 15227 added

comment:23 by waddlesplash, 15 months ago

Blocking: 15229 added

comment:24 by waddlesplash, 15 months ago

I've just closed #15227 and #15229 as duplicates of this. #15227 appears to be nearly identical to the second KDL above; #15229 has the same message, but interestingly it is occuring inside the port heap, not the user-thread syscalls. So that is especially odd.

comment:25 by diver, 15 months ago

hrev53340+2 32bit.

All of a sudden got:

KERN: PANIC: ASSERT FAILED (../../haiku/src/system/kernel/locks/lock.cpp:785): lock->holder == waiter.thread->id.

The panic was continuable, so I saved a syslog,

by diver, 15 months ago

Attachment: syslog added

comment:26 by waddlesplash, 15 months ago

Resolution: fixed
Status: assignedclosed

Fixed in hrev53345.

comment:27 by nielx, 6 months ago

Milestone: UnscheduledR1/beta2

Assign tickets with status=closed and resolution=fixed within the R1/beta2 development window to the R1/beta2 Milestone

Note: See TracTickets for help on using tickets.