#16175 closed bug (fixed)
Various KDL when using contextual menu in Tracker
Reported by: | TheClue | Owned by: | nobody |
---|---|---|---|
Priority: | normal | Milestone: | R1/beta2 |
Component: | System/Kernel | Version: | R1/Development |
Keywords: | | Cc: | mmlr |
Blocked By: | | Blocking: | |
Platform: | All | | |
Description
I've started hitting several different KDLs on my system that I never saw before.
All of them happen when I use the contextual menu in Tracker to open a file in StyledEdit while these applications are open:
- notepadqq
- webpositive
- otter
- terminal
- vision
But the messages differ, so I cannot be sure these KDLs are related (this one in particular seems to be a Web+ one).
I was not able to screenshot all of them (now I can), so I'm starting with this one. I'll attach others as they come.
Attachments (10)
Change History (24)
by , 4 years ago
Attachment: | photo_2020-05-31_14-57-42.jpg added |
---|
comment:1 by , 4 years ago
Description: | modified (diff) |
---|
comment:2 by , 4 years ago
Description: | modified (diff) |
---|
comment:3 by , 4 years ago
If you see the "Invalid Opcode Exception" again, please run "sc -d" at the KDL prompt and take a photo of that.
follow-up: 5 comment:4 by , 4 years ago
Cc: | added |
---|---|
Keywords: | kdl removed |
Platform: | x86-64 → All |
mmlr's recent changes are the most likely culprit. What hrev are you running, TheClue?
comment:5 by , 4 years ago
Replying to waddlesplash:
mmlr's recent changes are the most likely culprit. What hrev are you running, TheClue?
by , 4 years ago
Attachment: | photo_2020-05-31_19-36-28.jpg added |
---|
Another KDL. Again, when using the contextual menu for a StyledEdit file
comment:6 by , 4 years ago
Interesting. The message in the last screen is cut off; I presume it was again a "General Protection Exception"?
It's not the same exception, but both would be something you get from executing bad memory. That would suggest that some memory range got mis-mapped or some caches aren't properly invalidated.
It is somewhat suspicious that you hit it twice in a kernel-side allocation right during page fault handling. The possibly related changes are mostly in the user address space handling, so even if there was a grave error that messes up the user address space, it shouldn't fault in the kernel afterwards.
Maybe the AreaRangeIterator doesn't iterate properly, but I re-reviewed it and I don't see how that would happen.
Could also be the VMCache::_FreePageRange() refactor freeing something it shouldn't, but in the worst case you should get new page faults, not GPEs or invalid opcodes.
Generally all the changes are hit pretty commonly so it's odd that I didn't see anything in my testing. I've stress tested these pretty well in various memory conditions and use cases. Maybe it exposes a preexisting race condition that is now more likely because LookupArea is faster or some such. What hardware is this on?
#16168 may be related, but there's not enough information there to know.
by , 4 years ago
Attachment: | photo_2020-06-01_00-05-49.jpg added |
---|
sc -d output (I think I could continue for the whole night...)
by , 4 years ago
Attachment: | photo_2020-06-01_00-40-57.jpg added |
---|
Invalid opcode exception (again, using the context menu)
comment:8 by , 4 years ago
There's no good reason to suspect disk corruption from these KDLs. mmlr's analysis above gives no reason to suspect it either. Additionally, based on the IRC discussion between mmlr, TheClue, and me, it seems that this is some kind of race.
comment:9 by , 4 years ago
Finally I was able to get a GPE that could be more useful for the diagnosis. In fact, both of the other running threads were trapped in kernel space. Screenshots attached.
by , 4 years ago
Attachment: | photo_2020-06-02_00-02-45.jpg added |
---|
Other thread/1 trapped in kernel space: empty_magazine
by , 4 years ago
Attachment: | photo_2020-06-02_00-04-07.jpg added |
---|
Other thread/2 trapped in kernel space: map_page
comment:10 by , 4 years ago
I was able to find this with the kernel guarded heap. It turned out to be an off-by-one error, which should be fixed by this change:
https://review.haiku-os.org/c/haiku/+/2860
So it indeed was the "open with" context menu, due to it running a query.
As for why it was so reliably reproducible on your end, that was probably related to the length of the query. It would need to be a certain length to actually run over the allocation, due to implicit alignment and slab sizes. The query length depends on the installed apps that claim support for the MIME type.
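A rough illustration of the alignment point above, not taken from the Haiku sources: the size classes are made up for the sketch, and it assumes the bug wrote exactly one byte more than was allocated (the ticket does not describe the exact write). An allocator that rounds requests up to fixed size classes leaves slack after most buffers, so the extra byte only escapes the slab when the requested length exactly fills its size class.

```python
# Hypothetical slab size classes in bytes; NOT Haiku's real object cache sizes.
SIZE_CLASSES = [16, 32, 64, 128, 256, 512, 1024]


def slab_size_for(requested: int) -> int:
    """Return the smallest size class that can hold `requested` bytes."""
    for size in SIZE_CLASSES:
        if requested <= size:
            return size
    raise ValueError("request larger than the largest size class")


def off_by_one_overruns(length: int) -> bool:
    """Model a caller that allocates `length` bytes but writes `length + 1`.

    Returns True only if the extra byte lands outside the backing slab.
    """
    return length + 1 > slab_size_for(length)


if __name__ == "__main__":
    for length in (60, 63, 64, 65):
        print(length, slab_size_for(length), off_by_one_overruns(length))
```

In this model only lengths that land exactly on a size class boundary corrupt neighbouring memory, which is consistent with a bug that shows up on one system and stays hidden on another with a different query length.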
It is still somewhat curious that it apparently would always corrupt the page mapping object cache structures. Obviously the result of corrupting random kernel heap data is pretty unpredictable.
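For readers unfamiliar with the idea behind a guarded heap, here is a toy model of the general technique. It is a simplification and not Haiku's implementation: the class name, the 4096-byte page size, and the exact placement of the buffer are assumptions for the sketch. Each allocation is placed so that it ends at a page boundary, and the following page is left inaccessible, so even a one-byte overrun faults immediately instead of silently corrupting a neighbour.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class GuardedAllocation:
    """Toy model of one guarded allocation (hypothetical, for illustration).

    The buffer is placed so that its last byte is the last byte of a page;
    the next page is treated as an inaccessible guard page.
    """

    def __init__(self, size: int):
        if not 0 < size <= PAGE_SIZE:
            raise ValueError("sketch only handles sizes up to one page")
        self.size = size
        self.start = PAGE_SIZE - size   # buffer ends exactly at the page end
        self.guard_start = PAGE_SIZE    # first byte of the guard page

    def write(self, offset: int, length: int) -> None:
        """Simulate writing `length` bytes at `offset` into the buffer."""
        if offset < 0 or length < 0:
            raise ValueError("offset and length must be non-negative")
        end = self.start + offset + length
        if end > self.guard_start:
            # A real guarded heap would take a page fault here.
            raise MemoryError("write touched the guard page")


if __name__ == "__main__":
    buf = GuardedAllocation(100)
    buf.write(0, 100)       # exactly fills the buffer: fine
    try:
        buf.write(0, 101)   # one byte too many: caught at once
    except MemoryError as error:
        print("caught:", error)
```

The trade-off is memory and speed: every allocation costs at least a page plus a guard page, which is why such a heap is normally a debugging option and not the default allocator.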
comment:12 by , 4 years ago
hrev54287, problem seems solved here. I'll keep testing during regular use, though.
Let me congratulate you all. Such a nasty, deeply buried, exotic bug, spotted and fixed in less than 2 days. That's magic :O
comment:13 by , 4 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
No, your KDL pictures from yesterday look like the same thing to me anyway. Excellent!
comment:14 by , 4 years ago
Milestone: | Unscheduled → R1/beta2 |
---|
Assign tickets with status=closed and resolution=fixed within the R1/beta2 development window to the R1/beta2 Milestone
(final time)