Opened 4 years ago

Closed 4 years ago

Last modified 4 years ago

#16175 closed bug (fixed)

Various KDL when using contextual menu in Tracker

Reported by: TheClue
Owned by: nobody
Priority: normal
Milestone: R1/beta2
Component: System/Kernel
Version: R1/Development
Keywords:
Cc: mmlr
Blocked By:
Blocking:
Platform: All

Description (last modified by TheClue)

I've started to see several different KDLs on my system that I never encountered before.

All of them happen when I use the contextual menu in Tracker to open a file in StyledEdit while these applications are open:

  • notepadqq
  • webpositive
  • otter
  • terminal
  • vision

But the messages differ, so I cannot be sure these KDLs are related (this one in particular seems to be a Web+ one, indeed).

I was not able to screenshot all of them (now I can), so I'm starting with this one. I'll attach others as they come.

Attachments (10)

photo_2020-05-31_14-57-42.jpg (275.1 KB ) - added by TheClue 4 years ago.
KDL 1
photo_2020-05-31_15-40-18.jpg (166.7 KB ) - added by TheClue 4 years ago.
Again, using the contextual menu
photo_2020-05-31_19-36-28.jpg (126.4 KB ) - added by TheClue 4 years ago.
Another KDL. Again, when using the contextual menu for a StyledEdit file
syslog (307.4 KB ) - added by TheClue 4 years ago.
syslog, business as usual here
listdev.txt (2.4 KB ) - added by TheClue 4 years ago.
listdev
photo_2020-06-01_00-05-17.jpg (171.0 KB ) - added by TheClue 4 years ago.
contextual menu...again…
photo_2020-06-01_00-05-49.jpg (181.2 KB ) - added by TheClue 4 years ago.
sc -d output (i think i could continue for the whole night...)
photo_2020-06-01_00-40-57.jpg (205.9 KB ) - added by TheClue 4 years ago.
Invalid opcode exception (again, using the context menu)
photo_2020-06-02_00-02-45.jpg (274.8 KB ) - added by TheClue 4 years ago.
Other thread/1 trapped in kernelspace: empty_magazine
photo_2020-06-02_00-04-07.jpg (180.2 KB ) - added by TheClue 4 years ago.
Other thread/2 trapped in kernel space: map_page


Change History (24)

by TheClue, 4 years ago

KDL 1

comment:1 by TheClue, 4 years ago

Description: modified (diff)

comment:2 by TheClue, 4 years ago

Description: modified (diff)

by TheClue, 4 years ago

Again, using the contextual menu

comment:3 by waddlesplash, 4 years ago

If you see the "Invalid Opcode Exception" again, please run "sc -d" at the KDL prompt and take a photo of that.

comment:4 by waddlesplash, 4 years ago

Cc: mmlr added
Keywords: kdl removed
Platform: x86-64 → All

mmlr's recent changes are the most likely culprit. What hrev are you running, TheClue?

comment:5 by TheClue, 4 years ago (in reply to: comment:4)

Replying to waddlesplash:

mmlr's recent changes are the most likely culprit. What hrev are you running, TheClue?

hrev54278

by TheClue, 4 years ago

Another KDL. Again, when using the contextual menu for a StyledEdit file

comment:6 by mmlr, 4 years ago

Interesting. The message in the last screenshot is cut off; I presume it was again a "General Protection Exception"?

It's not the same exception, but both would be something you get from executing bad memory. That would suggest that some memory range got mis-mapped or some caches aren't properly invalidated.

It is somewhat suspicious that you have it twice in a kernel side allocation right during page fault handling. The possibly related changes are mostly on the user address space handling, so even if there was a grave error that messes up the user address space, it shouldn't fault in the kernel afterwards.

Maybe the AreaRangeIterator doesn't iterate properly, but I re-reviewed it and I don't see how that would happen.

Could also be the VMCache::_FreePageRange() refactor freeing something it shouldn't, but in the worst case you should get new page faults, not GPEs or invalid opcodes.

Generally all the changes are hit pretty commonly so it's odd that I didn't see anything in my testing. I've stress tested these pretty well in various memory conditions and use cases. Maybe it exposes a preexisting race condition that is now more likely because LookupArea is faster or some such. What hardware is this on?

#16168 may be related, but there's not enough information there to know.

by TheClue, 4 years ago

Attachment: syslog added

syslog, business as usual here

by TheClue, 4 years ago

Attachment: listdev.txt added

listdev

by TheClue, 4 years ago

contextual menu...again...

by TheClue, 4 years ago

sc -d output (i think i could continue for the whole night...)

by TheClue, 4 years ago

Invalid opcode exception (again, using the context menu)

comment:7 by X512, 4 years ago

Can you run checkfs -c /boot?

comment:8 by waddlesplash, 4 years ago

There's no good reason to suspect disk corruption from these KDLs. mmlr's analysis above also gives no reason to suspect it. Additionally, based on the discussion mmlr, TheClue, and I had on IRC, it seems that this is some kind of race.

comment:9 by TheClue, 4 years ago

Finally I was able to get a GPE that could be more useful for the diagnosis. In fact, both of the other two running threads were trapped in kernel space. Screenshots attached.

by TheClue, 4 years ago

Other thread/1 trapped in kernelspace: empty_magazine

by TheClue, 4 years ago

Other thread/2 trapped in kernel space: map_page

comment:10 by mmlr, 4 years ago

I was able to find this with the kernel guarded heap. It turned out to be an off-by-one error, which should be fixed by this change:

https://review.haiku-os.org/c/haiku/+/2860
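
To illustrate why a guarded heap tends to catch this class of bug, here is a minimal sketch in Python. It is a toy model and not Haiku's implementation: PAGE_SIZE, guarded_placement and write_faults are hypothetical names, and alignment requirements are ignored. The idea modeled is that each allocation is placed so that it ends exactly where an inaccessible guard page begins, so even a one-byte overrun faults at the moment of the write instead of silently corrupting a neighboring object.

```python
PAGE_SIZE = 4096  # assumed page size for this toy model


def guarded_placement(size, base=0):
    """Place an allocation so that it ends right before a guard page.

    Returns (start, guard_start). The allocation occupies the byte
    range [start, guard_start); the guard page begins at guard_start.
    Alignment is ignored here, which a real allocator cannot do.
    """
    pages = (size + PAGE_SIZE - 1) // PAGE_SIZE  # round up to whole pages
    guard_start = base + pages * PAGE_SIZE
    return guard_start - size, guard_start


def write_faults(size, bytes_written):
    """True if writing bytes_written bytes from the start of the
    allocation would touch the guard page."""
    start, guard_start = guarded_placement(size)
    return start + bytes_written > guard_start
```

In this model, write_faults(100, 100) is False while write_faults(100, 101) is True: the very first byte past the end lands on the guard page.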

So it indeed was the "open with" context menu, due to it running a query.

As for why it was so well reproducible on your end, that was probably related to the length of the query. It would need to be a certain length to actually run over the allocation due to implicit alignment and slab sizes. The query length depends on the installed apps that claim support for the mimetype.
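
The dependence on query length can be sketched with a toy model in Python. The size classes below are made up, and the bug shape (writing exactly one byte more than was requested) is an assumption for illustration; the ticket only states that it was an off-by-one error. All function names are hypothetical.

```python
# Hypothetical slab size classes; the real kernel's classes may differ.
SIZE_CLASSES = [16, 32, 64, 128, 256]


def slot_size(requested):
    """Return the smallest size class that can hold the request."""
    for size in SIZE_CLASSES:
        if requested <= size:
            return size
    raise ValueError("request too large for this toy allocator")


def overrun_escapes_slot(requested, bytes_written):
    """True if the write spills past the slot backing the request."""
    return bytes_written > slot_size(requested)


def off_by_one_is_harmful(query_length):
    # Assumed bug shape: the buffer is sized for query_length bytes,
    # but one extra byte is written past the requested size.
    return overrun_escapes_slot(query_length, query_length + 1)
```

In this model only lengths that exactly fill a slot (16, 32, 64, ...) spill into the next object; every other length lands in unused slack inside the same slot. That would be consistent with the crash depending on which apps are installed, since those determine the query length.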

It is still somewhat curious that it apparently would always corrupt the page mapping object cache structures. Obviously the result of corrupting random kernel heap data is pretty unpredictable.

comment:11 by waddlesplash, 4 years ago

Change merged in hrev54284. TheClue, please retest under that.

comment:12 by TheClue, 4 years ago

hrev54287, problem seems solved here. I'll keep testing during regular use, though.

Let me congratulate you all. Such a nasty, deeply buried, exotic bug, spotted and fixed in less than 2 days. That's magic :O

BTW, I guess the other KDL I suffered yesterday (last two screenshots provided) is not related to this (it happened during shutdown).


comment:13 by waddlesplash, 4 years ago

Resolution: fixed
Status: new → closed

No, your KDL pictures from yesterday look like the same thing to me anyway. Excellent!

comment:14 by nielx, 4 years ago

Milestone: Unscheduled → R1/beta2

Assign tickets with status=closed and resolution=fixed within the R1/beta2 development window to the R1/beta2 Milestone

(final time)
