Opened 9 months ago

Last modified 6 months ago

#9858 assigned bug

Ripping CDs -> page fault panic

Reported by: ttcoder Owned by: nobody
Priority: high Milestone: R1/beta1
Component: System/Kernel Version: R1/Development
Keywords: slab Cc:
Blocked By: Blocking:
Has a Patch: no Platform: All

Description

hrev45824

Our CD-ripping/encoding application is more stable with recent hrevs, but we still managed to crash the kernel. It's getting harder to reproduce though, this time it occured after ripping 7 CDs, for dsuden.

Attachments (8)

panic_while_ripping_CDs_#9858.JPG (621.7 KB) - added by ttcoder 9 months ago.
KDL on the seventh CD rip
oneCD_panic_DSC03323.JPG (168.3 KB) - added by ttcoder 9 months ago.
Toward the end of first CDDA rip (panic from Tracker thread)
oneCD_lots-of-vnodes.jpg (210.4 KB) - added by ttcoder 9 months ago.
One-CDDA panic: there were 46'000 vnodes
oneCD_dozens-of-pages-listing.jpg (101.5 KB) - added by ttcoder 9 months ago.
One-CDDA panic: endless listing
panic_DisabledSMP.jpg (806.4 KB) - added by ttcoder 9 months ago.
Disabling SMP makes it more difficult to panic, but it's still possible
panic_DisabledSMP.png (1.2 MB) - added by ttcoder 9 months ago.
Take two, as a PNG (previous was a JPEG2000 generated in hrev45824)
15CDs_corrupted_BKernel_list_quitting-encoder.jpg (228.6 KB) - added by ttcoder 6 months ago.
Winding down after the 15 CDs encoding session
dsuden_f2a55m_hrev46188__shifted_0xdeadbeef!.jpg (607.3 KB) - added by ttcoder 6 months ago.
Different KDL message and (interestingly) the bad address is -> 0xdeadbeef <- !

Change History (31)

Changed 9 months ago by ttcoder

KDL on the seventh CD rip

comment:1 Changed 9 months ago by anevilyak

  • Component changed from - General to System/Kernel
  • Keywords vm iocache added
  • Owner changed from nobody to bonefish
  • Status changed from new to assigned

comment:2 Changed 9 months ago by anevilyak

  • Keywords iocache removed

comment:3 Changed 9 months ago by kallisti5

  • Milestone changed from R1 to R1/beta1

setting beta1. I (as well as quite a few others i'm guessing) would like to see these vm / memory management issues solved before the next alpha / beta release. As package management is coming, these kinds of things will stand out more and more. (not that I would know how to fix them though _)

comment:4 Changed 9 months ago by bonefish

  • Keywords slab added; vm removed
  • Owner changed from bonefish to nobody

Looks like someone has clobbered the slab allocator's internal data structures. Unfortunately not much more can be said with the available information.

comment:5 Changed 9 months ago by kallisti5

  • Priority changed from normal to high

Changed 9 months ago by ttcoder

Toward the end of first CDDA rip (panic from Tracker thread)

Changed 9 months ago by ttcoder

One-CDDA panic: there were 46'000 vnodes

Changed 9 months ago by ttcoder

One-CDDA panic: endless listing

comment:6 Changed 9 months ago by ttcoder

Tried it here and managed to panic toward the end of the very first CD rip, as I was not so gentle as dsuden (MediaPlayer playing music in background, toying with the tracker "Get Info" panel of a header file ..etc). This time the kernel crash comes from a different userland thread than the ripping one (tracker get-info panel) but ends up with a VM pagefault panic too. Historically, using Tracker at the same time as the rip goes on, has been a relatively surefire way to trigger a panic.

Not sure if that's useful, but I toyed a bit with KDL not knowing what I was doing... and realized there was an awful lot of listed vnodes (46'000).. could a resource leak in the application possibly be related to the panic ? File Descriptors maybe ?

comment:7 Changed 9 months ago by anevilyak

Since this only happens during ripping, the more likely culprit is a bug in cddafs itself.

comment:8 follow-up: Changed 9 months ago by ttcoder

That certainly leaves us some room to manoeuver: we plan to try reducing usage of cddafs as much as possible (in case a different usage pattern no longer triggers the bug), and if that still does not work we'll remove it from the equation entirely and read the CD data another way. I tried the /bin/cdda2wav command but it crashes when reading the very first track (seems to be an old alpha3 build so maybe that's the reason it does not work on recent nightlies?), so I'll have to replicate its behavior, hacking in the /dev/atapi hierarchy.

Dane got another KDL today, trying the strategy 1 outline up here (reducing usage of cddafs to straight copying the files in one pass). The photo is blurry and the first steps of the backtrace are the same, so no point in posting it I suppose. The last step is a failed assert in X86VMTranslationMapPAE.cpp:231

comment:9 in reply to: ↑ 8 Changed 9 months ago by anevilyak

Replying to ttcoder:

Dane got another KDL today, trying the strategy 1 outline up here (reducing usage of cddafs to straight copying the files in one pass). The photo is blurry and the first steps of the backtrace are the same, so no point in posting it I suppose. The last step is a failed assert in X86VMTranslationMapPAE.cpp:231

That would be ticket:8345.

comment:10 Changed 9 months ago by ttcoder

  • Summary changed from Ripping CDs -> page fault panic (via sscanf() in ffmpeg) to Ripping CDs -> page fault panic

So #8345 leads to #7904 and to 1) trying to use only one core and 2) try to reproduce the TranslationMapPAE crash with installoptionalpackage. I've never had any problem with installoptionalpackage on any hrev, so I focused on 1), which was more fruitful -- documenting today's extensive testing:

I set out to test 2x2 tactics: disabling my second CPU core; and copying files to HDD and rip from there, rather than rip directly from CD.

The result is that I need both tactics to avoid crashing:

  • neither strategy in use: KDL, as established before.
  • second core disabled, ripping directly from CD: KDL.
  • second core disabled, two-stage ripping: I was good for half a dozen CDs!
  • re-enabled the second core, continued two-stage ripping: KDL'ed at the end of the very current in-progress CD rip, when quitting MediaPlayer.

Both KDLs were of the page fault panic-in-kernel sort. The first KDL had a very short backtrace (Addon, x86_int, page_fault, panic/trampoline), triggered when I invoked the filetypes tracker add-on. Second was triggered when quitting MediaPlayer and the backtrace involved BKernel::Thread::Deletetimers so it seems again a kernel structure was conpromised. Neither backtrace seemed worth grabbing for posting.

Now that I'm sure how to crash the thingy, i'm going to focus on not crashing :-) , i.e. confirm on 10 or 20 CDs that it indeed won't crash with a core disabled and two-stage file processing. If it won't crash for 20 CDs, then I'll just have to code that work-around in my app, in theory it should work: disable all but one cores in BApp ctor, re-enable in BApp dtor (using again _kern_set_cpu_enabled(..)), and copy files to HDD to encode from there.

comment:11 follow-up: Changed 9 months ago by bonefish

I assume you didn't actually try to disable SMP (in the boot loader or via kernel settings)? The disable CPU feature just tells the scheduler not to schedule any threads but the idle thread on that CPU. Interrupts can still be processed on it. So it's not like disabling all but one CPU actually disables SMP.

Moreover it is entirely possible (maybe even likely) that CD ripping doesn't actually trigger the bug, but just makes it visible. Reading the whole CD simply causes a large cache turnover (essentially the whole CD is read into RAM). So if some paging related structures got corrupted or a page was erroneously marked free before, using a lot of pages makes it rather likely to run into the issue. So even if SMP could actually be disabled live, running into the problem after doing that doesn't necessarily mean that isn't purely SMP related.

Long story short, it would be nice if you could test with SMP disabled.

comment:12 in reply to: ↑ 11 Changed 9 months ago by korli

Replying to bonefish:

I assume you didn't actually try to disable SMP (in the boot loader or via kernel settings)? The disable CPU feature just tells the scheduler not to schedule any threads but the idle thread on that CPU. Interrupts can still be processed on it. So it's not like disabling all but one CPU actually disables SMP.

FWIW MSI interrupts seem to only target the first CPU (see here the destination).

Changed 9 months ago by ttcoder

Disabling SMP makes it more difficult to panic, but it's still possible

comment:13 Changed 9 months ago by ttcoder

Ingo, long story short -- SMP makes a bit of difference quantatively, but not qualitatively, Also I did a "page_stats" in case you want to take a look.

@korli & bonefish: in case that matters, when I toyed with Pulse to disable a core I always disabled the second one, leaving only the first CPU running. As to the disabled SMP case, I notice Pulse and ProcessController behave as if there is only one CPU in the system at all, so I dunno which it is (probably same situation, it's CPU # 0 that's running and handling interrupts?).

The long version:
After Ingo's explanation I rebooted and picked "Disable SMP" in the boot menu. Ended up with only one CPU gauge in the deskbar ..etc as should be. Started ripping, reading the news, listening to ripped tracks (quitting/restarting/quitting/restarting MediaPlayer each time, so as the "mutate" the kernel structures as much and as often as possible), doing Alt-A select all and "Get Info" in tracker ..etc. After 4 CDs the thing was still running like a champ so I started to feel excited, "wow this test is hammering the machine in an unprecedented way yet it won't crash with SMP disabled, so we have uncovered an SMP-specific bug in Haiku"-style. On the fifth CD it KDL'ed though, sigh! So SMP is not the culprit it seems.

I've seen in the "sister" tickets that Ingo recommended using "page_stats" too, so I did (see attached screen grab).

Also I noticed this in my syslog, but it's there even when not ripping CDs so probably not significant: mark_page_range_in_use(0x0, 0xa0): start page is before free list

comment:14 Changed 9 months ago by anevilyak

Your image attachment appears to be broken.

Changed 9 months ago by ttcoder

Take two, as a PNG (previous was a JPEG2000 generated in hrev45824)

comment:15 Changed 9 months ago by anevilyak

That would be why, for future reference, the web standard is normal JPEG, it's not interchangeable or compatible with JPEG2000.

comment:16 Changed 6 months ago by ttcoder

Realized after the fact that W+ (like N+), not being native, it does not use the haiku Translators to decode images... Not that it should necessarily do so either.

Status update:

hrev46188 (which includes bonefish's stability fix(es)), testing on the asus f2a55m system: I'm unable to trigger any misbehaving. I'm now north of 15 audio CDs encoding. We will continue testing in different conditions in the coming weeks. Will also crosspost to #8345 (EDIT: <strike>8545</strike>) if further unable to reproduce either type of crash, which would be great news.

Last edited 6 months ago by ttcoder (previous) (diff)

comment:17 Changed 6 months ago by bonefish

The recent fix for #8545 doesn't fix anything serious and I don't see how it would be related to the crashes in this ticket. Those seem to be caused by memory corruption.

Changed 6 months ago by ttcoder

Winding down after the 15 CDs encoding session

comment:18 Changed 6 months ago by ttcoder

(fixed the 'peer' ticket number to 8345 in my previous message, my bad)

Guess you're right bonefish, seems the problem is still there in 46188.. I actually got a KDL a while after writing my triumphant comments this morning.. It was after everything was completed and I did Alt-Q on TP, the ripping application. Attached: a bad pointer (?) in the kernel UserTimerList stumbled upon as the app's Open filepanel was being destructed.

It's a bit reminiscent of my previous encouters with this, which occured in the middle (rather than the end) of my ripping session, as I started/quitted MediaPlayer or Tracker windows as the ripping occured in the background.

Maybe this time the kernel structure got corrupted earlier and I just dodged the change until the end as I didn't quit the impacted app until the end.

Next I'll test on my old Core2 development laptop, where the corruption seems a bit easier to trigger, and I'll remember to start/quit some new threads and teams during the ripping time. If I find an "easy" way to trigger the KDL that'll make me courageous enough to find a method to the madness :-) again and go "divide and conquer".. divide the problem in terms of drivers being used, steps involved in the ripping process (CD access, LAME pipe() and so on).

P.S. 0x4120646d translates to 'A dm' in ASCII, what for it's worth.. Which is probably not much, since previous KDLs did not have ascii contents in the page-faulty addresses that I remember...

Changed 6 months ago by ttcoder

Different KDL message and (interestingly) the bad address is -> 0xdeadbeef <- !

comment:19 Changed 6 months ago by ttcoder

Possible breakthrough warning (if we're lucky). Please take a look and tell me if my expectations are inflated or not :-)

I enrolled dsuden in the test with a recent nightly, he had a much easier time than me reproducing the KDL. First two backtraces/message are similar, "unhandled page fault in kernel space". The third one made me fall off my chair though. Here it is typed to text for emphasis:

PANIC: page fault, but interrupts were disabled. Touching address 0xadbeefe2 from ip 0x8010287f

If it gets the spidey sense tickling in a userland guy like me, maybe it will really interest kernel-savvy people :-)

So why is there a shifted/mangled deadbeef in there.. Seems related for sure to the ecx register, with an almost "clean" deadbeef value. Seems like there were two 0xdeadbeef values back to back, and list_remove_link() tried to access an address across the two.

IIRC 0xdeadbeef is written when a pointer is free()d (or whatever the kernel-land equivalent of free() is), is that right ? That should get us closer to finding the problem, with some luck ?

Also gotta note that the KDL message is similar to #7889 rather than to this ticket.. But that latter ticket's backtrace is very different, so for now I'm filing this .jpeg here. I could even file a third ticket.. Might make sense since this KDL smells of a different problem than the others, possibly related to heap/malloc corruption instead of plain old memory corruption ?

Feedback eagerly anticipated, crossing fingers to not be like "actually that deadbeef thing does not tell us much useful stuff at all, continue testing some more". *g*

comment:20 Changed 6 months ago by bonefish

The kernel free() does indeed fill the freed memory with "deadbeef". I don't think that is relevant in this case, though. From the back trace and the corresponding disassembly one can infer that in list_remove_link() the line

link->next->prev = link->prev;

crashes. At this point link is in eax and link->next in ecx. Since ecx is a bad pointer, the

mov    %edx,0x4(%ecx)

instruction causes a page fault. The reason why the "deadbeef" in ecx is rotated, is that eax is not 4-byte aligned. link in this case refers to the head element of the Team::dead_threads list. The list element pointers should always be properly aligned. My conclusion is that either the list anchor (i.e. part of the Team structure) or one of the list elements has been corrupted.

comment:21 Changed 6 months ago by ttcoder

I think I understand.. The memory-corrupting bad guy did not necessarily write "deadbeef" in the chained-list itself. Maybe it wrote something else, that happened to be in a valid page, so the linked-list routine followed that corrupted (but memory-valid) pointer and read what it thought was the next linked-list item at that address, which in that very place, contained deadbeef by sheer chance. That is consistent with the fact that I only ever saw one KDL with "deadbeef" in it thus far.

So we still have no clue on who the bad guy is yet.

But I'm making (slow) progress on crafting a reproducible case though: Luckily I've found some steps to reproduce the KDL quickly, in just 2 CDs. I've had 3 times the same KDL in 3 attempts (with corrupted address = 0xb and instruction pointer IP = 0xb, called from userland stop_watching, each single time). So I now consider it triggerable reliably. My next steps will be to reduce the list of steps needed to trigger it, and then make these steps "portable" to you bonefish et alia.

Last edited 6 months ago by ttcoder (previous) (diff)

comment:22 Changed 6 months ago by ttcoder

Just realized that my "new" KDL is actually very similar to #9528 ; time will tell if this one and that one are two separate issues or not..

comment:23 Changed 6 months ago by ttcoder

Just got two (grainy, hard to read) KDL photos from customers, similar to this ticket's.

In both cases the KDL was triggered seemingly without involving cdda-fs at all: it was SoundPlay reading a hard-drive file, ending up in VMCache::LookupPage(), which paniced on a bad pointer.

As to me, I'm knocking my head against walls as I can't reproduce the KDL with the supposed fail-safe steps I mentionned above, sigh :-( . Will try some more.

EDIT both clients seemingly use the old Alpha4.1 hrev though

Last edited 6 months ago by ttcoder (previous) (diff)
Note: See TracTickets for help on using tickets.