Opened 10 years ago
Last modified 4 years ago
#11744 assigned bug
hrev48411: x86VMTranslationMapPAE.cpp:422, *entry&1 == 0
Reported by: | ttcoder | Owned by: | bonefish |
---|---|---|---|
Priority: | normal | Milestone: | R1.1 |
Component: | System/Kernel | Version: | R1/Development |
Keywords: | Cc: | ||
Blocked By: | Blocking: | #14157, #14318 | |
Platform: | x86 |
Description
Got this report from a customer, might be worth filing since this occured post hrev48123, in 48411 ; creating a new ticket as per ticket:8345#comment:83
Attachments (11)
Change History (37)
comment:1 by , 10 years ago
comment:2 by , 10 years ago
Component: | - General → System/Kernel |
---|---|
Owner: | changed from | to
Status: | new → assigned |
Are you able to get any hardware information from that customer for the system this was seen on? A listdev and a syslog from a fresh boot would both be nice, since it'd be helpful to know both the general hardware that's present and things like CPU + amount of RAM.
by , 10 years ago
Attachment: | listdev.txt added |
---|
by , 10 years ago
Attachment: | previous_syslog added |
---|
Looks like this got manhandled by VNC server, but still usable
comment:3 by , 10 years ago
Got a similar one from a user with hrev48366 -- so less recent, but still above the 48123 threshold.
We disabled the background script running that mkdir command, that should 'fix' it for him.
Edit: while inspecting one of the concerned systems, found a script that recursively calls itself (!).. So presumably creating a new /bin/bash instance every time and never letting go of anything. 6 times an hour equals 140+ times a day, I'm guessing that could cause trouble over a few days *g*.. Anyway replaced the self-call with a proper infinite loop.
comment:4 by , 10 years ago
As to the first KDL referred to in comment:1, dsuden and the station are growing confident that it's likely related to HDA line-input recording: that station (WRN) is using time-triggered recording at very high intensity (dozens of recordings per day IIUC), and they only ever see the KDL crash occur on (some of the) recording start-ups (so that would be a setup with TimeTracker 2.0.7 and hrev48411).
Dunno how the HDA driver/recording could be related to PAE paging... other than its sharing the kernelland address space of course. Anyway hope that inspires..
Edit: @dsuden: I'm curious to see more instances/KDL screenshots though, especially since the one above refers to "query" so probably in one of the background scripts that was running, and we decided to disable/modify the query-based shell script so what happens now.. The crash occurs somewhere else ?
comment:5 by , 10 years ago
VM page faults + HDA line input + KDLs = sounds like the bug that Julian tried to fix in this commit. His exact words were:
On some boots it was unlucky and then the HDA controller's DMA engine would go haywire and start overwriting random kernel memory with audio input data. The scheduler was usually the first victim and caused page fault KDLs due to corrupted data structures.
However, he notes in that commit that he only got these problems on boot. Perhaps more, similar issues remain in the code that don't happen 100% of the time and so aren't noticed as often...?
comment:6 by , 10 years ago
That's a compelling possibility, good catch Augustin -- the reason being, we run our own version of the HDA driver (wish I could merge at least some of the changes into Haiku some day) and I don't sync the code with "trunk" very often, so it's very possible they're running a pre-fix driver, hosted in a more recent Haiku hrev.
Asked them to try to revert to "stock" hda temporarily to test that theory, but apparently they're not able/wiling to do so :-/
EDIT: just realized that Julian's fix includes a dprintf("DMA snooping..") that's easy to grep for in the syslog, I'll ask them to grep for it, we'll see how recent their driver is then.
comment:7 by , 10 years ago
Updates..
- for the W.R.N. (hrev48411) station with HDA suspected, turns out our driver does include jua's changes indeed:
KERN: hda: using MSI vector 26 KERN: hda: DMA snooping: no
- posting a KDL below: occured in thread "sh" so apparently they still have some script running.. I've asked dsuden about that.
by , 10 years ago
Attachment: | (WRN,48411,VMCache.LookupPage)IMG953628.jpg added |
---|
Possibly off topic for this ticket (VMCache.LookupPage() crash, not in PAE code)
comment:8 by , 10 years ago
Got some fresh data from WRN after they got a KDL during an FTP access. Two things of note: BFS errors before, and a resource warning during the FTP access. Can post the whole thing if useful. First the BFS errors:
KERN: slab memory manager: created area 0xf2801000 (895377) KERN: bfs: InitCheck:323: Bad data KERN: bfs: inode at 2097162 is already deleted! KERN: bfs: GetNextMatching:803: Bad data KERN: bfs: KERN: could not get inode 2097162 in index "size"! KERN: bfs: InitCheck:323: Bad data KERN: bfs: KERN: inode at 2097162 is already deleted! KERN: bfs: GetNextMatching:803: Bad data KERN: bfs: could not get inode 2097162 in index "size"! KERN: bfs: InitCheck:323: Bad data KERN: bfs: inode at 2097162 is already deleted! (.. about one page high like this ..)
The FTP traffic that was ongoing at the time of the crash looks a tad more harmless:
FTP 'ftpd'[640338]: data setsockopt (TCP_NOPUSH): Invalid Argument KERN: libnetwork.so running in R5 compatibility mode. (..) FTP 'ftpd'[640338]: data setsockopt (TCP_NOPUSH): Invalid Argument KERN: libnetwork.so running in R5 compatibility mode. Last message repeated 1 time KERN: slab memory manager: created area 0xcb001000 (51580897) KERN: slab memory manager: created area 0xcb801000 (51895142) KERN: slab memory manager: created area 0xcc001000 (52211306) KERN: low resource address space: note -> warning KERN: libnetwork.so running in R5 compatibility mode. Last message repeated 4 times KERN: bfs: bfs_open_dir:1644: Not a directory KERN: options = 0 KERN: No APM available. KERN: smp: using ACPI to detect MP configuration (big snip) KERN: <DROP>Welcome to syslog debug output! KERN: Haiku revision: hrev48411
Further down I see what looks like another crash, preceded by lots of "slab" chatter.. Is it indicative of them running out of memory ? (we've had reports from another station of SP gobbling up heaps of memory so I'd better ask):
KERN: slab memory manager: created area 0xf6801000 (3352419) KERN: slab memory manager: created area 0xf7001000 (3594514) KERN: slab memory manager: created area 0xf7801000 (3888898) KERN: slab memory manager: created area 0xf8001000 (4203155) KERN: slab memory manager: created area 0xf8801000 (4518017) KERN: slab memory manager: created area 0xf9001000 (4832886) KERN: slab memory manager: created area 0xf9801000 (5151324) KERN: slab memory manager: created area 0xfa001000 (5459653) KERN: slab memory manager: created area 0xfa801000 (5774577) KERN: slab memory manager: created area 0xfb001000 (6067400) KERN: slab memory manager: created area 0xfb801000 (6385319) KERN: slab memory manager: created area 0xfc001000 (6698986) KERN: slab memory manager: created area 0xfc801000 (7015105) KERN: options = 0 KERN: No APM available.
comment:9 by , 10 years ago
Edit: Never mind, misread, that's the end of the previous run leading into a fresh boot.
by , 10 years ago
Attachment: | slab_syslog.old added |
---|
Potential pattern of slab allocs. preceding crashes
by , 10 years ago
Attachment: | slab_syslog added |
---|
Potential pattern of slab allocs. preceding crashes
comment:10 by , 10 years ago
Here's the whole thing; I do see a bunch of slab allocs preceding 2 of the 3 crashes, but don't know what to make of that.
Dsuden informed me there is still a script running (sigh), it's a deep-recursive /bin/find that runs addattr on hundreds of files, every 10 minutes or some such..
by , 9 years ago
Attachment: | (49137)IMG_4414.JPG added |
---|
hrev49137, sh thread, syslog mentions seemingly minor USB errors previous to crash
comment:11 by , 9 years ago
A station just had this in 49137. Of note -- I also see a (truncated) "bash-4039-debug-03-11-2015-13-20-19.report" report (looks like a segv in terminate_program()) on their desktop, not sure if it occured at the same time or not.
All in all, a rather rare occurence
Edit: okay, as its filename implies, the bash userland crash occured on Nov. 3rd, not today (Nov. 5th).
by , 9 years ago
Attachment: | (49856)20160404_124250.jpg added |
---|
by , 9 years ago
Attachment: | (49856)20160404_124257.jpg added |
---|
comment:12 by , 9 years ago
(note to self: that split screenshot is from BR radio, they do /not/ have timetracker installed)
First I see this in a long time, and this might be the first time I see this KDL in hrev49856 (by opposed to 49137 which was previously in production for us)
Some oddities in their syslogs, including the usual "ARP address mismatch" stuff but especially: a series of dozens of crashes in "ls", "dirname" and the such
Here's some hardware/SW info in case it's (still) useful:
Haiku shredder 1 hrev49856 Nov 19 2015 01:23:00 BePC x86 Haiku 2 AMD A-Series, revision 610f31 running at 3693MHz CPU #0: "AMD A4-6300 APU with Radeon(tm) HD Graphics " 2848808960 bytes free (used/max 580960256 / 3429769216) (cached 2535731200) 63850 semaphores free (used/max 1686 / 65536) 3877 ports free (used/max 219 / 4096) 3892 threads free (used/max 204 / 4096) 2004 teams free (used/max 44 / 2048)
Memory usage is high for some reason in these teams: registrar: 41418752 bytes deskbar: 29163520 a few bash scripts at 10 MB+ each
For what it's worth
comment:13 by , 8 years ago
Platform: | All → x86 |
---|
comment:14 by , 8 years ago
Maybe it would be worth trying to switch to X86PagingMethod32Bit? The kernel drivers setting is "4gb_memory_limit". Uncomment to enable.
comment:15 by , 7 years ago
This happend today on a x86_gcc2 builder in a kvm VM on maui:
PANIC: ASSERT FAILED (/home/builder/builds/haiku/src/system/kernel/arch/x86/pagi ng/pae/X86VMTranslationMapPAE.cpp:422): (*entry & 0x0000000000000001LL) == 0; vi rtual address: 0x111d000, existing pte: 0x36aecee28eb9a675 @ 0x81e658e8 Welcome to Kernel Debugging Land... Thread 151745 "sh" running on CPU 1 stack trace for thread 151745 "sh" kernel stack: 0xcdf92000 to 0xcdf96000 user stack: 0x71402000 to 0x72402000 frame caller <image>:function + offset 0 cdf95c20 (+ 32) 8014f16e <kernel_x86> arch_debug_stack_trace + 0x12 1 cdf95c40 (+ 16) 800a9d6f <kernel_x86> stack_trace_trampoline(NULL) + 0x0b 2 cdf95c50 (+ 12) 80140ede <kernel_x86> arch_debug_call_with_fault_handler + 0x1b 3 cdf95c5c (+ 48) 800ab892 <kernel_x86> debug_call_with_fault_handler + 0x5a 4 cdf95c8c (+ 64) 800a9f8b <kernel_x86> kernel_debugger_loop(ESC[34m0x80193497ESC[0m ESC[36m"PANIC: "ESC[0m, ESC[34m0x801b5fa0ESC[0m ESC[36m"ASSERT FAILED ( %s:%d): %s; virtual address: %#lx, existing pte: %#llx @ %p"ESC[0m, ESC[34m0xcdf95d38ESC[0m ESC[36m"@_ESC<80><A6>^A"ESC[0m, int32: ESC[34m1ESC[0m) + 0x217 5 cdf95ccc (+ 48) 800aa307 <kernel_x86> kernel_debugger_internal(ESC[34m0x80193497ESC[0m ESC[36m"PANIC: "ESC[0m, ESC[34m0x801b5fa0ESC[0m ESC[36m"ASSERT FAILED (%s:%d): %s; virtual address: %#lx, existing pte: %#llx @ %p"ESC[0m, ESC[34m0xcdf95d38ESC[0m ESC[36m"@_ESC<80><A6>^A"ESC[0m, int32: ESC[34m1ESC[0m) + 0x53 6 cdf95cfc (+ 48) 800abc1e <kernel_x86> panic + 0x3a 7 cdf95d2c (+ 80) 80149a3e <kernel_x86> X86VMTranslationMapPAE<ESC[32m0xda857d18ESC[0m>::Map(uint32: ESC[34m0x111d000ESC[0m, uint64: ESC[34m0x81462000ESC[0m, uint32: ESC[34m0x11ESC[0m (ESC[34m17ESC[0m), uint32: ESC[34m0x0ESC[0m (ESC[34m0ESC[0m), vm_page_reservation*: ESC[34m0xcdf95e9cESC[0m) + 0x14e 8 cdf95d7c (+ 80) 8011e7d9 <kernel_x86> map_page(VMArea*: ESC[34m0xb53a1890 ESC[0m, vm_page*: ESC[34m0x84a3efe8ESC[0m, uint32: ESC[34m0x111d000ESC[0m, uint32: ESC[34m0x11ESC[0m (ESC[34m17ESC[0m), vm_page_reservation*: ESC[34m0xcdf95e9c ESC[0m) + 0xf1 9 cdf95dcc (+ 224) 80126875 <kernel_x86> vm_soft_fault(VMAddressSpace*: ESC[34m0xb72f7d38ESC[0m, uint32: ESC[34m0x111d000ESC[0m, ESC[34mfalseESC[0m, ESC[34mfalseESC[0m, ESC[34mtrueESC[0m) + 0x641 10 cdf95eac (+ 160) 80125925 <kernel_x86> vm_page_fault + 0xb5 11 cdf95f4c (+ 80) 801509b7 <kernel_x86> x86_page_fault_exception + 0x17f 12 cdf95f9c (+ 12) 8014391e <kernel_x86> int_bottom_user + 0x73 user iframe at 0xcdf95fa8 (end = 0xcdf96000) eax 0x250c1 ebx 0x111cfdc ecx 0x723ff95c edx 0x6159d114 esi 0x0 edi 0x0 ebp 0x723ff988 esp 0xcdf95fdc eip 0x10e9420 eflags 0x13246 user esp 0x723ff960 vector: 0xe, error code: 0xd 13 cdf95fa8 (+ 0) 010e9420 <libroot.so> fork + 0x98 14 723ff988 (+ 64) 0155d039 <sh> make_child + 0x11d 15 723ff9c8 (+ 96) 0154c8ef <sh> coproc_unsetvars (nearest) + 0x327b 16 723ffa28 (+ 128) 015478be <sh> execute_command_internal + 0x7d6 17 723ffaa8 (+ 128) 01549bc0 <sh> coproc_unsetvars (nearest) + 0x54c 18 723ffb28 (+ 96) 0154a1c1 <sh> coproc_unsetvars (nearest) + 0xb4d 19 723ffb88 (+ 128) 01547bc8 <sh> execute_command_internal + 0xae0 20 723ffc08 (+ 80) 015996df <sh> parse_and_execute + 0x43f 21 723ffc58 (+ 96) 0156bb75 <sh> command_substitute + 0x47d 22 723ffcb8 (+4576) 01571c42 <sh> invalidate_cached_quoted_dollar_at (nearest) + 0xae6 23 72400e98 (+ 64) 015678cd <sh> cond_expand_word (nearest) + 0x10d 24 72400ed8 (+ 80) 01567b18 <sh> expand_string_assignment + 0xb0 25 72400f28 (+2272) 015671d0 <sh> string_rest_of_args (nearest) + 0x304 26 72401808 (+ 80) 015669d4 <sh> strip_trailing_ifs_whitespace (nearest) + 0x418 27 72401858 (+ 48) 01566c65 <sh> do_word_assignment + 0x21 28 72401888 (+ 64) 0157413b <sh> expand_words_shellexp (nearest) + 0xa6b 29 724018c8 (+ 48) 015736a1 <sh> expand_words + 0x21 30 724018f8 (+ 96) 0154ca52 <sh> coproc_unsetvars (nearest) + 0x33de 31 72401958 (+ 128) 015478be <sh> execute_command_internal + 0x7d6 32 724019d8 (+ 80) 01546f02 <sh> execute_command + 0x62 33 72401a28 (+ 80) 0154a0ef <sh> coproc_unsetvars (nearest) + 0xa7b 34 72401a78 (+ 128) 01547bc8 <sh> execute_command_internal + 0xae0 35 72401af8 (+ 80) 01546f02 <sh> execute_command + 0x62 36 72401b48 (+ 80) 0154a0ef <sh> coproc_unsetvars (nearest) + 0xa7b 37 72401b98 (+ 128) 01547bc8 <sh> execute_command_internal + 0xae0 38 72401c18 (+ 80) 01546f02 <sh> execute_command + 0x62 39 72401c68 (+ 80) 0154a0ef <sh> coproc_unsetvars (nearest) + 0xa7b 40 72401cb8 (+ 128) 01547bc8 <sh> execute_command_internal + 0xae0 41 72401d38 (+ 80) 01546f02 <sh> execute_command + 0x62 42 72401d88 (+ 64) 0154a69d <sh> coproc_unsetvars (nearest) + 0x1029 43 72401dc8 (+ 112) 01547a64 <sh> execute_command_internal + 0x97c 44 72401e38 (+ 80) 01546f02 <sh> execute_command + 0x62 45 72401e88 (+ 64) 01535fd1 <sh> reader_loop + 0x289 46 72401ec8 (+ 80) 01533b73 <sh> main + 0x94b 47 72401f18 (+ 48) 0153310f <sh> _start + 0x5b 48 72401f48 (+ 48) 00a15cfe </boot/system/runtime_loader@0x00a04000> <unknown> + 0x11cfe 49 72401f78 (+ 0) 6159d250 <commpage> commpage_thread_exit + 0x00
This happened after around 40 haikuporter builds. The VM is configured with 6 GiB of RAM. The machine didn't take any input anymore, so wasn't able to check anything in KDL.
comment:16 by , 7 years ago
Blocking: | 14157 added |
---|
comment:17 by , 6 years ago
Blocking: | 14318 added |
---|
comment:18 by , 6 years ago
FWIW, I tried to fix the TODO in fork_team about not wanting to copy-on-write the heap by doing a lock/unlock memory:
- if (info.area == parentThread->user_stack_area) + if (info.area == parentThread->user_stack_area) { thread->user_stack_area = area; + // Locking and then immediately unlocking the stack memory has + // the effect of fully wiring it, which is much faster than + // letting it get copied on write. + lock_memory_etc(team->address_space->ID(), address, + info.size, 0); + unlock_memory_etc(team->address_space->ID(), address, + info.size, 0); + }
However, this caused a KDL on fork() at startup (done in sh, even) about not being able to unwire pages. Probably a separate issue, but, maybe related.
by , 5 years ago
Attachment: | IMG_5782.jpg added |
---|
Dane trying to come up with a stable system -- but he had this KDL (after FetchIt started being used?)
comment:19 by , 5 years ago
Not opening a new ticket since the KDL Dane had today seems to fit this long-standing one...
@waddlesplash you asked me off-ticket to explain what issues prevent us from making sales.. Well here's a good candidate, see screenshot above, just in from Dane :-) ..
The good news is, this pae KDL used to occur without rhyme or reason, but now it would seem to occur only when we run FetchIt, rather than CC or SoundPlay. So in theory we could sell stations with a reduced set of apps (just CC and SP).. That significantly reduces the number of potential buyers though.
comment:21 by , 5 years ago
AGMS told me yesterday FetchIt does not compile for 64-bit yet; we understand it would be worthwhile to be 64-bit-clean for all our apps though, in case things are more stable there. Will report when/if we make progress.
Our immediate concern is still on x32 since it will be several months before we get out of beta and ship on x64.
EDIT (17oct19): we have 64 bit builds of about everything now and are pressing on with that platform, with no KDL in sight yet. Bad news of sorts for this ticket, but looks like good news for us
comment:22 by , 5 years ago
Compiling for 64-bit should be pretty easy, usually takes about 10 minutes or less on most codebases. Getting info on whether or not it triggers there will be very useful in debugging what's going on.
Does this reliably trigger? Or is it intermittent? As you can see, mmlr hits it intermittently on builders, but IIRC it's once every few weeks or so, maybe less. If you have something that triggers it in minutes or hours (or even days), that would be really helpful if you could get a binary to us with instructions...
comment:23 by , 5 years ago
I just got this again now when trying to build TeX Live using Haikuporter. Before there were some compiler crashes and "illegal instruction" internal compiler errors. I am quite sure this is the bug that I mentioned in https://github.com/haikuports/haikuports/issues/4886 a few weeks ago.
This is hrev54302 64 bit. The assertion therefore happened in X86VMTranslationMap64Bit.cpp
comment:24 by , 4 years ago
Hi, Is there a status update from TuneTracker on this one? Since we're reviewing which tickets should go into beta3, it would be nice to know if you need this or if you eventually switched to 64bit and managed to avoid it this way.
comment:25 by , 4 years ago
Well as far as we are concerned this could be closed as we didn't see this in a very long time. But I think mmlr's comment:15 is the one that counts now on this ticket (and even then, I think he commented elsewhere that the build bots are much more stable now, so maybe this bug has less bearing on the build bots these days ?)
Anyway no problem with deferring this ticket to way down the road, as far as TTS is concerned. We're 100% x64 now indeed, in our attempt to resume production.
comment:26 by , 4 years ago
Milestone: | R1 → R1.1 |
---|
The screen shot was attached to #8345 instead of here, sorry about that:
https://dev.haiku-os.org/attachment/ticket/8345/20150109_135535.jpg