Opened 10 years ago

Last modified 4 years ago

#11744 assigned bug

hrev48411: x86VMTranslationMapPAE.cpp:422, *entry&1 == 0

Reported by: ttcoder Owned by: bonefish
Priority: normal Milestone: R1.1
Component: System/Kernel Version: R1/Development
Keywords: Cc:
Blocked By: Blocking: #14157, #14318
Platform: x86

Description

Got this report from a customer, might be worth filing since this occured post hrev48123, in 48411 ; creating a new ticket as per ticket:8345#comment:83

Attachments (11)

listdev.txt (3.5 KB ) - added by ttcoder 10 years ago.
previous_syslog (142.2 KB ) - added by ttcoder 10 years ago.
Looks like this got manhandled by VNC server, but still usable
(bob,hrev48366)IMG_0050.jpg (146.0 KB ) - added by ttcoder 10 years ago.
similar KDL on hrev48366
(WRN,48411,VMCache.LookupPage)IMG953628.jpg (235.8 KB ) - added by ttcoder 10 years ago.
Possibly off topic for this ticket (VMCache.LookupPage() crash, not in PAE code)
slab_syslog.old (512.0 KB ) - added by ttcoder 10 years ago.
Potential pattern of slab allocs. preceding crashes
slab_syslog (77.5 KB ) - added by ttcoder 10 years ago.
Potential pattern of slab allocs. preceding crashes
(49137)IMG_4414.JPG (930.0 KB ) - added by ttcoder 9 years ago.
hrev49137, sh thread, syslog mentions seemingly minor USB errors previous to crash
(49856)20160404_124250.jpg (352.0 KB ) - added by ttcoder 9 years ago.
(49856)20160404_124257.jpg (176.1 KB ) - added by ttcoder 9 years ago.
pte_assert.log (468.8 KB ) - added by mmlr 7 years ago.
serial log of the x86_gcc2 builder
IMG_5782.jpg (2.4 MB ) - added by ttcoder 5 years ago.
Dane trying to come up with a stable system -- but he had this KDL (after FetchIt started being used?)

Change History (37)

comment:1 by ttcoder, 10 years ago

The screen shot was attached to #8345 instead of here, sorry about that:

https://dev.haiku-os.org/attachment/ticket/8345/20150109_135535.jpg

comment:2 by anevilyak, 10 years ago

Component: - GeneralSystem/Kernel
Owner: changed from nobody to bonefish
Status: newassigned

Are you able to get any hardware information from that customer for the system this was seen on? A listdev and a syslog from a fresh boot would both be nice, since it'd be helpful to know both the general hardware that's present and things like CPU + amount of RAM.

by ttcoder, 10 years ago

Attachment: listdev.txt added

by ttcoder, 10 years ago

Attachment: previous_syslog added

Looks like this got manhandled by VNC server, but still usable

by ttcoder, 10 years ago

Attachment: (bob,hrev48366)IMG_0050.jpg added

similar KDL on hrev48366

comment:3 by ttcoder, 10 years ago

Got a similar one from a user with hrev48366 -- so less recent, but still above the 48123 threshold.

We disabled the background script running that mkdir command, that should 'fix' it for him.

Edit: while inspecting one of the concerned systems, found a script that recursively calls itself (!).. So presumably creating a new /bin/bash instance every time and never letting go of anything. 6 times an hour equals 140+ times a day, I'm guessing that could cause trouble over a few days *g*.. Anyway replaced the self-call with a proper infinite loop.

Last edited 10 years ago by ttcoder (previous) (diff)

comment:4 by ttcoder, 10 years ago

As to the first KDL referred to in comment:1, dsuden and the station are growing confident that it's likely related to HDA line-input recording: that station (WRN) is using time-triggered recording at very high intensity (dozens of recordings per day IIUC), and they only ever see the KDL crash occur on (some of the) recording start-ups (so that would be a setup with TimeTracker 2.0.7 and hrev48411).

Dunno how the HDA driver/recording could be related to PAE paging... other than its sharing the kernelland address space of course. Anyway hope that inspires..

Edit: @dsuden: I'm curious to see more instances/KDL screenshots though, especially since the one above refers to "query" so probably in one of the background scripts that was running, and we decided to disable/modify the query-based shell script so what happens now.. The crash occurs somewhere else ?

Last edited 10 years ago by ttcoder (previous) (diff)

comment:5 by waddlesplash, 10 years ago

VM page faults + HDA line input + KDLs = sounds like the bug that Julian tried to fix in this commit. His exact words were:

On some boots it was unlucky and then the HDA controller's DMA engine would go haywire and start overwriting random kernel memory with audio input data. The scheduler was usually the first victim and caused page fault KDLs due to corrupted data structures.

However, he notes in that commit that he only got these problems on boot. Perhaps more, similar issues remain in the code that don't happen 100% of the time and so aren't noticed as often...?

comment:6 by ttcoder, 10 years ago

That's a compelling possibility, good catch Augustin -- the reason being, we run our own version of the HDA driver (wish I could merge at least some of the changes into Haiku some day) and I don't sync the code with "trunk" very often, so it's very possible they're running a pre-fix driver, hosted in a more recent Haiku hrev.

Asked them to try to revert to "stock" hda temporarily to test that theory, but apparently they're not able/wiling to do so :-/

EDIT: just realized that Julian's fix includes a dprintf("DMA snooping..") that's easy to grep for in the syslog, I'll ask them to grep for it, we'll see how recent their driver is then.

Last edited 10 years ago by ttcoder (previous) (diff)

comment:7 by ttcoder, 10 years ago

Updates..

  • for the W.R.N. (hrev48411) station with HDA suspected, turns out our driver does include jua's changes indeed:
    KERN: hda: using MSI vector 26
    KERN: hda: DMA snooping: no
    
  • posting a KDL below: occured in thread "sh" so apparently they still have some script running.. I've asked dsuden about that.

by ttcoder, 10 years ago

Possibly off topic for this ticket (VMCache.LookupPage() crash, not in PAE code)

comment:8 by ttcoder, 10 years ago

Got some fresh data from WRN after they got a KDL during an FTP access. Two things of note: BFS errors before, and a resource warning during the FTP access. Can post the whole thing if useful. First the BFS errors:

KERN: slab memory manager: created area 0xf2801000 (895377)
KERN: bfs: InitCheck:323: Bad data
KERN: bfs: inode at 2097162 is already deleted!
KERN: bfs: GetNextMatching:803: Bad data
KERN: bfs: KERN: could not get inode 2097162 in index "size"!
KERN: bfs: InitCheck:323: Bad data
KERN: bfs: KERN: inode at 2097162 is already deleted!
KERN: bfs: GetNextMatching:803: Bad data
KERN: bfs: could not get inode 2097162 in index "size"!
KERN: bfs: InitCheck:323: Bad data
KERN: bfs: inode at 2097162 is already deleted!
(.. about one page high like this ..)

The FTP traffic that was ongoing at the time of the crash looks a tad more harmless:

FTP 'ftpd'[640338]: data setsockopt (TCP_NOPUSH): Invalid Argument
KERN: libnetwork.so running in R5 compatibility mode.
(..)
FTP 'ftpd'[640338]: data setsockopt (TCP_NOPUSH): Invalid Argument
KERN: libnetwork.so running in R5 compatibility mode.
Last message repeated 1 time
KERN: slab memory manager: created area 0xcb001000 (51580897)
KERN: slab memory manager: created area 0xcb801000 (51895142)
KERN: slab memory manager: created area 0xcc001000 (52211306)
KERN: low resource address space: note -> warning
KERN: libnetwork.so running in R5 compatibility mode.
Last message repeated 4 times
KERN: bfs: bfs_open_dir:1644: Not a directory
KERN: options = 0
KERN: No APM available.
KERN: smp: using ACPI to detect MP configuration
(big snip)
KERN: <DROP>Welcome to syslog debug output!
KERN: Haiku revision: hrev48411

Further down I see what looks like another crash, preceded by lots of "slab" chatter.. Is it indicative of them running out of memory ? (we've had reports from another station of SP gobbling up heaps of memory so I'd better ask):

KERN: slab memory manager: created area 0xf6801000 (3352419)
KERN: slab memory manager: created area 0xf7001000 (3594514)
KERN: slab memory manager: created area 0xf7801000 (3888898)
KERN: slab memory manager: created area 0xf8001000 (4203155)
KERN: slab memory manager: created area 0xf8801000 (4518017)
KERN: slab memory manager: created area 0xf9001000 (4832886)
KERN: slab memory manager: created area 0xf9801000 (5151324)
KERN: slab memory manager: created area 0xfa001000 (5459653)
KERN: slab memory manager: created area 0xfa801000 (5774577)
KERN: slab memory manager: created area 0xfb001000 (6067400)
KERN: slab memory manager: created area 0xfb801000 (6385319)
KERN: slab memory manager: created area 0xfc001000 (6698986)
KERN: slab memory manager: created area 0xfc801000 (7015105)
KERN: options = 0
KERN: No APM available.

comment:9 by anevilyak, 10 years ago

Edit: Never mind, misread, that's the end of the previous run leading into a fresh boot.

Last edited 10 years ago by anevilyak (previous) (diff)

by ttcoder, 10 years ago

Attachment: slab_syslog.old added

Potential pattern of slab allocs. preceding crashes

by ttcoder, 10 years ago

Attachment: slab_syslog added

Potential pattern of slab allocs. preceding crashes

comment:10 by ttcoder, 10 years ago

Here's the whole thing; I do see a bunch of slab allocs preceding 2 of the 3 crashes, but don't know what to make of that.

Dsuden informed me there is still a script running (sigh), it's a deep-recursive /bin/find that runs addattr on hundreds of files, every 10 minutes or some such..

by ttcoder, 9 years ago

Attachment: (49137)IMG_4414.JPG added

hrev49137, sh thread, syslog mentions seemingly minor USB errors previous to crash

comment:11 by ttcoder, 9 years ago

A station just had this in 49137. Of note -- I also see a (truncated) "bash-4039-debug-03-11-2015-13-20-19.report" report (looks like a segv in terminate_program()) on their desktop, not sure if it occured at the same time or not.

All in all, a rather rare occurence

Edit: okay, as its filename implies, the bash userland crash occured on Nov. 3rd, not today (Nov. 5th).

Last edited 9 years ago by ttcoder (previous) (diff)

by ttcoder, 9 years ago

Attachment: (49856)20160404_124250.jpg added

by ttcoder, 9 years ago

Attachment: (49856)20160404_124257.jpg added

comment:12 by ttcoder, 9 years ago

(note to self: that split screenshot is from BR radio, they do /not/ have timetracker installed)

First I see this in a long time, and this might be the first time I see this KDL in hrev49856 (by opposed to 49137 which was previously in production for us)

Some oddities in their syslogs, including the usual "ARP address mismatch" stuff but especially: a series of dozens of crashes in "ls", "dirname" and the such

Here's some hardware/SW info in case it's (still) useful:

Haiku shredder 1 hrev49856 Nov 19 2015 01:23:00 BePC x86 Haiku
2 AMD A-Series, revision 610f31 running at 3693MHz
CPU #0: "AMD A4-6300 APU with Radeon(tm) HD Graphics    "
2848808960 bytes free      (used/max  580960256 / 3429769216)
                           (cached   2535731200)
     63850 semaphores free (used/max       1686 /      65536)
      3877 ports free      (used/max        219 /       4096)
      3892 threads free    (used/max        204 /       4096)
      2004 teams free      (used/max         44 /       2048)

Memory usage is high for some reason in these teams: registrar: 41418752 bytes deskbar: 29163520 a few bash scripts at 10 MB+ each

For what it's worth

comment:13 by korli, 8 years ago

Platform: Allx86

comment:14 by korli, 8 years ago

Maybe it would be worth trying to switch to X86PagingMethod32Bit? The kernel drivers setting is "4gb_memory_limit". Uncomment to enable.

comment:15 by mmlr, 7 years ago

This happend today on a x86_gcc2 builder in a kvm VM on maui:

PANIC: ASSERT FAILED (/home/builder/builds/haiku/src/system/kernel/arch/x86/pagi
ng/pae/X86VMTranslationMapPAE.cpp:422): (*entry & 0x0000000000000001LL) == 0; vi
rtual address: 0x111d000, existing pte: 0x36aecee28eb9a675 @ 0x81e658e8
Welcome to Kernel Debugging Land...
Thread 151745 "sh" running on CPU 1
stack trace for thread 151745 "sh"
    kernel stack: 0xcdf92000 to 0xcdf96000
      user stack: 0x71402000 to 0x72402000
frame               caller     <image>:function + offset
 0 cdf95c20 (+  32) 8014f16e   <kernel_x86> arch_debug_stack_trace + 0x12
 1 cdf95c40 (+  16) 800a9d6f   <kernel_x86> stack_trace_trampoline(NULL) + 0x0b
 2 cdf95c50 (+  12) 80140ede   <kernel_x86> arch_debug_call_with_fault_handler + 0x1b
 3 cdf95c5c (+  48) 800ab892   <kernel_x86> debug_call_with_fault_handler + 0x5a
 4 cdf95c8c (+  64) 800a9f8b   <kernel_x86> kernel_debugger_loop(ESC[34m0x80193497ESC[0m ESC[36m"PANIC: "ESC[0m, ESC[34m0x801b5fa0ESC[0m ESC[36m"ASSERT FAILED (
%s:%d): %s; virtual address: %#lx, existing pte: %#llx @ %p"ESC[0m, ESC[34m0xcdf95d38ESC[0m ESC[36m"@_ESC<80><A6>^A"ESC[0m, int32: ESC[34m1ESC[0m) + 0x217
 5 cdf95ccc (+  48) 800aa307   <kernel_x86> kernel_debugger_internal(ESC[34m0x80193497ESC[0m ESC[36m"PANIC: "ESC[0m, ESC[34m0x801b5fa0ESC[0m ESC[36m"ASSERT FAILED (%s:%d): %s; virtual address: %#lx, existing pte: %#llx @ %p"ESC[0m, ESC[34m0xcdf95d38ESC[0m ESC[36m"@_ESC<80><A6>^A"ESC[0m, int32: ESC[34m1ESC[0m) + 0x53
 6 cdf95cfc (+  48) 800abc1e   <kernel_x86> panic + 0x3a
 7 cdf95d2c (+  80) 80149a3e   <kernel_x86> X86VMTranslationMapPAE<ESC[32m0xda857d18ESC[0m>::Map(uint32: ESC[34m0x111d000ESC[0m, uint64: ESC[34m0x81462000ESC[0m, uint32: ESC[34m0x11ESC[0m (ESC[34m17ESC[0m), uint32: ESC[34m0x0ESC[0m (ESC[34m0ESC[0m), vm_page_reservation*: ESC[34m0xcdf95e9cESC[0m) + 0x14e
 8 cdf95d7c (+  80) 8011e7d9   <kernel_x86> map_page(VMArea*: ESC[34m0xb53a1890
ESC[0m, vm_page*: ESC[34m0x84a3efe8ESC[0m, uint32: ESC[34m0x111d000ESC[0m, uint32: ESC[34m0x11ESC[0m (ESC[34m17ESC[0m), vm_page_reservation*: ESC[34m0xcdf95e9c
ESC[0m) + 0xf1
 9 cdf95dcc (+ 224) 80126875   <kernel_x86> vm_soft_fault(VMAddressSpace*: ESC[34m0xb72f7d38ESC[0m, uint32: ESC[34m0x111d000ESC[0m, ESC[34mfalseESC[0m, ESC[34mfalseESC[0m, ESC[34mtrueESC[0m) + 0x641
10 cdf95eac (+ 160) 80125925   <kernel_x86> vm_page_fault + 0xb5
11 cdf95f4c (+  80) 801509b7   <kernel_x86> x86_page_fault_exception + 0x17f
12 cdf95f9c (+  12) 8014391e   <kernel_x86> int_bottom_user + 0x73
user iframe at 0xcdf95fa8 (end = 0xcdf96000)
 eax 0x250c1       ebx 0x111cfdc      ecx 0x723ff95c  edx 0x6159d114
 esi 0x0           edi 0x0            ebp 0x723ff988  esp 0xcdf95fdc
 eip 0x10e9420  eflags 0x13246   user esp 0x723ff960
 vector: 0xe, error code: 0xd
13 cdf95fa8 (+   0) 010e9420   <libroot.so> fork + 0x98
14 723ff988 (+  64) 0155d039   <sh> make_child + 0x11d
15 723ff9c8 (+  96) 0154c8ef   <sh> coproc_unsetvars (nearest) + 0x327b
16 723ffa28 (+ 128) 015478be   <sh> execute_command_internal + 0x7d6
17 723ffaa8 (+ 128) 01549bc0   <sh> coproc_unsetvars (nearest) + 0x54c
18 723ffb28 (+  96) 0154a1c1   <sh> coproc_unsetvars (nearest) + 0xb4d
19 723ffb88 (+ 128) 01547bc8   <sh> execute_command_internal + 0xae0
20 723ffc08 (+  80) 015996df   <sh> parse_and_execute + 0x43f
21 723ffc58 (+  96) 0156bb75   <sh> command_substitute + 0x47d
22 723ffcb8 (+4576) 01571c42   <sh> invalidate_cached_quoted_dollar_at (nearest) + 0xae6
23 72400e98 (+  64) 015678cd   <sh> cond_expand_word (nearest) + 0x10d
24 72400ed8 (+  80) 01567b18   <sh> expand_string_assignment + 0xb0
25 72400f28 (+2272) 015671d0   <sh> string_rest_of_args (nearest) + 0x304
26 72401808 (+  80) 015669d4   <sh> strip_trailing_ifs_whitespace (nearest) + 0x418
27 72401858 (+  48) 01566c65   <sh> do_word_assignment + 0x21
28 72401888 (+  64) 0157413b   <sh> expand_words_shellexp (nearest) + 0xa6b
29 724018c8 (+  48) 015736a1   <sh> expand_words + 0x21
30 724018f8 (+  96) 0154ca52   <sh> coproc_unsetvars (nearest) + 0x33de
31 72401958 (+ 128) 015478be   <sh> execute_command_internal + 0x7d6
32 724019d8 (+  80) 01546f02   <sh> execute_command + 0x62
33 72401a28 (+  80) 0154a0ef   <sh> coproc_unsetvars (nearest) + 0xa7b
34 72401a78 (+ 128) 01547bc8   <sh> execute_command_internal + 0xae0
35 72401af8 (+  80) 01546f02   <sh> execute_command + 0x62
36 72401b48 (+  80) 0154a0ef   <sh> coproc_unsetvars (nearest) + 0xa7b
37 72401b98 (+ 128) 01547bc8   <sh> execute_command_internal + 0xae0
38 72401c18 (+  80) 01546f02   <sh> execute_command + 0x62
39 72401c68 (+  80) 0154a0ef   <sh> coproc_unsetvars (nearest) + 0xa7b
40 72401cb8 (+ 128) 01547bc8   <sh> execute_command_internal + 0xae0
41 72401d38 (+  80) 01546f02   <sh> execute_command + 0x62
42 72401d88 (+  64) 0154a69d   <sh> coproc_unsetvars (nearest) + 0x1029
43 72401dc8 (+ 112) 01547a64   <sh> execute_command_internal + 0x97c
44 72401e38 (+  80) 01546f02   <sh> execute_command + 0x62
45 72401e88 (+  64) 01535fd1   <sh> reader_loop + 0x289
46 72401ec8 (+  80) 01533b73   <sh> main + 0x94b
47 72401f18 (+  48) 0153310f   <sh> _start + 0x5b
48 72401f48 (+  48) 00a15cfe   </boot/system/runtime_loader@0x00a04000> <unknown> + 0x11cfe
49 72401f78 (+   0) 6159d250   <commpage> commpage_thread_exit + 0x00

This happened after around 40 haikuporter builds. The VM is configured with 6 GiB of RAM. The machine didn't take any input anymore, so wasn't able to check anything in KDL.

by mmlr, 7 years ago

Attachment: pte_assert.log added

serial log of the x86_gcc2 builder

comment:16 by mmlr, 6 years ago

Blocking: 14157 added

comment:17 by waddlesplash, 6 years ago

Blocking: 14318 added

comment:18 by waddlesplash, 6 years ago

FWIW, I tried to fix the TODO in fork_team about not wanting to copy-on-write the heap by doing a lock/unlock memory:

-			if (info.area == parentThread->user_stack_area)
+			if (info.area == parentThread->user_stack_area) {
 				thread->user_stack_area = area;
+				// Locking and then immediately unlocking the stack memory has
+				// the effect of fully wiring it, which is much faster than
+				// letting it get copied on write.
+				lock_memory_etc(team->address_space->ID(), address,
+					info.size, 0);
+				unlock_memory_etc(team->address_space->ID(), address,
+					info.size, 0);
+			}

However, this caused a KDL on fork() at startup (done in sh, even) about not being able to unwire pages. Probably a separate issue, but, maybe related.

by ttcoder, 5 years ago

Attachment: IMG_5782.jpg added

Dane trying to come up with a stable system -- but he had this KDL (after FetchIt started being used?)

comment:19 by ttcoder, 5 years ago

Not opening a new ticket since the KDL Dane had today seems to fit this long-standing one...

@waddlesplash you asked me off-ticket to explain what issues prevent us from making sales.. Well here's a good candidate, see screenshot above, just in from Dane :-) ..

The good news is, this pae KDL used to occur without rhyme or reason, but now it would seem to occur only when we run FetchIt, rather than CC or SoundPlay. So in theory we could sell stations with a reduced set of apps (just CC and SP).. That significantly reduces the number of potential buyers though.

comment:20 by waddlesplash, 5 years ago

This is on a 32-bit system. Does the KDL also reproduce under 64-bit?

comment:21 by ttcoder, 5 years ago

AGMS told me yesterday FetchIt does not compile for 64-bit yet; we understand it would be worthwhile to be 64-bit-clean for all our apps though, in case things are more stable there. Will report when/if we make progress.

Our immediate concern is still on x32 since it will be several months before we get out of beta and ship on x64.

EDIT (17oct19): we have 64 bit builds of about everything now and are pressing on with that platform, with no KDL in sight yet. Bad news of sorts for this ticket, but looks like good news for us

Last edited 5 years ago by ttcoder (previous) (diff)

comment:22 by waddlesplash, 5 years ago

Compiling for 64-bit should be pretty easy, usually takes about 10 minutes or less on most codebases. Getting info on whether or not it triggers there will be very useful in debugging what's going on.

Does this reliably trigger? Or is it intermittent? As you can see, mmlr hits it intermittently on builders, but IIRC it's once every few weeks or so, maybe less. If you have something that triggers it in minutes or hours (or even days), that would be really helpful if you could get a binary to us with instructions...

comment:23 by jmairboeck, 4 years ago

I just got this again now when trying to build TeX Live using Haikuporter. Before there were some compiler crashes and "illegal instruction" internal compiler errors. I am quite sure this is the bug that I mentioned in https://github.com/haikuports/haikuports/issues/4886 a few weeks ago.

This is hrev54302 64 bit. The assertion therefore happened in X86VMTranslationMap64Bit.cpp

Last edited 4 years ago by jmairboeck (previous) (diff)

comment:24 by pulkomandy, 4 years ago

Hi, Is there a status update from TuneTracker on this one? Since we're reviewing which tickets should go into beta3, it would be nice to know if you need this or if you eventually switched to 64bit and managed to avoid it this way.

comment:25 by ttcoder, 4 years ago

Well as far as we are concerned this could be closed as we didn't see this in a very long time. But I think mmlr's comment:15 is the one that counts now on this ticket (and even then, I think he commented elsewhere that the build bots are much more stable now, so maybe this bug has less bearing on the build bots these days ?)

Anyway no problem with deferring this ticket to way down the road, as far as TTS is concerned. We're 100% x64 now indeed, in our attempt to resume production.

comment:26 by pulkomandy, 4 years ago

Milestone: R1R1.1
Note: See TracTickets for help on using tickets.