Opened 7 years ago

Closed 5 years ago

Last modified 4 years ago

#8345 closed bug (fixed)

PANIC: ASSERT FAILED ... x86/paging/pae/x86VMTranslationMapPAE.cpp:231

Reported by: kallisti5 Owned by: bonefish
Priority: high Milestone: R1/beta1
Component: System/Kernel Version: R1/Development
Keywords: vm pae Cc: smc.collins@…, mdisreali@…, umccullough, degea@…
Blocked By: Blocking: #9435, #9520, #9677
Has a Patch: no Platform: x86

Description (last modified by bonefish)

I've seen this a few times now randomly after doing lots of compiling under Haiku GCC4...

PANIC: ASSERT FAILED ... x86/paging/pae/x86VMTranslationMapPAE.cpp:231
(*entry & 0x0000000000000001LL) == 0; virtual address: 0x7ffef000, existing pte: 0xffffffffffffffff

http://cgit.haiku-os.org/haiku/tree/src/system/kernel/arch/x86/paging/pae/X86VMTranslationMapPAE.cpp?id=hrev43765#n229

Attachments (20)

IMG_20120220_131211.jpg (99.8 KB) - added by kallisti5 7 years ago.
IMG_20120220_131211.2.jpg (99.8 KB) - added by kallisti5 7 years ago.
IMG_20120220_131220.jpg (69.2 KB) - added by kallisti5 7 years ago.
page_stats_failure.jpg (162.2 KB) - added by kallisti5 7 years ago.
fail_optiplex_page_stats.jpg
Haiku_8345_Ticket.patch (619 bytes) - added by haxworx 7 years ago.
20130501_004.jpg (640.5 KB) - added by calvin 6 years ago.
Haiku hrev45595 failing
20130501_006.jpg (897.8 KB) - added by calvin 6 years ago.
Haiku hrev45595 failing (backtrace)
20130501_007.jpg (798.9 KB) - added by calvin 6 years ago.
Haiku hrev45595 failing (VESA video)
20130501_008.jpg (1015.6 KB) - added by calvin 6 years ago.
Haiku hrev45595 failing (VESA video, backtrace)
KDL.jpg (735.0 KB) - added by humdinger 5 years ago.
kdl on hrev46690
port.txt (9.8 KB) - added by oco 5 years ago.
port.2.txt (46.0 KB) - added by oco 5 years ago.
Serial trace with 2 CPUs in my virtualbox machine
Haiku_r46706_janvier_2014.vbox (7.1 KB) - added by oco 5 years ago.
20140121_214836.jpg (1.6 MB) - added by SeanCollins 5 years ago.
crash on latest ggc2 hybrid hrev46726
xaxes_2014-01-27_16_32_05.png (77.2 KB) - added by Premislaus 5 years ago.
xaxes_2014-01-27_16_32_16.png (66.1 KB) - added by Premislaus 5 years ago.
kvm_setup.txt (3.1 KB) - added by Premislaus 5 years ago.
IMG_20140605_164833.jpg (977.8 KB) - added by ttcoder 5 years ago.
Customer report from earlier this week, he's still running hrev46104 so might be a symptom already fixed by bonefish in PM or otherwise not relevant
areas.06192014.txt (126.0 KB) - added by kallisti5 5 years ago.
areas of kdl, Jun 19th, 2014
20150109_135535.jpg (3.3 MB) - added by dsuden 4 years ago.
Still seeing that KDL in Haiku shredder 1 hrev48411 Dec 1 2014 14:26:35 BePC x86 Haiku

Change History (106)

comment:1 Changed 7 years ago by kallisti5

Description: modified (diff)

Changed 7 years ago by kallisti5

Attachment: IMG_20120220_131211.jpg added

Changed 7 years ago by kallisti5

Attachment: IMG_20120220_131211.2.jpg added

Changed 7 years ago by kallisti5

Attachment: IMG_20120220_131220.jpg added

comment:2 Changed 7 years ago by kallisti5

Haiku: hrev43717 gcc4 hybrid not dirty.

comment:3 Changed 7 years ago by kallisti5

Milestone: R1R1/alpha4
Priority: normalcritical

speaking to mmlr, he mentions also seeing this issue:

3:20 <@mmlr> kallisti5: I've seen that translationmap error as well
13:20 <@kallisti5> mmlr: thats good to hear :)
13:20 <@kallisti5> i've seen it quite a few times as of late
13:20 <@kallisti5> mmlr: you don't use radeon_hd do you?
13:20 <@mmlr> I don't
13:21 <@kallisti5> i'm just asking because it came from screen_blanker
13:21 <@kallisti5> ok
13:21 <@mmlr> I think it's some kind of edge case when nearly exhausting physical pages
13:21 <@kallisti5> mmlr: think it may be a blocker for a4?
13:21 <@mmlr> I've seen it last while stress testing things with the guard heap
13:21 <@mmlr> I originally thought it might be related to me manipulating the entries directly
13:22 <@mmlr> but that confirms it happens without guard heap as well
13:22 <@mmlr> well, technically it is kind of critical
13:23 <@mmlr> not sure if someone actually is going to have the time to look into it

Marking a critical bug for alpha4, the random system crash bugs are pretty annoying :)

comment:4 Changed 7 years ago by anevilyak

Owner: changed from axeld to bonefish
Status: newassigned

comment:5 Changed 7 years ago by SeanCollins

Kallisti, I get this crash allot to, I can get around it by disablling all but one cpu core on my machine. Could you try that on your end as well ? I use pulse to disable all but one cpu core.

comment:6 Changed 7 years ago by SeanCollins

Cc: smc.collins@… added

comment:7 Changed 7 years ago by kallisti5

Still happens on hrev44242

I disabled hyper-threading which drops the cores to one vs two.. Will report back status after some testing.

comment:8 Changed 7 years ago by bonefish

Description: modified (diff)

I just noticed I hadn't commented on that one yet. While the issue itself isn't all that critical the underlying cause possibly is. The Map() method of the translation map maps a virtual address to a physical page. That is achieved by writing a respectively composed value (the physical page address mixed with flags for the access permissions and other stuff) into a respective page table entry (PTE). It is expected that when Map() is invoked that the virtual address is not already mapped to a physical page. That's what the assert checks (the "page present" flag of the PTE).

Since the new value is written to the PTE, it will be OK afterwards -- the panic is continuable -- but that the previous value got there is worrisome. Even more so since in this case the value is ~0, something never written by the paging code. So someone else wrote into that page table. Not a comforting thought.

Michael mentioned that he thinks that this happens when nearly exhausting physical pages. When it happens next time, please run a page_stats so we can verify the theory. Also, it would be interesting in what situation the assert was triggered. Like how long the machine ran and what you did before. Obviously a reproducible test case would be perfect.

comment:9 Changed 7 years ago by kallisti5

I've tried continuing on the panic, however it reoccurs over and over (same crash though)

I did disable hyperthreading in the bios (reducing the cpu count to 1 from 2) and the crash seems to be gone.

Is this a "multi-cpu-smp" issue or strictly a "hyperthreading cpu" issue? SeanCollins, do you have a Pentium 4 HT cpu, or a multi-core cpu?

Steps to reproduce for me...

  • Multiple HT cores
  • Disable screensaver and screen blanking
  • Do some work, compile stuff
  • *crash* after a few minutes to one hour.

comment:10 in reply to:  9 Changed 7 years ago by anevilyak

Replying to kallisti5:

Is this a "multi-cpu-smp" issue or strictly a "hyperthreading cpu" issue? SeanCollins, do you have a Pentium 4 HT cpu, or a multi-core cpu?

As another data point, quad core w/ hyperthreading here (ergo 8 CPUs visible to the OS), and I've never seen this under all kinds of punishing workloads. That having been said, 8GB of physical RAM is available, so if it's the low memory condition triggering it, then I'm not easily going to hit that. However, if Ingo's theory about something else overwriting the page tables is correct, then it seems plausible that a particular misbehaving driver in your case might be triggering it, else one would expect it to be much more widespread. The type of CPUs involved really doesn't have much to do with how the memory mapping's handled per se (excepting the obvious availability or absence of PAE), and from a multitasking standpoint, if anything HT should be less strenuous than true multicore since the circumstances under which it can multitask are more limited.

comment:11 Changed 7 years ago by kallisti5

Yeah, I definitely think you're right here.

I have a similar Optiplex GX620 next to me with HT that I can't seem to reproduce the issue on.

SeanCollins, could you get a list of drivers on the your system? I'll do the same for mine and we can see what aligns.

listimage | grep drivers

comment:12 Changed 7 years ago by kallisti5

here is the working Optiplex GX620 as a baseline:

~> listimage | grep drivers
   54                  /boot/system/add-ons/kernel/drivers/dev/console 0x8189c000 0x8189e000    0          0
   55                  /boot/system/add-ons/kernel/drivers/dev/dprintf 0x8176d000 0x8176e000    0          0
   56                     /boot/system/add-ons/kernel/drivers/dev/null 0x816d2000 0x816d3000    0          0
   57                   /boot/system/add-ons/kernel/drivers/dev/random 0x81666000 0x81669000    0          0
   58                      /boot/system/add-ons/kernel/drivers/dev/tty 0x81773000 0x8177b000    0          0
   59                     /boot/system/add-ons/kernel/drivers/dev/zero 0x8166a000 0x8166b000    0          0
  964       /boot/system/add-ons/kernel/drivers/dev/graphics/radeon_hd 0x81126000 0x8112b000    0          0
  985            /boot/system/add-ons/kernel/drivers/dev/graphics/vesa 0x81883000 0x81886000    0          0
  991                 /boot/system/add-ons/kernel/drivers/dev/net/3com 0xcd784000 0xcd79c000    0          0
 1001         /boot/system/add-ons/kernel/drivers/dev/net/broadcom570x 0xcd79e000 0xcd7bb000    0          0
 1157            /boot/system/add-ons/kernel/drivers/dev/input/usb_hid 0x81ab2000 0x81abe000    0          0
 1183              /boot/system/add-ons/kernel/drivers/dev/input/wacom 0x81abf000 0x81ac1000    0          0
 1630        /boot/system/add-ons/kernel/drivers/dev/power/acpi_button 0xcd7fe000 0xcd7ff000    0          0
 1660              /boot/system/add-ons/kernel/drivers/dev/bus/usb_raw 0x81b63000 0x81b65000    0          0
 1664       /boot/system/add-ons/kernel/drivers/dev/audio/hmulti/auich 0x81b8f000 0x81b99000    0          0
 1684         /boot/system/add-ons/kernel/drivers/dev/audio/hmulti/hda 0x81bb2000 0x81bbc000    0          0

comment:13 Changed 7 years ago by bonefish

Keywords: vm pae added; x86VMTranslationMapPAE removed

I wouldn't rule out a driver issue, but it might be a VM issue as well. There's ticket #2718 which hints at some page management problem. The ticket is rather old and might have been fixed in the meantime (e.g. when I did some extensive VM work two years ago).

But very interesting is ticket #7904. There have been issues with VirtualBox for ages -- and IIRC at least initially VB indeed didn't emulate the architecture correctly -- so I was assuming this could just be some weird VB issue. However, the third KDL picture in that ticket shows the same assert that we see here, just in the 32-bit paging code. The other two KDL pictures show an assert in the higher-level page management code.

So it may not be a VB issue. It is possible that the way VB emulates multiple CPUs just triggers the Haiku bug more likely. It would also suggest that the bug is in the page management code. E.g. a free page being handed out twice would also easily explain overwritten memory. It's probably worth trying to track this down with VB.

@kallisti5: I assume you've got at least 4 GB RAM and from your description what you do it doesn't sound like you would have exhausted that much memory. A page_stats wouldn't harm anyway.

comment:14 Changed 7 years ago by kallisti5

Yeah. the machine has 5.6GB ram installed and is a small form factor Optiplex GX620

Drivers on the failing system:

~> listimage | grep driver
   54                  /boot/system/add-ons/kernel/drivers/dev/console 0x8189a000 0x8189c000    0          0
   55                  /boot/system/add-ons/kernel/drivers/dev/dprintf 0x8176d000 0x8176e000    0          0
   56                     /boot/system/add-ons/kernel/drivers/dev/null 0x81665000 0x81666000    0          0
   57                   /boot/system/add-ons/kernel/drivers/dev/random 0x81667000 0x8166a000    0          0
   58                      /boot/system/add-ons/kernel/drivers/dev/tty 0x81773000 0x8177b000    0          0
   59                     /boot/system/add-ons/kernel/drivers/dev/zero 0x8166b000 0x8166c000    0          0
  872       /boot/system/add-ons/kernel/drivers/dev/graphics/radeon_hd 0x81109000 0x8110e000    0          0
  896            /boot/system/add-ons/kernel/drivers/dev/graphics/vesa 0x81138000 0x8113b000    0          0
  908         /boot/system/add-ons/kernel/drivers/dev/net/broadcom570x 0xcd754000 0xcd771000    0          0
 1104            /boot/system/add-ons/kernel/drivers/dev/input/usb_hid 0xde80c000 0xde818000    0          0
 1130              /boot/system/add-ons/kernel/drivers/dev/input/wacom 0x81d3d000 0x81d3f000    0          0
 1464        /boot/system/add-ons/kernel/drivers/dev/power/acpi_button 0xcd7ce000 0xcd7cf000    0          0
 1566              /boot/system/add-ons/kernel/drivers/dev/bus/usb_raw 0x81919000 0x8191b000    0          0
 1570       /boot/system/add-ons/kernel/drivers/dev/audio/hmulti/auich 0x81d85000 0x81d8f000    0          0

working on reproducing issue now for a page_stats output.

comment:15 in reply to:  11 Changed 7 years ago by SeanCollins

Replying to kallisti5:

Yeah, I definitely think you're right here.

I have a similar Optiplex GX620 next to me with HT that I can't seem to reproduce the issue on.

SeanCollins, could you get a list of drivers on the your system? I'll do the same for mine and we can see what aligns.

listimage | grep drivers

Sure thing

   54                  /boot/system/add-ons/kernel/drivers/dev/console 0x819c3000 0x819c5000    0          0
   55                  /boot/system/add-ons/kernel/drivers/dev/dprintf 0x81a64000 0x81a65000    0          0
   56                     /boot/system/add-ons/kernel/drivers/dev/null 0x81a21000 0x81a22000    0          0
   57                   /boot/system/add-ons/kernel/drivers/dev/random 0x819c6000 0x819c9000    0          0
   58                      /boot/system/add-ons/kernel/drivers/dev/tty 0x81ece000 0x81ed7000    0          0
   59                     /boot/system/add-ons/kernel/drivers/dev/zero 0x819ca000 0x819cb000    0          0
 1137       /boot/system/add-ons/kernel/drivers/dev/graphics/radeon_hd 0x81156000 0x8115c000    0          0
 1139            /boot/system/add-ons/kernel/drivers/dev/graphics/vesa 0x81139000 0x8113d000    0          0
 1167              /boot/system/add-ons/kernel/drivers/dev/net/rtl8139 0x816c0000 0x816d7000    0          0
 1168              /boot/system/add-ons/kernel/drivers/dev/net/rtl81xx 0x816ea000 0x81704000    0          0
 1289            /boot/system/add-ons/kernel/drivers/dev/input/usb_hid 0x8195d000 0x81969000    0          0
 1290              /boot/system/add-ons/kernel/drivers/dev/input/wacom 0x8196a000 0x8196c000    0          0
 1701        /boot/system/add-ons/kernel/drivers/dev/power/acpi_button 0x818f5000 0x818f6000    0          0
 1723             /boot/system/add-ons/kernel/drivers/dev/midi/ice1712 0xcda22000 0xcda28000    0          0
 1824               /boot/system/add-ons/kernel/drivers/dev/bus/fw_raw 0x8226c000 0x8226f000    0          0
 1826              /boot/system/add-ons/kernel/drivers/dev/bus/usb_raw 0x822f5000 0x822f7000    0          0
 1832         /boot/system/add-ons/kernel/drivers/dev/audio/hmulti/hda 0x82300000 0x8230a000    0          0
TEAM  642 (/bin/grep drivers):
~> 

Some early pentium 4 chips with HT will not even boot without crashing the kernel at memory setup. I have a few of these cpu's, if someone wants one to test with. I have all AMD hardware in my machine. And this diffenitly only happens when using multiple cpu's during code compile. I can't reproduce it any other time. I will say that it does seem to happen more often when the cached memory gets full, IE when the cached memory useage gets close to the max physical memory, this seems to occur more often.so maybe its some goofy issue with multiple cpu cores acessing cached memory ?? or maybe things are being cached in the l2/l3 areas of the cpu ? But this definately occurs on my thuban 6 core cpu only when compiling with all 6 core enabled. I can disable 5 of these core, no problem, and its not a particular core, I tried that to as a way to rule out a bad cpu.

Changed 7 years ago by kallisti5

Attachment: page_stats_failure.jpg added

fail_optiplex_page_stats.jpg

comment:16 Changed 7 years ago by bonefish

The page_stats show > 4GB of free pages. I guess this refutes the theory that it happens when free memory is nearly exhausted.

comment:17 Changed 7 years ago by Disreali

Cc: mdisreali@… added

comment:18 Changed 7 years ago by kallisti5

I just realized my "working" GX620 only has 2GB ram in it.. so it may also suffer the same issue. I'm trying to find a few sticks to bump it it to 4GB+ now.

Changed 7 years ago by haxworx

Attachment: Haiku_8345_Ticket.patch added

comment:19 Changed 7 years ago by haxworx

Has a Patch: set

comment:20 Changed 7 years ago by haxworx

Added Haiku_8345_Ticket.patch

Sample code, think it is a presedence issue.

#include <stdio.h>

int main(void) {

int i = 0; int table[4096];

for (i = 0; i < 4096; i++) {

table[i] = i +2;

}

printf("Test One %d %d\n", table + 2048 / 1024 % 512, table[2048] / 1024 % 512);

exit(0);

}

comment:21 in reply to:  20 ; Changed 7 years ago by anevilyak

Replying to haxworx:

Your test case doesn't actually match your patch ; the patch and the original code are logically equivalent ( / and % have higher precedence than +), but the expression that your sample code uses to calculate the second part isn't the same, and not correct for this case anyways.

comment:22 Changed 7 years ago by kallisti5

I can confirm that this bug still exists on the latest R1A4 RC. (hrevr1alpha4-44608)

So far the following has been established:

  • Occurs on gcc2 and gcc4 images
  • Occurs with one or multiple cpu cores
  • Only occurs when system memory is > 4GB (PAE active)
  • Takes some time to occur. Generally if I boot the system, disable the screensaver, and let it sit overnight it:
    • Crashes on its own.
    • Crashes as soon as you move the mouse ~1 inch
Last edited 7 years ago by kallisti5 (previous) (diff)

comment:23 Changed 7 years ago by leavengood

I suggest due to the infrequency of this and the required memory to make it occur, that we not consider it an alpha4 blocker.

comment:24 in reply to:  22 ; Changed 7 years ago by bonefish

Replying to kallisti5:

I can confirm that this bug still exists on the latest R1A4 RC. (hrevr1alpha4-44608)

So far the following has been established:

  • Occurs on gcc2 and gcc4 images
  • Occurs with one or multiple cpu cores

In your and Sean's comments above I actually read that it doesn't happen with only one CPU (one logical CPU). That would be consistent with the observations in #7904.

  • Only occurs when system memory is > 4GB (PAE active)

AFAICT no one has confirmed yet that on the same machine the problem does cease to occur with PAE disabled. #7904 demonstrates that a KDL with the analogous stack trace is possible with 32 bit paging. I would consider it more likely that the issue occurs only on specific CPU architectures or models. It would be nice, if everyone who has tried to reproduce the issue could list their corresponding CPU models for positive and negative results.

I have a negative for Intel Core i7 860.

Also, it would be interesting to learn whether the machines that have the issue also test positive for #7904. Since that test is much quicker and more reliable, it would be more suitable for debugging.

Replying to leavengood:

I suggest due to the infrequency of this and the required memory to make it occur, that we not consider it an alpha4 blocker.

Sure. If #7904 refers to the same problem, the issue isn't new anyway.

comment:25 Changed 7 years ago by SeanCollins

Bonefish, I can't reproduce 7904, and I tried for a good 20 minutes, I have run out of things to install.

If its helpful, I have a p4 cpu that panics just like this, simply by booting it to userland. Maybe I should just chuck the CPU and MB in the mail to you and see if it helps find the issue. I can tell you its got something to do with hyperthreading, becuase if you turn off hyperthreading, the machine works fine. If you want it, I'll chuck it in a small box and send it across the pond. At this point its the best hope to discovering the problem imho and shipping ain't no big cost.

comment:26 in reply to:  24 Changed 7 years ago by anevilyak

Replying to bonefish:

I have a negative for Intel Core i7 860.

Ditto negative on i7-2600k w/ 8GB of RAM.

comment:27 in reply to:  25 ; Changed 7 years ago by bonefish

Replying to SeanCollins:

Bonefish, I can't reproduce 7904, and I tried for a good 20 minutes, I have run out of things to install.

OK, thanks for trying!

If its helpful, I have a p4 cpu that panics just like this, simply by booting it to userland. Maybe I should just chuck the CPU and MB in the mail to you and see if it helps find the issue. I can tell you its got something to do with hyperthreading, becuase if you turn off hyperthreading, the machine works fine. If you want it, I'll chuck it in a small box and send it across the pond. At this point its the best hope to discovering the problem imho and shipping ain't no big cost.

I have an old unplugged P4 box sitting around myself. I guess I should try that one first. I don't quite recall how much memory it has, though. Might only be 2 GB.

comment:28 Changed 7 years ago by kallisti5

Bonefish, I can't reproduce 7904, and I tried for a good 20 minutes, I have run out of things to install.

It generally takes a good hour or two to reproduce.

I should note that I've seen this bug on mostly p4 era Dell Optiplex machines such as the GX620.

I have the hardware available, if anyone can think of anything that may help us troubleshoot this let me know! I'll get it ASAP

comment:29 in reply to:  27 Changed 7 years ago by SeanCollins

Replying to bonefish:

Replying to SeanCollins:

Bonefish, I can't reproduce 7904, and I tried for a good 20 minutes, I have run out of things to install.

OK, thanks for trying!

If its helpful, I have a p4 cpu that panics just like this, simply by booting it to userland. Maybe I should just chuck the CPU and MB in the mail to you and see if it helps find the issue. I can tell you its got something to do with hyperthreading, becuase if you turn off hyperthreading, the machine works fine. If you want it, I'll chuck it in a small box and send it across the pond. At this point its the best hope to discovering the problem imho and shipping ain't no big cost.

I have an old unplugged P4 box sitting around myself. I guess I should try that one first. I don't quite recall how much memory it has, though. Might only be 2 GB.

Ram did not matter on the p4 machines I have seen,

comment:30 Changed 7 years ago by axeld

I've also seen this one from time to time on a i7 3770s (4 cores + hyperthreading), 8 GB of RAM.

comment:31 Changed 7 years ago by SeanCollins

Has a Patch: unset

@bonefish other kernel experts

Disable the virtual memory on your machine, it seems to get mine to trip into this issue much easier with it off.

comment:32 Changed 7 years ago by nielx

Has a Patch: set

comment:33 Changed 7 years ago by nielx

Testing whether the patch flag is persistent. Sorry for the spam.

comment:34 Changed 7 years ago by scottmc

Milestone: R1/alpha4R1/beta1

No recent action has been taken on this one, so moving to the next milestone.

comment:35 Changed 6 years ago by kallisti5

I just saw this issue on my new IvyBridge ultrabook while compiling clang. gcc4. 10GB DDR3 Ram.

comment:36 Changed 6 years ago by kallisti5

I booted my IvyBridge laptop again and tried compiling clang. While compiling, I noticed my laptop had 100% memory usage. As this laptop has 10GB ram installed... it raised a few eyebrows. I entered the kdl by keystroke and took a screenshot of the page_stats.

EDIT: Nevermind. I think the clang bep is creating too many threads.

Last edited 6 years ago by kallisti5 (previous) (diff)

comment:37 Changed 6 years ago by anevilyak

Blocking: 9435 added

comment:38 Changed 6 years ago by Giova84

Same issue for me.

Ϟ 21:52:36 Ϟ ~ listimage | grep drivers
   51                  /boot/system/add-ons/kernel/drivers/dev/console 0x81706000 0x81708000    0          0
   52                  /boot/system/add-ons/kernel/drivers/dev/dprintf 0x8176b000 0x8176c000    0          0
   53                     /boot/system/add-ons/kernel/drivers/dev/null 0x8173e000 0x8173f000    0          0
   54                   /boot/system/add-ons/kernel/drivers/dev/random 0x81780000 0x81783000    0          0
   55                      /boot/system/add-ons/kernel/drivers/dev/tty 0x81788000 0x81791000    0          0
   56                     /boot/system/add-ons/kernel/drivers/dev/zero 0x81745000 0x81746000    0          0
  905   /boot/system/add-ons/kernel/drivers/dev/graphics/intel_extreme 0x818be000 0x818c3000    0          0
  913            /boot/system/add-ons/kernel/drivers/dev/graphics/vesa 0x81666000 0x8166a000    0          0
 1097             /boot/system/add-ons/kernel/drivers/dev/net/ipro1000 0x81f36000 0x81f86000    0          0
 1230              /boot/system/add-ons/kernel/drivers/dev/input/wacom 0x81fb4000 0x81fb6000    0          0
 1754        /boot/system/add-ons/kernel/drivers/dev/power/acpi_button 0x8116c000 0x8116d000    0          0
 1868              /boot/system/add-ons/kernel/drivers/dev/bus/usb_raw 0xcd782000 0xcd784000    0          0
 1880         /boot/system/add-ons/kernel/drivers/dev/audio/hmulti/hda 0xcd785000 0xcd790000    0          0

comment:39 in reply to:  21 Changed 6 years ago by mmadia

Replying to anevilyak:

Replying to haxworx:

Your test case doesn't actually match your patch ; the patch and the original code are logically equivalent ( / and % have higher precedence than +), but the expression that your sample code uses to calculate the second part isn't the same, and not correct for this case anyways.

Should attachment:Haiku_8345_Ticket.patch be marked obsolete then?

comment:40 Changed 6 years ago by anevilyak

Blocking: 9520 added

comment:41 Changed 6 years ago by anevilyak

Blocking: 9677 added

(In #9677) Thought this looked familiar...this is in fact a duplicate of #8345, though in the latter's case, none of the reporters are able to reliably replicate it like here, so that makes this case potentially a bit more interesting.

comment:42 Changed 6 years ago by phoudoin

Now that it seems more reproducible, may be interesting to trace pageTable value and pageTableEntry index in that table: is there a pattern regarding pageTable or PTE position or not.

AFAIK, only kernel land stuffs or DMA access could change these PAE tables content over VM shoulder, right?

Last edited 6 years ago by phoudoin (previous) (diff)

comment:43 in reply to:  42 Changed 6 years ago by bonefish

Replying to phoudoin:

Now that it seems more reproducible, may be interesting to trace pageTable value and pageTableEntry index in that table: is there a pattern regarding pageTable or PTE position or not.

The page table is simply some physical page allocated for that purpose. There aren't any address restrictions, so it can be any page. The PTE's address depends on page table address and mapped virtual address. If you mean, whether there's a patter wrt. where the errors occur, I haven't seen any.

AFAIK, only kernel land stuffs or DMA access could change these PAE tables content over VM shoulder, right?

I think the most likely case is that somehow a physical page is doubly-used -- e.g. handed out twice by the page allocation code or not correctly unmapped by someone -- for a page table and at the same time by someone else.

Changed 6 years ago by calvin

Attachment: 20130501_004.jpg added

Haiku hrev45595 failing

Changed 6 years ago by calvin

Attachment: 20130501_006.jpg added

Haiku hrev45595 failing (backtrace)

Changed 6 years ago by calvin

Attachment: 20130501_007.jpg added

Haiku hrev45595 failing (VESA video)

Changed 6 years ago by calvin

Attachment: 20130501_008.jpg added

Haiku hrev45595 failing (VESA video, backtrace)

comment:44 Changed 6 years ago by calvin

Can reproduce with Pentium D 925, 1 GB DDR2, Radeon HD 5450, and SiS 191 ethernet

Attached images are mine, here's Linux output for cpuinfo and lspci in case:

calvin@fulcrum:~$ lspci
00:00.0 Host bridge: Silicon Integrated Systems [SiS] 671MX
00:01.0 PCI bridge: Silicon Integrated Systems [SiS] PCI-to-PCI bridge
00:02.0 ISA bridge: Silicon Integrated Systems [SiS] SiS968 [MuTIOL Media IO] (rev 01)
00:02.5 IDE interface: Silicon Integrated Systems [SiS] 5513 [IDE] (rev 01)
00:03.0 USB Controller: Silicon Integrated Systems [SiS] USB 1.1 Controller (rev 0f)
00:03.1 USB Controller: Silicon Integrated Systems [SiS] USB 1.1 Controller (rev 0f)
00:03.3 USB Controller: Silicon Integrated Systems [SiS] USB 2.0 Controller
00:04.0 Ethernet controller: Silicon Integrated Systems [SiS] 191 Gigabit Ethernet Adapter (rev 02)
00:05.0 IDE interface: Silicon Integrated Systems [SiS] SATA Controller / IDE mode (rev 03)
00:06.0 PCI bridge: Silicon Integrated Systems [SiS] PCI-to-PCI bridge
00:0c.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000 Controller (PHY/Link)
00:0f.0 Audio device: Silicon Integrated Systems [SiS] Azalia Audio Controller
01:00.0 VGA compatible controller: ATI Technologies Inc Cedar PRO [Radeon HD 5450]
01:00.1 Audio device: ATI Technologies Inc Manhattan HDMI Audio [Mobility Radeon HD 5000 Series]
calvin@fulcrum:~$ cat /proc/cpuinfo 
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 15
model		: 6
model name	: Intel(R) Pentium(R) D CPU 3.00GHz
stepping	: 5
microcode	: 0xb
cpu MHz		: 2400.000
cache size	: 2048 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 6
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts nopl pni dtes64 monitor ds_cpl est cid cx16 xtpr pdcm lahf_lm
bogomips	: 5999.87
clflush size	: 64
cache_alignment	: 128
address sizes	: 36 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 15
model		: 6
model name	: Intel(R) Pentium(R) D CPU 3.00GHz
stepping	: 5
microcode	: 0xb
cpu MHz		: 2400.000
cache size	: 2048 KB
physical id	: 0
siblings	: 2
core id		: 1
cpu cores	: 2
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 6
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts nopl pni dtes64 monitor ds_cpl est cid cx16 xtpr pdcm lahf_lm
bogomips	: 6000.29
clflush size	: 64
cache_alignment	: 128
address sizes	: 36 bits physical, 48 bits virtual
power management:

comment:45 Changed 6 years ago by umccullough

Cc: umccullough added

comment:46 Changed 6 years ago by pulkomandy

I hit this assertion each time I try to build Haiku now. I tried disabling SMP or limiting memory to 4GB in Haiku boot menu, with no success. The KDL happens near the end of Haiku build, usually when unzipping alternative-system-libs to the partition. This is rather annoying for me.

Haiku is installed on internal HD and building to internal HD, but the source is located on an USB-connected SSD. Not sure if that's of any help... lspci and lsusb output below.

00:00.0 Host bridge: Intel Corporation Mobile 4 Series Chipset Memory Controller Hub (rev 07)
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:02.1 Display controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
00:03.0 Communication controller: Intel Corporation Mobile 4 Series Chipset MEI Controller (rev 07)
00:19.0 Ethernet controller: Intel Corporation 82567LM Gigabit Network Connection (rev 03)
00:1a.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #4 (rev 03)
00:1a.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #5 (rev 03)
00:1a.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #6 (rev 03)
00:1a.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #2 (rev 03)
00:1b.0 Audio device: Intel Corporation 82801I (ICH9 Family) HD Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 1 (rev 03)
00:1c.1 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 2 (rev 03)
00:1c.3 PCI bridge: Intel Corporation 82801I (ICH9 Family) PCI Express Port 4 (rev 03)
00:1d.0 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #1 (rev 03)
00:1d.1 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #2 (rev 03)
00:1d.2 USB controller: Intel Corporation 82801I (ICH9 Family) USB UHCI Controller #3 (rev 03)
00:1d.7 USB controller: Intel Corporation 82801I (ICH9 Family) USB2 EHCI Controller #1 (rev 03)
00:1e.0 PCI bridge: Intel Corporation 82801 Mobile PCI Bridge (rev 93)
00:1f.0 ISA bridge: Intel Corporation ICH9M-E LPC Interface Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation 82801IBM/IEM (ICH9M/ICH9M-E) 4 port SATA Controller [AHCI mode] (rev 03)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 03)
03:00.0 Network controller: Intel Corporation Ultimate N WiFi Link 5300
Bus 000 Device 001: ID 0000:0000  
Bus 001 Device 001: ID 0000:0000  
Bus 002 Device 001: ID 0000:0000  
Bus 003 Device 001: ID 0000:0000  
Bus 004 Device 001: ID 0000:0000  
Bus 005 Device 001: ID 0000:0000  
Bus 006 Device 001: ID 0000:0000  
Bus 007 Device 001: ID 0000:0000  
Bus 007 Device 002: ID 152d:0602 JMicron Technology Corp. / JMicron USA Technology Corp. 

comment:47 Changed 6 years ago by kallisti5

Ok. I've been writing a few quick and dirty stress test applications for Haiku. One of which is my memory exercising program. The memeater application has successfully triggered the PAE after only a few runs.

https://github.com/kallisti5/haikuStress

Code should be mostly self-exclamatory

comment:48 in reply to:  47 Changed 6 years ago by bonefish

Replying to kallisti5:

Ok. I've been writing a few quick and dirty stress test applications for Haiku. One of which is my memory exercising program. The memeater application has successfully triggered the PAE after only a few runs.

With which parameters did you run the test and on which hardware configuration?

comment:49 Changed 6 years ago by kallisti5

When I generated the PAE exception I was doing 100 10 (create an area 100MB, write a small amount of data to it, delete it, 10 times) I also ran the test a few times before hand as well while testing the code.

The memory test could exercise the memory a lot more than it does already (maybe writing random data or something, I just ran out of time last night :)

I am 100% sure the application caused it. About mid-way through testing it KDL'ed

comment:50 Changed 6 years ago by kallisti5

An interesting note.. I was hoping to reproduce this in qemu so I could grab the serial output when the kdl occured.. however this bug doesn't seem to affect qemu.

  • qemu-system-i386 -m 6144 -hda haiku-nightly.image --enable-kvm -usbdevice tablet -smp 2
  • ran 6 instances of memeater 1000 10
  • watched memory usage go over 4GB and drop below it repeatedly for ~5 minutes.

No PAE crashes. This issue oddly seems tightly tied to real hardware.

comment:51 Changed 6 years ago by anevilyak

7 concurrent terminals each running memeater 1024 100 in a loop fails to reproduce it on my i7 8GB system here. Bringing it up to 8 concurrent instances results in them randomly crashing due to failing to allocate the area, but no KDL after an hour of doing this continuously.

comment:52 in reply to:  24 Changed 6 years ago by ttcoder

Replying to bonefish:

In your and Sean's comments above I actually read that it doesn't happen with only one CPU (one logical CPU). That would be consistent with the observations in #7904.

  • Only occurs when system memory is > 4GB (PAE active)

AFAICT no one has confirmed yet that on the same machine the problem does cease to occur with PAE disabled. #7904 demonstrates that a KDL with the analogous stack trace is possible with 32 bit paging. I would consider it more likely that the issue occurs only on specific CPU architectures or models. It would be nice, if everyone who has tried to reproduce the issue could list their corresponding CPU models for positive and negative results.

Positive on dsuden's Asus F2A55M system:

CPU(s): 2x AMD A4 5300
Memory: 3.45 GiB total, 270.04 MiB used
Haiku revision: hrev45824 Jul  5 2013 00:39:46 (BePC)

Not yet positive on my old thinkpad (Core2 Duo, 2 GB RAM); in my tests so far I always got the 'other' one, the messed-up slab allocator panic in #9858.

Also, it would be interesting to learn whether the machines that have the issue also test positive for #7904. Since that test is much quicker and more reliable, it would be more suitable for debugging.

Gotta try that, as well as disabling all CPUs but one.

Last edited 6 years ago by diver (previous) (diff)

comment:53 Changed 6 years ago by ttcoder

Cc: degea@… added

comment:54 Changed 6 years ago by pulkomandy

I'm still not sure how that could be related, but here the crash doesn't happen (or at least, not in noticeable amounts) when I'm not using my usb-attached SSD. I've been building webkit from the internal hard disk and everything is going fine, while trying to just git checkout it to the SSD will reliably trigger the panic.

Doing a checkout from the SSD to the internal drive also leads to a crash (this is writinto HD, reading from USB/SSD). Maybe the different I/O speed just triggers the bug more easily ? (either the slower transfer rate of USB, or the faster seek time of SSD).

comment:55 Changed 6 years ago by bonefish

Has a Patch: unset

comment:56 Changed 6 years ago by bonefish

6dee6653c26736f534abdda6886219ab56503533 of the PM branch addresses one cause. There may be others. Please test and report whether this fix helps in your case.

comment:57 Changed 6 years ago by bonefish

Just a reminder: In hrev46113 the PM branch was merged into the master. So that and any later revision contains the change referred to in the previous comment. Everyone who could fairly easily reproduce the problem, please report whether you still can. Thanks!

comment:58 Changed 6 years ago by bonefish

Any news on this one?

comment:59 Changed 6 years ago by ttcoder

Regarding dsuden and I, being stuck on all my leads I've asked him to use the nuclear option and outright send me a video of him ripping CDs (!) so that I see exactly what he does to trigger this KDL (and sometimes the corrupted slab KDL too) since he has a knack for triggering them quick. When all else fails, try brute force, right ??

We'll use the same Asus motherboard, the same haiku hrev (the buildbot resumed working so we'll probably use hrev46425), the same version of lame, same version of the ripping app, and -- hopefully -- I'll replicate his every mouse moves and clicks from observing his video footage. We won't have the same audio CDs to rip, but if the bug depends on THAT I'm ready to throw my keyboard through the window and follow it afterwards :o)

I'll ask him to start his video after a fresh warm boot, to ensure consistency and ease of replicability. Any other procedural suggestions, post them here to me...

If we are successful we might finally get a reproducible case in the greedy hands of you haiku devs (fingers crossed.. again)

Edit to add: No objection from me on closing this until proven otherwise; also, I should have mentionned http://dev.haiku-os.org/ticket/9858#comment:9 and comment:52 to make my comment clearer.

Last edited 6 years ago by ttcoder (previous) (diff)

comment:60 Changed 6 years ago by bonefish

@ttcoder: Please note that this is not #9858 (your CD ripping page fault(s)). It's about an assert being triggered, for which a fix has been in master since hrev46113. The fix is at least good for one of the causes (the one I encountered). There may be others and I'd like to know, if anyone has encountered this KDL in the meantime, or if this ticket can be closed.

comment:61 Changed 6 years ago by kallisti5

I get this on every boot (well, 3 out of 3 tries) from the latest gcc2h hrev46452 Haiku ISO image. I can provide a screenshot if needed

comment:62 in reply to:  61 Changed 6 years ago by bonefish

Replying to kallisti5:

I get this on every boot (well, 3 out of 3 tries) from the latest gcc2h hrev46452 Haiku ISO image. I can provide a screenshot if needed

Yes, please. It might also be helpful to know whether it's always the same virtual address and/or pte address.

Reproducibility is awesome. If you're willing to work on tracking this down, I could prepare some kernel tracing. Will probably take several iterations.

comment:63 Changed 6 years ago by kallisti5

Sorry for the delay, just saw this note. I'll boot from a recent CD some time this evening and get a screenshot of the kdl.

There is good news however! I haven't seen these PAE crashes lately (other than while booting from the ISO). All of the PAE crashes during os use have been replaced by the crashes in #10279 I guess application crashes are better than kernel crashes :-)

comment:64 Changed 5 years ago by humdinger

I'll join the KDL club with:

PANIC: ASSERT FAILED (src/system/kernel/arch/x86/paging/pae/X86VMTranslationMapPAE.cpp:422:
(*entry & 0x0000000000000001LL) == 0; virtual address: 0x1a6c000,
existing pte: 0x34e03fa034f03fd @ 0x81cda360

This is with hrev46690 after restarting the media services from the Media preferences a few times. Core2Duo, 2gb ram. pretty un-stressed system. Attached the KDL screenshot.

Changed 5 years ago by humdinger

Attachment: KDL.jpg added

kdl on hrev46690

Changed 5 years ago by oco

Attachment: port.txt added

comment:65 Changed 5 years ago by oco

I got the same error as humdinger using a 4 CPUs virtual box machine with nightly ISO image hrev46706. The problem is always reproducible.

PANIC: ASSERT FAILED (/Buildbot/nightly-uploader/workdir/haiku/src/system/kernel/arch/x86/paging/pae/X86VMTranslationMapPAE.cpp:422): (*entry & 0x0000000000000001LL) == 0; virtual address: 0x82009000, existing pte: 0x105c003 @ 0x82801048

Configuring the same virtual machine with 2 CPUs give me a different error, at a later boot stage :

PANIC: ASSERT FAILED (/Buildbot/nightly-uploader/workdir/haiku/src/system/kernel/scheduler/scheduler_thread.h:397): system_time() - fWentSleep >= 0

See attached serial trace from the virtual machine

Changed 5 years ago by oco

Attachment: port.2.txt added

Serial trace with 2 CPUs in my virtualbox machine

comment:66 Changed 5 years ago by bonefish

When you say the issue is always reproducible with 4 CPUs, what's your exact VM configuration (RAM mainly) and what do you have to do exactly?

The latter seems to be a newly introduced scheduler issue and I suspect that's what 59b9b52 (part of hrev46718) addresses.

comment:67 Changed 5 years ago by oco

The VM is configured with standard parameters (see attached .vbox file for details). I have tested different RAM setup from 512 Mo to 8192 Mo with the same result. I have the same problem with 3 or more CPUs in the VM.

There is a 10 Go empty disk file attached to the machine and the nightly ISO from which i boot.

The machine is hosted on a linux host (ubuntu) on a core i7 (4 CPUs / 8 threads).

For the second problem, i have already linked the corresponding trace to the other bug (#8345). I will test it later with hrev46720.

Changed 5 years ago by oco

comment:68 Changed 5 years ago by bonefish

And what do you have to do to reproduce the issue? Just boot or anything else?

comment:69 Changed 5 years ago by oco

Only booting from the ISO.

comment:70 Changed 5 years ago by SeanCollins

I can reproduce this issue simply by booting from a usb stick on my AMD hardware. attached screen shot attachment, its the best I could do with my crappy camera phone.

Last edited 5 years ago by SeanCollins (previous) (diff)

Changed 5 years ago by SeanCollins

Attachment: 20140121_214836.jpg added

crash on latest ggc2 hybrid hrev46726

Changed 5 years ago by Premislaus

Changed 5 years ago by Premislaus

Changed 5 years ago by Premislaus

Attachment: kvm_setup.txt added

comment:71 Changed 5 years ago by Premislaus

Reported to me by user xaxes, because registering on Trac don't work. hrev46767

This bug occurs, when is more than one processor.

comment:72 Changed 5 years ago by pdziepak

The scheduler merge indeed introduced a problem that resulted in an assertion fail just as described in this ticket early at boot. This should be fixed in hrev46779.

comment:73 Changed 5 years ago by kallisti5

This bug did exist pre-scheduler. Would be interesting to see if it still occurs on the pre-scheduler hardware.

I haven't seen it on recent AMD hardware, but I originally saw it on a Pentium 4 Dell Optiplex. (I can't really smp load test though due to #10279)

comment:74 in reply to:  73 Changed 5 years ago by bonefish

Replying to kallisti5:

This bug did exist pre-scheduler.

Note that this isn't/hasn't been just one bug. The assertion is triggered on an inconsistency in the page tables, which can have different causes. I fixed one five months ago, with at least another one remaining, and apparently the scheduler branch merge introduced a new one. Unfortunately we can't discriminate the causes just from the assertion message.

comment:75 Changed 5 years ago by kallisti5

Milestone: R1/beta1R1/alpha5
Priority: criticalhigh

Changed 5 years ago by ttcoder

Attachment: IMG_20140605_164833.jpg added

Customer report from earlier this week, he's still running hrev46104 so might be a symptom already fixed by bonefish in PM or otherwise not relevant

comment:76 Changed 5 years ago by kallisti5

I just saw this on a VM running on Intel hardware. (x86 gcc4, hrev47380)

kdebug> message
PANIC: ASSERT FAILED (/Buildbot/nightly-uploader/workdir/haiku/src/system/kernel/arch/x86/paging/pae/X86VMTranslationMapPAE.cpp:422): (*entry & 0x0000000000000001LL) == 0; virtual address: 0xb63000, existing pte: 0xe5bb6465c880ba3 @ 0x82153b18
kdebug> bt
stack trace for thread 10063 "bash"
    kernel stack: 0xcdeec000 to 0xcdef0000
      user stack: 0x70dcd000 to 0x71dcd000
frame               caller     <image>:function + offset
 0 cdeef9f0 (+  32) 80093763   <kernel_x86> invoke_command_trampoline(void*: 0xcdeefa64) + 0x15
 1 cdeefa10 (+  12) 80115b9e   <kernel_x86> arch_debug_call_with_fault_handler + 0x1b
 2 cdeefa1c (+  48) 800913f3   <kernel_x86> debug_call_with_fault_handler + 0x4e
 3 cdeefa4c (+  64) 8009391d   <kernel_x86> invoke_debugger_command + 0xc8
 4 cdeefa8c (+  64) 80093a36   <kernel_x86> invoke_pipe_segment(debugger_command_pipe*: 0x37, int32: 8132, char*: NULL) + 0x71
 5 cdeefacc (+  48) 80093afd   <kernel_x86> invoke_debugger_command_pipe + 0x89
 6 cdeefafc (+  64) 8009787b   <kernel_x86> ExpressionParser<0xcdeefbb8>::_ParseCommandPipe(int&: 0xcdeefbb4) + 0xc9f
 7 cdeefb3c (+  80) 8009d761   <kernel_x86> ExpressionParser<0xcdeefbb8>::EvaluateCommand(char const*: 0x801aa020 "bt", int&: 0xcdeefbb4) + 0xc09
 8 cdeefb8c (+ 208) 8009e7d1   <kernel_x86> evaluate_debug_command + 0x8a
 9 cdeefc5c (+  80) 80092281   <kernel_x86> kernel_debugger_loop(char const*: 0x2 "<???>", char const*: 0x801aeca8 "����h�~�", char*: 0xcdeefcec, int32: -2146884601) + 0x2e7
10 cdeefcac (+  64) 80092432   <kernel_x86> kernel_debugger_internal(char const*: 0x2 "<???>", char const*: 0x2 "<???>", char*: 0xcdeefd0c, int32: -2146884101) + 0x11e
11 cdeefcec (+  32) 80092616   <kernel_x86> panic + 0x3a
12 cdeefd0c (+  96) 8011f14c   <kernel_x86> X86VMTranslationMapPAE<0xcf61caa8>::Map(uint32: 0xb63000, int64: 187289600, uint32: 0x55 (85), uint32: 0x0 (0), vm_page_reservation*: 0xcdeefe38) + 0x134
13 cdeefd6c (+  80) 800f7472   <kernel_x86> map_page(VMArea*: 0x55, vm_page*: 0xcdeefe38, uint32: 0x0 (0), uint32: 0x0 (0), vm_page_reservation*: 0xcccccccc) + 0xd0
14 cdeefdbc (+ 224) 800fdb1f   <kernel_x86> vm_soft_fault(VMAddressSpace*: 0x1, uint32: 0x1 (1), false, false, true, vm_page**: 0x8, VMAreaWiredRange*: 0x1) + 0xbf3
15 cdeefe9c (+ 160) 8010233e   <kernel_x86> vm_page_fault + 0x2ab
16 cdeeff3c (+  96) 801245ee   <kernel_x86> x86_page_fault_exception + 0x163
17 cdeeff9c (+  12) 801181de   <kernel_x86> int_bottom_user + 0x73
user iframe at 0xcdeeffa8 (end = 0xcdef0000)
 eax 0x0           ebx 0xc1cd6c       ecx 0x71dcb1ac  edx 0x61ad2114
 esi 0x0           edi 0x71dcb1e0     ebp 0x71dcb1b8  esp 0xcdeeffdc
 eip 0xb63980   eflags 0x13212   user esp 0x71dcb19c
 vector: 0xe, error code: 0x1d
18 cdeeffa8 (+   0) 00b63980   <libroot.so> _init (nearest) + 0x1734
19 71dcb1b8 (+  80) 024c051e   <_APP_> make_child + 0xaa
20 71dcb208 (+ 256) 0249f495   <_APP_> command_substitute + 0x126
21 71dcb308 (+ 192) 024a39ed   <_APP_> expand_string (nearest) + 0x2af4
22 71dcb3c8 (+3472) 024a4b6a   <_APP_> expand_string (nearest) + 0x3c71
23 71dcc158 (+  32) 024a5a79   <_APP_> expand_string (nearest) + 0x4b80
24 71dcc178 (+  64) 024a5db1   <_APP_> expand_string_assignment + 0x70
25 71dcc1b8 (+2288) 024a080e   <_APP_> string_quote_removal (nearest) + 0x4ee
26 71dccaa8 (+  96) 024a097c   <_APP_> string_quote_removal (nearest) + 0x65c
27 71dccb08 (+  16) 024a0c89   <_APP_> do_word_assignment + 0x13
28 71dccb18 (+ 112) 024a6604   <_APP_> expand_word (nearest) + 0x168
29 71dccb88 (+  16) 024a6d77   <_APP_> expand_words + 0x13
30 71dccb98 (+ 320) 0248d108   <_APP_> execute_command_internal + 0x11c9
31 71dcccd8 (+  96) 024e4c35   <_APP_> parse_and_execute + 0x446
32 71dccd38 (+  64) 02483cd0   <_APP_> xparse_dolparen (nearest) + 0x6b0
33 71dccd78 (+ 304) 0248559d   <_APP_> main + 0x1216
34 71dccea8 (+  64) 024799eb   <_APP_> _start + 0x50
35 71dccee8 (+  48) 00e7b630   </boot/system/runtime_loader@0x00e6a000> <unknown> + 0x11630
36 71dccf18 (+   0) 61ad2250   <commpage> commpage_thread_exit + 0x00
Last edited 5 years ago by kallisti5 (previous) (diff)

comment:77 Changed 5 years ago by kallisti5

Environment: 4GB of RAM, running in kvm (libvirt) on a Intel i5

kdebug> in_context 10063 dis -b 10
0x00b6394b:       e920e9ffff	jmp 0xb62270
0x00b63950:     ffa3c0050000	jmp 0x5c0(%ebx)
0x00b63956:       68680b0000	push $0xb68
0x00b6395b:       e910e9ffff	jmp 0xb62270
0x00b63960:     ffa3c4050000	jmp 0x5c4(%ebx)
0x00b63966:       68700b0000	push $0xb70
0x00b6396b:       e900e9ffff	jmp 0xb62270
0x00b63970:     ffa3c8050000	jmp 0x5c8(%ebx)
0x00b63976:       68780b0000	push $0xb78
0x00b6397b:       e9f0e8ffff	jmp 0xb62270
0x00b63980:     ffa3cc050000	jmp 0x5cc(%ebx)   <<<
0x00b63986:       68800b0000	push $0xb80
0x00b6398b:       e9e0e8ffff	jmp 0xb62270
0x00b63990:     ffa3d0050000	jmp 0x5d0(%ebx)
0x00b63996:       68880b0000	push $0xb88
0x00b6399b:       e9d0e8ffff	jmp 0xb62270
0x00b639a0:     ffa3d4050000	jmp 0x5d4(%ebx)
0x00b639a6:       68900b0000	push $0xb90
0x00b639ab:       e9c0e8ffff	jmp 0xb62270
0x00b639b0:     ffa3d8050000	jmp 0x5d8(%ebx)
kdebug> 

kdebug> mapping 0x00b63980 10063
page directory: 0xe454e000 (PDPT[0])
page directory entry 5 (0xe454e028): 0x19a4f027
  access:  present writable user executable
  caching:
  flags:   accessed
page table: 0x19a4f000
page table entry 355 (phys: 0x19a4fb18): 0xe5bb6465c880ba3
  access:  present writable executable
  caching: PAT
  flags:   accessed
  address: 0xbb6465c880000


kdebug> mapping 0xb63000
page directory: 0xe454e000 (PDPT[0])
page directory entry 5 (0xe454e028): 0x19a4f027
  access:  present writable user executable
  caching:
  flags:   accessed
page table: 0x19a4f000
page table entry 355 (phys: 0x19a4fb18): 0xe5bb6465c880ba3
  access:  present writable executable
  caching: PAT
  flags:   accessed
  address: 0xbb6465c880000

kdebug> page_stats 
page stats:
total: 1179392
active: 5183 (busy: 0)
inactive: 16059 (busy: 0)
cached: 273643 (busy: 0)
unused: 131162 (busy: 0)
wired: 110978 (busy: 0)
modified: 20 (busy: 0)
free: 634253
clear: 8094
unreserved free pages: 642344
unsatisfied page reservations: 0
mapped pages: 111433
longest free pages run: 5080 pages (at 637167)
longest free/cached pages run: 5080 pages (at 637167)
waiting threads:

free queue: 0x801b6b00, count = 634253
clear queue: 0x801b6b18, count = 8094
modified queue: 0x801b6ad0, count = 20 (0 temporary, 0 swappable, inactive: 0)
active queue: 0x801b6aa0, count = 5183
inactive queue: 0x801b6ab8, count = 16059
cached queue: 0x801b6ae8, count = 273643



Last edited 5 years ago by kallisti5 (previous) (diff)

Changed 5 years ago by kallisti5

Attachment: areas.06192014.txt added

areas of kdl, Jun 19th, 2014

comment:78 Changed 5 years ago by bonefish

There is a certain chance that hrev48117 improves the situation. Before hrev48108 the pgtables array in the x86 kernel_archs structure would overflow, when the boot loader needed more than 4 page tables. Ticket #11377 shows that this can actually happen. The field after pgtables is virtual_end which is used (among other things) for unmapping address ranges used in the boot loader but no longer needed in the kernel. Overwriting this field could therefore cause some of those page table entries to be left over and later trigger the assert this ticket is about.

Long story short, if you could reproduce the assert, please leave a comment, whether you still can.

comment:79 Changed 5 years ago by axeld

Wouldn't it make sense to try out a revision between hrev48108 and hrev48117 if you could reproduce that bug, and see if the panic is triggered? Very nice catch, btw!

comment:80 Changed 5 years ago by oco

I still have the virtual machine that trigger my test case. I just checked that i am still able to reproduce the initial problem. I will check again when the nightly ISO including this change will be available.

comment:81 Changed 5 years ago by oco

While waiting for hrev48117, i just made a try with hrev48113 ISO (latest available version) : in the same virtual machine, it triggers the same error message as in ticket #11377.

comment:82 Changed 5 years ago by oco

Finally checked with hrev48123 ISO : it does not crash anymore.

Well done !

comment:83 Changed 5 years ago by bonefish

Resolution: fixed
Status: assignedclosed

Thanks! Closing. I have the feeling that we might see this ASSERT again. After all, it catches inconsistencies in the page tables which can (and did) have different causes. Since this ticket has become awfully long, I'd prefer a new ticket to keep track of remaining bugs should any still remain.

comment:84 Changed 5 years ago by pulkomandy

Milestone: R1/alpha5R1/beta1

Changed 4 years ago by dsuden

Attachment: 20150109_135535.jpg added

Still seeing that KDL in Haiku shredder 1 hrev48411 Dec 1 2014 14:26:35 BePC x86 Haiku

comment:85 Changed 4 years ago by pulkomandy

Quoting bonefish:

Since this ticket has become awfully long, I'd prefer a new ticket to keep track of remaining bugs should any still remain.

Please do this and open a new ticket. While the assertion you hit is the same, the root cause is likely different, and it's better to start fresh on investigating it.

comment:86 Changed 4 years ago by anevilyak

That's already been done, c.f. ticket #11744.

Note: See TracTickets for help on using tickets.