Opened 17 years ago
Closed 17 years ago
#2021 closed bug (fixed)
AHCI fatal error when checking out the SVN tree on Haiku r24735
Reported by: | ddew | Owned by: | marcusoverhagen |
---|---|---|---|
Priority: | blocker | Milestone: | R1 |
Component: | Drivers/Disk | Version: | R1/pre-alpha1 |
Keywords: | Cc: | bga, anevilyak | |
Blocked By: | Blocking: | ||
Platform: | All |
Description
When I tried checking out the full SVN tree on hrev24735 i get thrown into KDL after downloading a few hundred MB. The ahci controller is an ICH9.
PANIC: ahci fatal error, is 0x20000000
Attachments (3)
Change History (35)
comment:1 by , 17 years ago
Component: | - General → Drivers/Disk |
---|---|
Owner: | changed from | to
comment:2 by , 17 years ago
Status: | new → assigned |
---|
comment:3 by , 17 years ago
Marked bug #2175 as a duplicate of this.
BTW, I have been getting this KDL after unpacking the Haiku tree inside itself and usually when the system is mostly idle (the crash is in the idle thread).
comment:4 by , 17 years ago
Cc: | added |
---|
comment:5 by , 17 years ago
Cc: | added |
---|
Just added the backtrace from KDL in case it may give some hints. I noticed the block cache is involved and as it seems data is being corrupted in at least some of the structures related to it (like the block notifier) I would not rule-out this being a side effect of a different problem. Specially because, as I mentioned, I only started getting this fairly recently.
Also adding Axel as cc, in case he has some idea about this.
comment:6 by , 17 years ago
Cc: | removed |
---|
Please do not add Axel (or Ingo, korli, myself...) in CCs. We are on the general bug email list and are already receiving most bug notification emails twice. This has been said multiple times (although maybe only in bug comments).
comment:7 by , 17 years ago
bga and ddew, have those machines that you are using less than 4 GB?
I suspect that the PRD table in ahci_port.cpp might contain invalid pointers like 0xdeadbeef.
To find this, it might help to use kernel tracing in AHCIPort::FillPrdTable() to log the data that is present in the FLOW("FillPrdTable: prd-entry %u, addr %p, size %lu\n", *prdCount, address, bytes); statement.
I can't do that change right now.
comment:8 by , 17 years ago
Mine has exactly 4 Gb. *BUT*I have 2 video cards with 512 Mb each so from the total of 4096 Mb, only 3072 Mb shows as available memory.
I will check this for you but can´t do that before the end of the day when I am back home.
comment:9 by , 17 years ago
I added kernel tracing to AHCI port and it seems your intuition was partially right. it is not a invalid address (like 0xdeadbeef) that seems to be causing this error but any address above the 2048 Mb mark will do that. For instance, tracing showed that this error occured when the following entry was added to the prd table:
address: 0x80006000 size: 2048
In other words, any idea about why AHCI would fail if accessing addresses above 0x80000000?
comment:10 by , 17 years ago
Just as a side note, it is now clear why I only recently started getting this error. I used to have only 2 Gb of memory on my machine so AHCI could never access any address above this. Now I have 4 Gb as mentioned above.
comment:11 by , 17 years ago
Can you please attach a syslog with the PCI data dump?
Perhaps the ahci device is sitting behind a misconfigured PCI-PCI bridge and receives a PCI master abort.
comment:12 by , 17 years ago
Cc: | added |
---|
comment:13 by , 17 years ago
Attached the syslog output from my machine. Let me know if you need anything else.
comment:14 by , 17 years ago
I have no idea right now what could be wrong. Is this really happening with all addresses > 2GB? Perhaps printing the bios e820 memory map to syslog will give some insight.
comment:16 by , 17 years ago
I have no idea if any address after 2 Gb results in the problem, but empirical evidence seems to point to this (all the time it crashed, the last address added to the prd table was > 2 Gb).
Also removing the firewire bus manager didn't help.
comment:17 by , 17 years ago
Priority: | normal → blocker |
---|
Ok, I did several tests now and after around 8 attempts it seems obvious to me that it is not exactly failing with addresses above 2 Gb. It is failing with exactly the same address all the time (0x80006000). I just attached a new KDL capture with the output of the traced command (showing the past 30 entries on it). Note that the oast entry is the address mentioned and this happened every single time.
Considering we know that this address causes problems, would adding debug checking for this specific address somewhere help?
comment:18 by , 17 years ago
Please use hrev25306 and define TRACE_MEMORY_MAP and ENABLE_SERIAL to generate a new syslog / serial log with it. I don't think haiku is using the information correctly.
FYI, this is from my board (4GB)
extended memory info (from 0xe820):
base 0x00000000, len 0x0009fc00, type 1 (memory) base 0x0009fc00, len 0x00000400, type 2 (reserved) base 0x000e3000, len 0x0001d000, type 2 (reserved) base 0x00100000, len 0xcfeb0000, type 1 (memory) base 0xcffb0000, len 0x0000e000, type 3 (ACPI reclaim) base 0xcffbe000, len 0x00032000, type 4 (ACPI NVS) base 0xcfff0000, len 0x00010000, type 2 (reserved) base 0xfee00000, len 0x00001000, type 2 (reserved) base 0xffb00000, len 0x00500000, type 2 (reserved) base 0x100000000, len 0x30000000, type 1 (memory)
phys memory ranges:
base 0x00000000, length 0xcffb0000
allocated phys memory ranges:
base 0x0009f000, length 0x00321000
allocated virt memory ranges:
base 0x80000000, length 0x006c5000
comment:19 by , 17 years ago
Here is mine:
KERN: extended memory info (from 0xe820): KERN: base 0x00000000, len 0x0009ec00, type 1 (memory) KERN: base 0x0009ec00, len 0x00001400, type 2 (reserved) KERN: base 0x000e4000, len 0x0001c000, type 2 (reserved) KERN: base 0x00100000, len 0xbfe80000, type 1 (memory) KERN: base 0xbff80000, len 0x0000e000, type 3 (ACPI reclaim) KERN: base 0xbff8e000, len 0x00052000, type 4 (ACPI NVS) KERN: base 0xbffe0000, len 0x00020000, type 2 (reserved) KERN: base 0xe0000000, len 0x10000000, type 2 (reserved) KERN: base 0xfee00000, len 0x00001000, type 2 (reserved) KERN: base 0xffe00000, len 0x00200000, type 2 (reserved) KERN: base 0x100000000, len 0x40000000, type 1 (memory)
Funny enough, I didn't find the other information anywhere in the output (note I do not have a serial port, so I am checking syslog). I wonder if I missed a define somewhere...
follow-up: 22 comment:20 by , 17 years ago
Hi Marcus. Did the information I posted help at all? Also, any idea why the rest of the information was not printed? Any chance that, for some reason, that function is not being called at all in my configuration?
ddew, do you have a serial port on your machine so you can try to get more information?
comment:21 by , 17 years ago
I do have a serial port on it but I still haven't had time to get hold of a serial cable, I'll hopefully be able to pick one up this weekend and do some testing.
comment:22 by , 17 years ago
Replying to bga:
Hi Marcus. Did the information I posted help at all? Also, any idea why the rest of the information was not printed? Any chance that, for some reason, that function is not being called at all in my configuration?
Some of the early boot logging doesn't end up in the syslog, it seems.
comment:23 by , 17 years ago
Tried booting 25422 here and I still get bitten by the allocate_compage issue. As I can't readily remove ram to test the ahci bug I'll have to take a raincheck on capturing the log until I can boot Haiku without taking any out.
comment:24 by , 17 years ago
bga the memory doesn't show anything unusual, although I don't know why not all output is printed.
if possible, please use hrev25523 and run the new "pcistatus" command from KDL after the panic happened.
comment:25 by , 17 years ago
According to the syslog from 05/02/08, the pci latency settings on your system appear so be in a normal range. Nothing unusually large, and that shouldn't cause any PCI errors anyway, just might have an performance impact for the other devices with low latency settings.
I wasn't able to reproduce this so far, as bug #2059 always happens earlier here. Please try the pcistatus command next.
comment:26 by , 17 years ago
I tried pcistatus. Everything was right according to it except for this:
domain 0, bus 0, dev 31, func 2, PCI device status 0x22b8
Received Master-Abort
Let me know if you need the full output.
comment:27 by , 17 years ago
And, btw, I have been reproducing this bug in a different way now. What I do is zip up the entire source tree (probably any large enough zip file will do) and move it to the image. Then I unpack it there. I have to do it twice but 100% of the time, it will panic during the second time the file is being unzipped.
comment:28 by , 17 years ago
This at least confirms that the ahci controler experiences a PCI master abort.
KERN: PCI: [dom 0, bus 0] bus 0, device 31, function 2: vendor 8086, device 2922, revision 02 632 KERN: PCI: class_base 01, class_function 06, class_api 01 633 KERN: PCI: vendor 8086: Intel Corporation 634 KERN: PCI: device 2922: 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller 635 KERN: PCI: info: Mass storage controller (SATA controller, AHCI 1.0) 636 KERN: PCI: line_size 00, latency 00, header_type 00, BIST 00 637 KERN: PCI: ROM base host 00000000, pci 00000000, size 00000000 638 KERN: PCI: cardbus_CIS 00000000, subsystem_id 8277, subsystem_vendor_id 1043 639 KERN: PCI: interrupt_line 03, interrupt_pin 02, min_grant 00, max_latency 00 640 KERN: PCI: base reg 0: host 00007c00, pci 00007c00, size 00000010, flags 09 641 KERN: PCI: base reg 1: host 00007880, pci 00007880, size 00000010, flags 0d 642 KERN: PCI: base reg 2: host 00007800, pci 00007800, size 00000010, flags 09 643 KERN: PCI: base reg 3: host 00007480, pci 00007480, size 00000010, flags 0d 644 KERN: PCI: base reg 4: host 00007400, pci 00007400, size 00000020, flags 01 645 KERN: PCI: base reg 5: host fe5fe800, pci fe5fe800, size 00000800, flags 00 646 KERN: PCI: Capabilities: MSI, PM, SATA, vendspec
comment:29 by , 17 years ago
I was finally able to reproduce this, too.
PANIC: ahci fatal error, is 0x20000000 Welcome to Kernel Debugging Land... Running on CPU 0 kdebug> pcistatus domain 0, bus 0, dev 0, func 0, PCI device status 0x1090
Received Target-Abort
domain 0, bus 0, dev 1, func 0, PCI device status 0x0010 domain 0, bus 0, dev 1, func 0, PCI bridge secondary status 0x0000 domain 0, bus 0, dev 1, func 0, PCI bridge control 0x001b domain 0, bus 1, dev 0, func 0, PCI device status 0x0018 domain 0, bus 1, dev 0, func 1, PCI device status 0x0010 domain 0, bus 0, dev 26, func 0, PCI device status 0x0290 domain 0, bus 0, dev 26, func 1, PCI device status 0x0290 domain 0, bus 0, dev 26, func 7, PCI device status 0x0290 domain 0, bus 0, dev 27, func 0, PCI device status 0x0010 domain 0, bus 0, dev 28, func 0, PCI device status 0x0010 domain 0, bus 0, dev 28, func 0, PCI bridge secondary status 0x0000 domain 0, bus 0, dev 28, func 0, PCI bridge control 0x0003 domain 0, bus 0, dev 28, func 4, PCI device status 0x0010 domain 0, bus 0, dev 28, func 4, PCI bridge secondary status 0x0000 domain 0, bus 0, dev 28, func 4, PCI bridge control 0x0003 domain 0, bus 0, dev 28, func 5, PCI device status 0x0010 domain 0, bus 0, dev 28, func 5, PCI bridge secondary status 0x0000 domain 0, bus 0, dev 28, func 5, PCI bridge control 0x0003 domain 0, bus 0, dev 29, func 0, PCI device status 0x0290 domain 0, bus 0, dev 29, func 1, PCI device status 0x0290 domain 0, bus 0, dev 29, func 2, PCI device status 0x0298 domain 0, bus 0, dev 29, func 3, PCI device status 0x0290 domain 0, bus 0, dev 29, func 7, PCI device status 0x0290 domain 0, bus 0, dev 30, func 0, PCI device status 0x0810
Signalled Target-Abort
domain 0, bus 0, dev 30, func 0, PCI bridge secondary status 0x2280
Received Master-Abort
domain 0, bus 0, dev 30, func 0, PCI bridge control 0x0823 domain 0, bus 5, dev 0, func 0, PCI device status 0x0290 domain 0, bus 5, dev 1, func 0, PCI device status 0x0230 domain 0, bus 0, dev 31, func 0, PCI device status 0x0210 domain 0, bus 0, dev 31, func 2, PCI device status 0x22b8
Received Master-Abort
domain 0, bus 0, dev 31, func 3, PCI device status 0x0280
kdebug> traced *
- [ 7] 566882313: ahci port - prd table - controller: 0x90d65240, index: 2, address: 0x7fff5000, size: 4096
- [ 7] 566883592: ahci port - prd table - controller: 0x90d65240, index: 2, address: 0x80004000, size: 4096
printed 30 entries within range 420210 to 420239 (30 of 420239 total, 420239 ever)
PCI: [dom 0, bus 0] bus 0, device 31, function 2: vendor 8086, device 2922, revision 02 PCI: class_base 01, class_function 06, class_api 01 PCI: vendor 8086: Intel Corporation PCI: device 2922: 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller PCI: info: Mass storage controller (SATA controller, AHCI 1.0) PCI: line_size 00, latency 00, header_type 00, BIST 00 PCI: ROM base host 00000000, pci 00000000, size 00000000 PCI: cardbus_CIS 00000000, subsystem_id 7345, subsystem_vendor_id 1462 PCI: interrupt_line 0b, interrupt_pin 02, min_grant 00, max_latency 00 PCI: base reg 0: host 0000c000, pci 0000c000, size 00000010, flags 09 PCI: base reg 1: host 0000bc00, pci 0000bc00, size 00000010, flags 0d PCI: base reg 2: host 0000b880, pci 0000b880, size 00000010, flags 09 PCI: base reg 3: host 0000b800, pci 0000b800, size 00000010, flags 0d PCI: base reg 4: host 0000b480, pci 0000b480, size 00000020, flags 01 PCI: base reg 5: host dffff000, pci dffff000, size 00000800, flags
comment:30 by , 17 years ago
I found the problem:
PRD table:
kdebug> dw 0x8016d580 [0x8016d580] .p.............. 80007000 ffffffff 00000000 00000fff kdebug> dw 0x8016d590 [0x8016d590] .àÇ............. 09c7e000 00000000 00000000 000007ff
comment:31 by , 17 years ago
some test code:
#include <stdio.h>
typedef unsigned long addr_t; typedef unsigned long long uint64;
int main() {
char *p = (char *)0x80004000; printf("p = %p\n", p); printf("(unsigned long) = %lx\n", (unsigned long)p); printf("(uint64) = %llx\n", (uint64)p); printf("(uint64) (addr_t) = %llx\n", (uint64)(addr_t)p); printf("(uint64) (addr_t) >> 32 = %llx\n", ((uint64)(addr_t)p) >> 32); printf("(addr_t) >> 32 = %lx\n", ((addr_t)p) >> 32);
}
marcus@dagobert:~/develop/haiku/src$ gcc pointer.c pointer.c: In function ‘main’: pointer.c:11: warning: cast from pointer to integer of different size pointer.c:14: warning: right shift count >= width of type
marcus@dagobert:~/develop/haiku/src$ ./a.out p = 0x80004000 (unsigned long) = 80004000 (uint64) = ffffffff80004000 (uint64) (addr_t) = 80004000 (uint64) (addr_t) >> 32 = 0 (addr_t) >> 32 = 80004000
comment:32 by , 17 years ago
Resolution: | → fixed |
---|---|
Status: | assigned → closed |
The bug is fixed in hrev25557.
Is this reproduceable? Are any other ahci messages written to serial port or syslog before this happened?
Description of PxIS bit 29 (0x20000000): Host Bus Fatal Error Status (HBFS): Indicates that the HBA encountered a host bus error that it cannot recover from, such as a bad software pointer. In PCI, such an indication would be a target or master abort.