Opened 17 years ago

Closed 17 years ago

#2021 closed bug (fixed)

AHCI fatal error when checking out the SVN tree on Haiku r24735

Reported by: ddew Owned by: marcusoverhagen
Priority: blocker Milestone: R1
Component: Drivers/Disk Version: R1/pre-alpha1
Keywords: Cc: bga, anevilyak
Blocked By: Blocking:
Platform: All

Description

When I tried checking out the full SVN tree on hrev24735 i get thrown into KDL after downloading a few hundred MB. The ahci controller is an ICH9.

PANIC: ahci fatal error, is 0x20000000

Attachments (3)

DSC02577.JPG (108.2 KB ) - added by bga 17 years ago.
Backtrace from KDL
syslog.txt (100.2 KB ) - added by bga 17 years ago.
Syslog output
sc_traced.jpg (67.0 KB ) - added by bga 17 years ago.
Panic + Stack Crawl + Traced

Download all attachments as: .zip

Change History (35)

comment:1 by axeld, 17 years ago

Component: - GeneralDrivers/Disk
Owner: changed from axeld to marcusoverhagen

comment:2 by marcusoverhagen, 17 years ago

Status: newassigned

Is this reproduceable? Are any other ahci messages written to serial port or syslog before this happened?

Description of PxIS bit 29 (0x20000000): Host Bus Fatal Error Status (HBFS): Indicates that the HBA encountered a host bus error that it cannot recover from, such as a bad software pointer. In PCI, such an indication would be a target or master abort.

comment:3 by bga, 17 years ago

Marked bug #2175 as a duplicate of this.

BTW, I have been getting this KDL after unpacking the Haiku tree inside itself and usually when the system is mostly idle (the crash is in the idle thread).

comment:4 by bga, 17 years ago

Cc: bga added

by bga, 17 years ago

Attachment: DSC02577.JPG added

Backtrace from KDL

comment:5 by bga, 17 years ago

Cc: axeld added

Just added the backtrace from KDL in case it may give some hints. I noticed the block cache is involved and as it seems data is being corrupted in at least some of the structures related to it (like the block notifier) I would not rule-out this being a side effect of a different problem. Specially because, as I mentioned, I only started getting this fairly recently.

Also adding Axel as cc, in case he has some idea about this.

comment:6 by stippi, 17 years ago

Cc: axeld removed

Please do not add Axel (or Ingo, korli, myself...) in CCs. We are on the general bug email list and are already receiving most bug notification emails twice. This has been said multiple times (although maybe only in bug comments).

comment:7 by marcusoverhagen, 17 years ago

bga and ddew, have those machines that you are using less than 4 GB?

I suspect that the PRD table in ahci_port.cpp might contain invalid pointers like 0xdeadbeef.

To find this, it might help to use kernel tracing in AHCIPort::FillPrdTable() to log the data that is present in the FLOW("FillPrdTable: prd-entry %u, addr %p, size %lu\n", *prdCount, address, bytes); statement.

I can't do that change right now.

comment:8 by bga, 17 years ago

Mine has exactly 4 Gb. *BUT*I have 2 video cards with 512 Mb each so from the total of 4096 Mb, only 3072 Mb shows as available memory.

I will check this for you but can´t do that before the end of the day when I am back home.

comment:9 by bga, 17 years ago

I added kernel tracing to AHCI port and it seems your intuition was partially right. it is not a invalid address (like 0xdeadbeef) that seems to be causing this error but any address above the 2048 Mb mark will do that. For instance, tracing showed that this error occured when the following entry was added to the prd table:

address: 0x80006000 size: 2048

In other words, any idea about why AHCI would fail if accessing addresses above 0x80000000?

comment:10 by bga, 17 years ago

Just as a side note, it is now clear why I only recently started getting this error. I used to have only 2 Gb of memory on my machine so AHCI could never access any address above this. Now I have 4 Gb as mentioned above.

comment:11 by marcusoverhagen, 17 years ago

Can you please attach a syslog with the PCI data dump?

Perhaps the ahci device is sitting behind a misconfigured PCI-PCI bridge and receives a PCI master abort.

comment:12 by anevilyak, 17 years ago

Cc: anevilyak added

by bga, 17 years ago

Attachment: syslog.txt added

Syslog output

comment:13 by bga, 17 years ago

Attached the syslog output from my machine. Let me know if you need anything else.

comment:14 by marcusoverhagen, 17 years ago

I have no idea right now what could be wrong. Is this really happening with all addresses > 2GB? Perhaps printing the bios e820 memory map to syslog will give some insight.

comment:15 by marcusoverhagen, 17 years ago

does removing the firewire driver fix the problem?

comment:16 by bga, 17 years ago

I have no idea if any address after 2 Gb results in the problem, but empirical evidence seems to point to this (all the time it crashed, the last address added to the prd table was > 2 Gb).

Also removing the firewire bus manager didn't help.

by bga, 17 years ago

Attachment: sc_traced.jpg added

Panic + Stack Crawl + Traced

comment:17 by bga, 17 years ago

Priority: normalblocker

Ok, I did several tests now and after around 8 attempts it seems obvious to me that it is not exactly failing with addresses above 2 Gb. It is failing with exactly the same address all the time (0x80006000). I just attached a new KDL capture with the output of the traced command (showing the past 30 entries on it). Note that the oast entry is the address mentioned and this happened every single time.

Considering we know that this address causes problems, would adding debug checking for this specific address somewhere help?

comment:18 by marcusoverhagen, 17 years ago

Please use hrev25306 and define TRACE_MEMORY_MAP and ENABLE_SERIAL to generate a new syslog / serial log with it. I don't think haiku is using the information correctly.

FYI, this is from my board (4GB)

extended memory info (from 0xe820):

base 0x00000000, len 0x0009fc00, type 1 (memory) base 0x0009fc00, len 0x00000400, type 2 (reserved) base 0x000e3000, len 0x0001d000, type 2 (reserved) base 0x00100000, len 0xcfeb0000, type 1 (memory) base 0xcffb0000, len 0x0000e000, type 3 (ACPI reclaim) base 0xcffbe000, len 0x00032000, type 4 (ACPI NVS) base 0xcfff0000, len 0x00010000, type 2 (reserved) base 0xfee00000, len 0x00001000, type 2 (reserved) base 0xffb00000, len 0x00500000, type 2 (reserved) base 0x100000000, len 0x30000000, type 1 (memory)

phys memory ranges:

base 0x00000000, length 0xcffb0000

allocated phys memory ranges:

base 0x0009f000, length 0x00321000

allocated virt memory ranges:

base 0x80000000, length 0x006c5000

comment:19 by bga, 17 years ago

Here is mine:

KERN: extended memory info (from 0xe820): KERN: base 0x00000000, len 0x0009ec00, type 1 (memory) KERN: base 0x0009ec00, len 0x00001400, type 2 (reserved) KERN: base 0x000e4000, len 0x0001c000, type 2 (reserved) KERN: base 0x00100000, len 0xbfe80000, type 1 (memory) KERN: base 0xbff80000, len 0x0000e000, type 3 (ACPI reclaim) KERN: base 0xbff8e000, len 0x00052000, type 4 (ACPI NVS) KERN: base 0xbffe0000, len 0x00020000, type 2 (reserved) KERN: base 0xe0000000, len 0x10000000, type 2 (reserved) KERN: base 0xfee00000, len 0x00001000, type 2 (reserved) KERN: base 0xffe00000, len 0x00200000, type 2 (reserved) KERN: base 0x100000000, len 0x40000000, type 1 (memory)

Funny enough, I didn't find the other information anywhere in the output (note I do not have a serial port, so I am checking syslog). I wonder if I missed a define somewhere...

comment:20 by bga, 17 years ago

Hi Marcus. Did the information I posted help at all? Also, any idea why the rest of the information was not printed? Any chance that, for some reason, that function is not being called at all in my configuration?

ddew, do you have a serial port on your machine so you can try to get more information?

comment:21 by ddew, 17 years ago

I do have a serial port on it but I still haven't had time to get hold of a serial cable, I'll hopefully be able to pick one up this weekend and do some testing.

in reply to:  20 comment:22 by jackburton, 17 years ago

Replying to bga:

Hi Marcus. Did the information I posted help at all? Also, any idea why the rest of the information was not printed? Any chance that, for some reason, that function is not being called at all in my configuration?

Some of the early boot logging doesn't end up in the syslog, it seems.

comment:23 by ddew, 17 years ago

Tried booting 25422 here and I still get bitten by the allocate_compage issue. As I can't readily remove ram to test the ahci bug I'll have to take a raincheck on capturing the log until I can boot Haiku without taking any out.

comment:24 by marcusoverhagen, 17 years ago

bga the memory doesn't show anything unusual, although I don't know why not all output is printed.

if possible, please use hrev25523 and run the new "pcistatus" command from KDL after the panic happened.

comment:25 by marcusoverhagen, 17 years ago

According to the syslog from 05/02/08, the pci latency settings on your system appear so be in a normal range. Nothing unusually large, and that shouldn't cause any PCI errors anyway, just might have an performance impact for the other devices with low latency settings.

I wasn't able to reproduce this so far, as bug #2059 always happens earlier here. Please try the pcistatus command next.

comment:26 by bga, 17 years ago

I tried pcistatus. Everything was right according to it except for this:

domain 0, bus 0, dev 31, func 2, PCI device status 0x22b8

Received Master-Abort

Let me know if you need the full output.

comment:27 by bga, 17 years ago

And, btw, I have been reproducing this bug in a different way now. What I do is zip up the entire source tree (probably any large enough zip file will do) and move it to the image. Then I unpack it there. I have to do it twice but 100% of the time, it will panic during the second time the file is being unzipped.

comment:28 by marcusoverhagen, 17 years ago

This at least confirms that the ahci controler experiences a PCI master abort.

KERN: PCI: [dom 0, bus  0] bus   0, device 31, function  2: vendor 8086, device 2922, revision 02
632 	KERN: PCI:   class_base 01, class_function 06, class_api 01
633 	KERN: PCI:   vendor 8086: Intel Corporation
634 	KERN: PCI:   device 2922: 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller
635 	KERN: PCI:   info: Mass storage controller (SATA controller, AHCI 1.0)
636 	KERN: PCI:   line_size 00, latency 00, header_type 00, BIST 00
637 	KERN: PCI:   ROM base host 00000000, pci 00000000, size 00000000
638 	KERN: PCI:   cardbus_CIS 00000000, subsystem_id 8277, subsystem_vendor_id 1043
639 	KERN: PCI:   interrupt_line 03, interrupt_pin 02, min_grant 00, max_latency 00
640 	KERN: PCI:   base reg 0: host 00007c00, pci 00007c00, size 00000010, flags 09
641 	KERN: PCI:   base reg 1: host 00007880, pci 00007880, size 00000010, flags 0d
642 	KERN: PCI:   base reg 2: host 00007800, pci 00007800, size 00000010, flags 09
643 	KERN: PCI:   base reg 3: host 00007480, pci 00007480, size 00000010, flags 0d
644 	KERN: PCI:   base reg 4: host 00007400, pci 00007400, size 00000020, flags 01
645 	KERN: PCI:   base reg 5: host fe5fe800, pci fe5fe800, size 00000800, flags 00
646 	KERN: PCI:   Capabilities: MSI, PM, SATA, vendspec

comment:29 by marcusoverhagen, 17 years ago

I was finally able to reproduce this, too.

PANIC: ahci fatal error, is 0x20000000 Welcome to Kernel Debugging Land... Running on CPU 0 kdebug> pcistatus domain 0, bus 0, dev 0, func 0, PCI device status 0x1090

Received Target-Abort

domain 0, bus 0, dev 1, func 0, PCI device status 0x0010 domain 0, bus 0, dev 1, func 0, PCI bridge secondary status 0x0000 domain 0, bus 0, dev 1, func 0, PCI bridge control 0x001b domain 0, bus 1, dev 0, func 0, PCI device status 0x0018 domain 0, bus 1, dev 0, func 1, PCI device status 0x0010 domain 0, bus 0, dev 26, func 0, PCI device status 0x0290 domain 0, bus 0, dev 26, func 1, PCI device status 0x0290 domain 0, bus 0, dev 26, func 7, PCI device status 0x0290 domain 0, bus 0, dev 27, func 0, PCI device status 0x0010 domain 0, bus 0, dev 28, func 0, PCI device status 0x0010 domain 0, bus 0, dev 28, func 0, PCI bridge secondary status 0x0000 domain 0, bus 0, dev 28, func 0, PCI bridge control 0x0003 domain 0, bus 0, dev 28, func 4, PCI device status 0x0010 domain 0, bus 0, dev 28, func 4, PCI bridge secondary status 0x0000 domain 0, bus 0, dev 28, func 4, PCI bridge control 0x0003 domain 0, bus 0, dev 28, func 5, PCI device status 0x0010 domain 0, bus 0, dev 28, func 5, PCI bridge secondary status 0x0000 domain 0, bus 0, dev 28, func 5, PCI bridge control 0x0003 domain 0, bus 0, dev 29, func 0, PCI device status 0x0290 domain 0, bus 0, dev 29, func 1, PCI device status 0x0290 domain 0, bus 0, dev 29, func 2, PCI device status 0x0298 domain 0, bus 0, dev 29, func 3, PCI device status 0x0290 domain 0, bus 0, dev 29, func 7, PCI device status 0x0290 domain 0, bus 0, dev 30, func 0, PCI device status 0x0810

Signalled Target-Abort

domain 0, bus 0, dev 30, func 0, PCI bridge secondary status 0x2280

Received Master-Abort

domain 0, bus 0, dev 30, func 0, PCI bridge control 0x0823 domain 0, bus 5, dev 0, func 0, PCI device status 0x0290 domain 0, bus 5, dev 1, func 0, PCI device status 0x0230 domain 0, bus 0, dev 31, func 0, PCI device status 0x0210 domain 0, bus 0, dev 31, func 2, PCI device status 0x22b8

Received Master-Abort

domain 0, bus 0, dev 31, func 3, PCI device status 0x0280

kdebug> traced *

  1. [ 7] 566882313: ahci port - prd table - controller: 0x90d65240, index: 2, address: 0x7fff5000, size: 4096
  2. [ 7] 566883592: ahci port - prd table - controller: 0x90d65240, index: 2, address: 0x80004000, size: 4096

printed 30 entries within range 420210 to 420239 (30 of 420239 total, 420239 ever)

PCI: [dom 0, bus 0] bus 0, device 31, function 2: vendor 8086, device 2922, revision 02 PCI: class_base 01, class_function 06, class_api 01 PCI: vendor 8086: Intel Corporation PCI: device 2922: 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA AHCI Controller PCI: info: Mass storage controller (SATA controller, AHCI 1.0) PCI: line_size 00, latency 00, header_type 00, BIST 00 PCI: ROM base host 00000000, pci 00000000, size 00000000 PCI: cardbus_CIS 00000000, subsystem_id 7345, subsystem_vendor_id 1462 PCI: interrupt_line 0b, interrupt_pin 02, min_grant 00, max_latency 00 PCI: base reg 0: host 0000c000, pci 0000c000, size 00000010, flags 09 PCI: base reg 1: host 0000bc00, pci 0000bc00, size 00000010, flags 0d PCI: base reg 2: host 0000b880, pci 0000b880, size 00000010, flags 09 PCI: base reg 3: host 0000b800, pci 0000b800, size 00000010, flags 0d PCI: base reg 4: host 0000b480, pci 0000b480, size 00000020, flags 01 PCI: base reg 5: host dffff000, pci dffff000, size 00000800, flags

comment:30 by marcusoverhagen, 17 years ago

I found the problem:

PRD table:

kdebug> dw 0x8016d580 [0x8016d580] .p.............. 80007000 ffffffff 00000000 00000fff kdebug> dw 0x8016d590 [0x8016d590] .àÇ............. 09c7e000 00000000 00000000 000007ff

comment:31 by marcusoverhagen, 17 years ago

some test code:

#include <stdio.h>

typedef unsigned long addr_t; typedef unsigned long long uint64;

int main() {

char *p = (char *)0x80004000; printf("p = %p\n", p); printf("(unsigned long) = %lx\n", (unsigned long)p); printf("(uint64) = %llx\n", (uint64)p); printf("(uint64) (addr_t) = %llx\n", (uint64)(addr_t)p); printf("(uint64) (addr_t) >> 32 = %llx\n", ((uint64)(addr_t)p) >> 32); printf("(addr_t) >> 32 = %lx\n", ((addr_t)p) >> 32);

}

marcus@dagobert:~/develop/haiku/src$ gcc pointer.c pointer.c: In function ‘main’: pointer.c:11: warning: cast from pointer to integer of different size pointer.c:14: warning: right shift count >= width of type

marcus@dagobert:~/develop/haiku/src$ ./a.out p = 0x80004000 (unsigned long) = 80004000 (uint64) = ffffffff80004000 (uint64) (addr_t) = 80004000 (uint64) (addr_t) >> 32 = 0 (addr_t) >> 32 = 80004000

comment:32 by marcusoverhagen, 17 years ago

Resolution: fixed
Status: assignedclosed

The bug is fixed in hrev25557.

Note: See TracTickets for help on using tickets.