Opened 18 months ago

Closed 5 months ago

Last modified 5 months ago

#18536 closed bug (fixed)

Haiku won't boot after update (regression)

Reported by: vidrep Owned by: jessicah
Priority: normal Milestone: R1/beta5
Component: Drivers/ACPI Version: R1/beta4
Keywords: Cc:
Blocked By: #18454 Blocking: #18792
Platform: All

Description

Acer AspireXC-704

Previously installed Haiku installation fails to boot after update to latest nightly.

DriveSetup shows raw disk only and does not recognize previous Haiku partitions.

I have narrowed the breakage to:

hrev53033 working hrev53038 not working

listdev and syslog attached

Attachments (17)

listdev.txt (2.8 KB ) - added by vidrep 18 months ago.
syslog.txt (218.7 KB ) - added by vidrep 18 months ago.
IMG_0245.JPG (887.3 KB ) - added by vidrep 18 months ago.
IMG_0246.JPG (896.2 KB ) - added by vidrep 18 months ago.
IMG_0247.JPG (1.0 MB ) - added by vidrep 18 months ago.
syslog (378.6 KB ) - added by vidrep 14 months ago.
IMG_0127.JPG (701.8 KB ) - added by vidrep 14 months ago.
IMG_0126.JPG (952.2 KB ) - added by vidrep 10 months ago.
IMG_0128.JPG (745.9 KB ) - added by vidrep 10 months ago.
IMG_0129.JPG (818.7 KB ) - added by vidrep 10 months ago.
IMG_0126.2.JPG (884.6 KB ) - added by vidrep 10 months ago.
IMG_0127.2.JPG (1.1 MB ) - added by vidrep 10 months ago.
syslog.2 (104.0 KB ) - added by vidrep 6 months ago.
IMG_0041.JPG (921.2 KB ) - added by vidrep 6 months ago.
IMG_0045.JPG (957.3 KB ) - added by vidrep 6 months ago.
syslog.3 (115.6 KB ) - added by vidrep 6 months ago.
syslog-redysz-hrev57980 (301.5 KB ) - added by Redysz 5 months ago.
syslog-redysz-hrev57980

Change History (65)

by vidrep, 18 months ago

Attachment: listdev.txt added

by vidrep, 18 months ago

Attachment: syslog.txt added

comment:1 by Starcrasher, 18 months ago

I was puzzled by numbers and checked the joined sylog. Of course, it's between hrev57033 and hrev57038 but a lot happened during this period (Changes to XHCI, PCIe, usb-disk driver transition to new API..)... Could you precise where it hangs?

comment:2 by korli, 18 months ago

vidrep, can you enable on-screen logs and post the few last screenshots?

comment:3 by vidrep, 18 months ago

The syslog is from hrev57033 I've enabled on-screen debug logs and attached a couple of photos while booting (hrev57191).

by vidrep, 18 months ago

Attachment: IMG_0245.JPG added

by vidrep, 18 months ago

Attachment: IMG_0246.JPG added

by vidrep, 18 months ago

Attachment: IMG_0247.JPG added

comment:4 by waddlesplash, 18 months ago

Are you booting from USB or internal hard drive?

comment:5 by waddlesplash, 18 months ago

Component: Partitioning Systems/GPTDrivers/Disk/AHCI

<Vidrep_64> I'm booting from an internal hard drive

comment:6 by waddlesplash, 14 months ago

Presumably hrev57034 (note multiple commits in that hrev) is the cause here. So this would then be a problem with the new PCIe support code.

I've asked Vidrep to try booting with ACPI disabled, as that will use the older configuration mechanism and may reveal differences.

comment:7 by vidrep, 14 months ago

I was able to successfully install a current nightly build only if "disable ACPI" is first selected in the boot menu. Otherwise the system will KDL on the fourth icon during boot. It will KDL during shutdown regardless.

comment:8 by waddlesplash, 14 months ago

Component: Drivers/Disk/AHCIDrivers/ACPI

The problem is likely in the new ECAM PCI mechanism and the ACPI bus, then, based on the commit range.

comment:9 by waddlesplash, 14 months ago

Milestone: UnscheduledR1/beta5

comment:10 by vidrep, 14 months ago

syslog with ACPI enebled and reboot with it disabled

by vidrep, 14 months ago

Attachment: syslog added

comment:11 by vidrep, 14 months ago

System KDL when shutting down or rebooting Photo attached

by vidrep, 14 months ago

Attachment: IMG_0127.JPG added

comment:12 by waddlesplash, 12 months ago

Blocked By: 18454 added

Since the system boots with ACPI disabled, maybe related to #18454.

comment:13 by waddlesplash, 10 months ago

Please retest with a recent nightly.

comment:14 by vidrep, 10 months ago

Tested with hrev57645 x86_64

Fresh install booting from USB stick

Image on USB stick boots to installation screen, but the hard disk and its Haiku partitions are not seen unless ACPI is disabled first. After disabling ACPI and installing to HDD the system will KDL on the fourth icon upon reboot. Diable ACPI allows it to boot to the desktop. The system will KDL again when shutting down or rebooting.

comment:15 by pulkomandy, 10 months ago

Unfortunately I can't find any boot in the non-working case in your attached files.

I'm interested in the PCI initialization which starts like this (with all the "PCI: ..." following that):

From the hrev57033 log:

238	KERN: PCI: mechanism addr: e0000000, seg: 0, start: 0, end: ff
239	KERN: PCI: mechanism pcie controller found
240	KERN: pci_init_deferred()

From the other log:

4556	KERN: PCI: pci_module_init
4557	KERN: acpi: ACPI disabled
4558	KERN: PCI: mechanism 1 controller found

(I find only cases with ACPI disabled).

I think you can extract the syslog while booting from USB with ACPI enabled.

I'm hopping to see something like this in current nightlies:

KERN: initialize PCI controller from ACPI
KERN: PCI: range from ACPI [0(1),fe(1)] with length ff
KERN: PCI: range from ACPI [0(1),cf7(1)] with length cf8
KERN: PCI: range from ACPI [d00(1),ffff(1)] with length f300
KERN: PCI: range from ACPI [a0000(1),bffff(1)] with length 20000
KERN: PCI: range from ACPI [c0000(1),c3fff(1)] with length 0
KERN: PCI: range from ACPI [c4000(1),c7fff(1)] with length 0
KERN: PCI: range from ACPI [c8000(1),cbfff(1)] with length 0
KERN: PCI: range from ACPI [cc000(1),cffff(1)] with length 0
KERN: PCI: range from ACPI [d0000(1),d3fff(1)] with length 0
KERN: PCI: range from ACPI [d4000(1),d7fff(1)] with length 0
KERN: PCI: range from ACPI [d8000(1),dbfff(1)] with length 0
KERN: PCI: range from ACPI [dc000(1),dffff(1)] with length 0
KERN: PCI: range from ACPI [e0000(1),e3fff(1)] with length 0
KERN: PCI: range from ACPI [e4000(1),e7fff(1)] with length 0
KERN: PCI: range from ACPI [e8000(1),ebfff(1)] with length 0
KERN: PCI: range from ACPI [ec000(1),effff(1)] with length 0
KERN: PCI: range from ACPI [f0000(1),fffff(1)] with length 0
KERN: PCI: range from ACPI [80800000(1),bfffffff(1)] with length 3f800000
KERN: PCI: range from ACPI [4000000000(1),7fffffffff(1)] with length 4000000000
KERN: PCI: mechanism addr: c0000000, seg: 0, start: 0, end: ff

(we have added quite a bit of debug info in here)

I have also noticed this log:

io-apic 0 entry count exceeds max supported, only using the first 64 entries
io-apic 0 has range 0-63, 64 entries, version 0x00720020, apic-id 1

maybe we should do something about that?

comment:16 by vidrep, 10 months ago

Would enabling any of the debug options in the boot menu be of any help?

comment:17 by pulkomandy, 10 months ago

I don't know, I need a log where the problem actually happens to investigate it, and I don't think I found one in the attached logs (maybe I missed it). They are either #rom before the problem, or with acpi disabled.

comment:18 by vidrep, 10 months ago

hrev57666 x86_64 ACPI enabled

Enable on screen debug output - IMG.0126.JPG

Display current boot loader log - IMG.0128.JPG, IMG.0129.JPG

Hopefully this helps.

by vidrep, 10 months ago

Attachment: IMG_0126.JPG added

by vidrep, 10 months ago

Attachment: IMG_0128.JPG added

by vidrep, 10 months ago

Attachment: IMG_0129.JPG added

comment:19 by pulkomandy, 10 months ago

Ok, there are some strange values in the ACPI range list similar to #18454. So I guess we need to fix that first and see if it helps here. Waddleplash had already guessed it, but now I think it's confirmed.

comment:20 by vidrep, 10 months ago

I have attached a couple more photos of debug output, but this time with on screen paging disabled. I was able to capture the last few lines before it was overwritten by the KDL.

IMG_0126.2.JPG

IMG_0127.2.JPG

Last edited 10 months ago by vidrep (previous) (diff)

by vidrep, 10 months ago

Attachment: IMG_0126.2.JPG added

by vidrep, 10 months ago

Attachment: IMG_0127.2.JPG added

comment:21 by waddlesplash, 10 months ago

hrev57044 may also be related here then.

comment:22 by vidrep, 10 months ago

Updated to hrev57683. Same problem as before - cannot boot without first disabling ACPI.

comment:23 by waddlesplash, 6 months ago

Please retest after hrev57812. If it still fails, please capture a new syslog and upload it.

comment:24 by vidrep, 6 months ago

It's still the same. I've attached a syslog.

by vidrep, 6 months ago

Attachment: syslog.2 added

comment:25 by waddlesplash, 6 months ago

This boot has ACPI disabled, can you capture a failed boot with ACPI enabled, or at least the tail end of it via taking a picture of debug output?

comment:26 by vidrep, 6 months ago

Here's a photo with on-screen debugging enabled

by vidrep, 6 months ago

Attachment: IMG_0041.JPG added

comment:27 by vidrep, 6 months ago

Here's another one with paging disabled

Last edited 6 months ago by vidrep (previous) (diff)

by vidrep, 6 months ago

Attachment: IMG_0045.JPG added

comment:28 by korli, 6 months ago

Could you try to blocklist the sdhci bus driver in the bootloader?

comment:29 by vidrep, 6 months ago

Didn't work. Still KDL on the fourth icon.

comment:30 by korli, 6 months ago

OK. Could you try to capture the next page of log after IMG_0041.JPG?

comment:31 by vidrep, 6 months ago

“Press key to continue” has no effect. The debug output is frozen at this point.

Last edited 6 months ago by vidrep (previous) (diff)

comment:32 by pulkomandy, 6 months ago

In the bootloader options you can select "disable paging", that will let the system boot as far as possible without having to press a key between each page.

There is a known issue that the keyboard in these boot screens stops working when usb starts being initialized, so the page-by-page output is only possible for the steps before that.

comment:33 by vidrep, 6 months ago

I've already provided screen shots with paging disabled. It scrolls by very quickly then KDL's (see IMG_0045.JPG). Otherwise it stalls after one page and I cannot proceed further.

Last edited 6 months ago by vidrep (previous) (diff)

comment:34 by pulkomandy, 6 months ago

Ok, so, to clarify the current situation:

  • Problems with PCI enumeration are now solved
  • The boot fails with "cannot find any boot partitions" after several errors from SATA / AHCI ("unable to hard reset device") and a DMA transfer that doesn't seem to stop.

Maybe the easiest way to investigate this is booting from USB with ACPI enabled (if I understand correctly, this should work fine, but you won't have access to the hard disk) and getting a complete syslog from there?

We can try to compare the AHCI initialization with the one in syslog.2 and see if we can understand why it's not working when ACPI is enabled.

comment:35 by vidrep, 6 months ago

I wrote hrev57909_x86_64 to a USB thumb drive and booted from it with ACPI enabled. Attached is the syslog.

Last edited 6 months ago by vidrep (previous) (diff)

by vidrep, 6 months ago

Attachment: syslog.3 added

comment:36 by pulkomandy, 6 months ago

The first difference I see is in the MSI interrupts handling:

490	KERN: allocate_io_interrupt_vectors: allocated 1 vectors starting from 17
491	KERN: msi_allocate_vectors: allocated 1 vectors starting from 17
492	KERN: msi enabled: 0x0001
493	KERN: ahci: using MSI vector 17

now becomes:

498	KERN: allocate_io_interrupt_vectors: allocated 1 vectors starting from 65
499	KERN: msi_allocate_vectors: allocated 1 vectors starting from 65
500	KERN: msi enabled: 0x0001
501	KERN: ahci: using MSI vector 65

but then it goes completely wrong, here is a normal access in syslog.2:

497	KERN: ahci: cap: Interface Speed Support: generation 3
498	KERN: ahci: cap: Number of Command Slots: 32 (raw 0x1f)
499	KERN: ahci: cap: Number of Ports: 2 (raw 0x1)
500	KERN: ahci: cap: Supports Port Multiplier: no
501	KERN: ahci: cap: Supports External SATA: no
502	KERN: ahci: cap: Enclosure Management Supported: no
503	KERN: ahci: cap: FIS-based Switching Control: no
504	KERN: ahci: cap: Supports Command List Override: yes
505	KERN: ahci: cap: Supports Staggered Spin-up: no
506	KERN: ahci: cap: Supports Mechanical Presence Switch: yes
507	KERN: ahci: cap: Supports 64-bit Addressing: yes
508	KERN: ahci: cap: Supports Native Command Queuing: yes
509	KERN: ahci: cap: Supports SNotification Register: no
510	KERN: ahci: cap: Supports Command List Override: yes
511	KERN: ahci: cap: Supports AHCI mode only: yes
512	KERN: ahci: cap2: DevSleep Entrance from Slumber Only: yes
513	KERN: ahci: cap2: Supports Aggressive Device Sleep Management: yes
514	KERN: ahci: cap2: Supports Device Sleep: yes
515	KERN: ahci: cap2: Automatic Partial to Slumber Transitions: yes
516	KERN: ahci: cap2: NVMHCI Present: no
517	KERN: ahci: cap2: BIOS/OS Handoff: no
518	KERN: ahci: ghc: AHCI Enable: yes
519	KERN: ahci: Ports Implemented Mask: 0x000003 Number of Available Ports: 2
520	KERN: ahci: AHCI Version 0001.03.01 Interrupt 17

In syslog.3 we are reading only ff bytes from the device it seems, everything is set:

505	KERN: ahci: cap: Interface Speed Support: generation 15
506	KERN: ahci: cap: Number of Command Slots: 32 (raw 0x1f)
507	KERN: ahci: cap: Number of Ports: 32 (raw 0x1f)
508	KERN: ahci: cap: Supports Port Multiplier: yes
509	KERN: ahci: cap: Supports External SATA: yes
510	KERN: ahci: cap: Enclosure Management Supported: yes
511	KERN: ahci: cap: FIS-based Switching Control: yes
512	KERN: ahci: cap: Supports Command List Override: yes
513	KERN: ahci: cap: Supports Staggered Spin-up: yes
514	KERN: ahci: cap: Supports Mechanical Presence Switch: yes
515	KERN: ahci: cap: Supports 64-bit Addressing: yes
516	KERN: ahci: cap: Supports Native Command Queuing: yes
517	KERN: ahci: cap: Supports SNotification Register: yes
518	KERN: ahci: cap: Supports Command List Override: yes
519	KERN: ahci: cap: Supports AHCI mode only: yes
520	KERN: ahci: cap2: DevSleep Entrance from Slumber Only: yes
521	KERN: ahci: cap2: Supports Aggressive Device Sleep Management: yes
522	KERN: ahci: cap2: Supports Device Sleep: yes
523	KERN: ahci: cap2: Automatic Partial to Slumber Transitions: yes
524	KERN: ahci: cap2: NVMHCI Present: yes
525	KERN: ahci: cap2: BIOS/OS Handoff: yes
526	KERN: ahci: ghc: AHCI Enable: yes
527	KERN: ahci: Ports Implemented Mask: 0xffffffff Number of Available Ports: 32
528	KERN: ahci: AHCI Version ffff.ff.ff Interrupt 65

From there on, the AHCI driver is of course completely confused, trying to create 32 ports, and finding out that all of them have DMA transfers active (since the bit indicating that is also stuck to 1).

comment:37 by pulkomandy, 6 months ago

I found a possible issue in the revision range and made a fix for it: https://review.haiku-os.org/c/haiku/+/7929

I don't know if that will fix the problem, however.

comment:38 by waddlesplash, 6 months ago

Please retest after hrev57913.

comment:39 by vidrep, 6 months ago

I updated from hrev57910 to hrev57921 and there is no change. The PC still KDL's as before. Also, my idualwifi 7260 is not functioning after the update. Perhaps related to https://dev.haiku-os.org/ticket/18966

Last edited 6 months ago by vidrep (previous) (diff)

comment:40 by pulkomandy, 5 months ago

Blocking: 18792 added

comment:41 by pulkomandy, 5 months ago

I noticed something strange.

We are applying a quirk to all Intel devices to configure the ports. This is done in AHCIController::ResetController.

This code reads the port count and should log a message if the port count is greater than 8. It doesn't log anything ("don't know how to enable SATA ports 9 to %d").

In AHCIController::Init, a few lines after calling ResetController, we again read the two variables used to compute the number of ports. This time, we find out that there are 32 ports.

ahci: cap: Number of Ports: 32 (raw 0x1f)
Ports Implemented Mask: 0xffffffff Number of Available Ports: 32

This means, at the time ResetController was called, the device was operating fine. But during or after applying the quirk, it stopped working?

I also verified that Linux applies the same quirk to this device. https://github.com/torvalds/linux/blob/master/drivers/ata/ahci.c#L1770

I guess the problem then is that the new ECAM register writing code is not correct. Either there is a problem with doing a 32 bit access instead of a 16 bit one, or there is a problem with endianness and we end up accessing to the wrong register.

I made two changes:

I appreciate if you can test them (together or separately, as you prefer)

Last edited 5 months ago by pulkomandy (previous) (diff)

comment:42 by Redysz, 5 months ago

I only know how to test it from iso nightly build. So I downloaded hrev57980 revision (there are the fixes, right?) and I still cannot see my hard drive. Only my memory stick is visible. I'm attaching the syslog file.

by Redysz, 5 months ago

Attachment: syslog-redysz-hrev57980 added

syslog-redysz-hrev57980

comment:43 by pulkomandy, 5 months ago

No, the patch is not merged yet in the nightlies. That's why I linked to the patch, otherwise I would give a hrev number to test with.

From the patch, you can find the comment by the commit checker robot, and from there a link to the test build: https://haiku.movingborders.es/testbuild/I8854350b70793160548fe96e15d95874deed4f7b/2/hrev57979/x86_64/

There you can find an iso image with that change.

comment:44 by vidrep, 5 months ago

When booting the test image from a USB thumb drive, the installer now "sees" my hard drive with its Haiku partitions again. So, this looks to be the correct fix for this issue.

comment:45 by Redysz, 5 months ago

Yes, it also works for me. I can install Haiku. After installation it does not boot up, but (probably) this is another problem.

comment:46 by vidrep, 5 months ago

I’ll do an install to the HDD later today and report back any issues, if any.

comment:47 by waddlesplash, 5 months ago

Resolution: fixed
Status: newclosed

Fix merged in hrev57981. Thanks for testing!

comment:48 by vidrep, 5 months ago

Test image successfully installed to my existing HDD partition and rebooted without a KDL. Thank you!

Note: See TracTickets for help on using tickets.