Opened 10 years ago

Closed 2 months ago

#11020 closed bug (not reproducible)

Boot hangs on rocket

Reported by: jstressman Owned by: pdziepak
Priority: normal Milestone: Unscheduled
Component: Drivers/Power Version: R1/Development
Keywords: boot-failure Cc:
Blocked By: Blocking:
Platform: All

Description

Revisions up to and including hrev46677 boot properly for me. But hrev46732 and above boot as far as the rocket and then just hang there.

I've safe mode booted and tried safe mode video, but that didn't help. I displayed debug output to the screen and disabled pageination and the output stops right after the message about:

REG: failed to open shadow passwd DB file "/etc/shadow": No such file or directory

That's the last line of the debug output. It just sits there after that and doesn't boot (I waited like 20 minutes and nothing had happened yet, multiple boots of that and other recent revisions had the same behavior.)

I'll attach my listdev output and syslog(s) from my working hrev46677 install.

Attachments (3)

listdev.txt (4.8 KB ) - added by jstressman 10 years ago.
syslog.old (512.0 KB ) - added by jstressman 10 years ago.
syslog.txt (481.4 KB ) - added by jstressman 10 years ago.
Second "half" of the syslog. See syslog.old for the first "half".

Download all attachments as: .zip

Change History (33)

by jstressman, 10 years ago

Attachment: listdev.txt added

by jstressman, 10 years ago

Attachment: syslog.old added

by jstressman, 10 years ago

Attachment: syslog.txt added

Second "half" of the syslog. See syslog.old for the first "half".

comment:1 by luroh, 10 years ago

Blocking: 7665 added

comment:2 by waddlesplash, 10 years ago

Component: - GeneralSystem/Kernel
Milestone: R1R1/beta1
Owner: changed from nobody to axeld
Priority: normalhigh

Still an issue, according to jstressman on IRC.

Looking at those syslogs, the line mentioned in the description is there:

4921	KERN: REG: Failed to open shadow passwd DB file "/etc/shadow": No such file or directory

But it doesn't stop the boot, oddly enough.

Could you bisect further to find the exact revision that causes the problem? You'll have to build those revisions yourself.

comment:3 by anevilyak, 10 years ago

Component: System/KernelSystem
Owner: changed from axeld to nobody
Priority: highnormal

Without a more specific revision, there's currently no information that specifically implicates the kernel. That stage of boot is when all the userland servers are being loaded anyways.

comment:4 by jstressman, 10 years ago

I'll see if I can track down a more specific revision. But in the meantime I accidentally stumbled across a work-around.

I plugged in a USB SD card adapter while it was sitting there frozen at the previously mentioned /etc/shadow line, and suddenly the boot resumed and finished successfully.

KERN:   mtrr:  0: base: 0xbf780000, size:    0x80000, type: 0
KERN:   mtrr:  1: base: 0xc0000000, size: 0x40000000, type: 0
KERN: add_memory_type_range(-1, 0x100000000, 0x240000000, 6)
KERN: set MTRRs to:
KERN:   mtrr:  0: base: 0xbf780000, size:    0x80000, type: 0
KERN:   mtrr:  1: base: 0xc0000000, size: 0x40000000, type: 0
KERN: kernel debugger extension "debugger/invalidate_on_exit/v1": loaded
KERN: kernel debugger extension "debugger/hangman/v1": loaded
KERN: kernel debugger extension "debugger/usb_keyboard/v1": loaded
KERN: kernel debugger extension "debugger/run_on_exit/v1": loaded
KERN: kernel debugger extension "debugger/disasm/v1": loaded
KERN: add_memory_type_range(93, 0xd0000000, 0x8ca000, 1)
KERN: set MTRRs to:
KERN:   mtrr:  0: base: 0xbf780000, size:    0x80000, type: 0
KERN:   mtrr:  1: base: 0xe0000000, size: 0x20000000, type: 0
KERN:   mtrr:  2: base: 0xc0000000, size: 0x40000000, type: 1
KERN: publish device: node 0xe219f9b8, path random, module bus_managers/random/device_v1
KERN: publish device: node 0xe219f9b8, path urandom, module bus_managers/random/device_v1
KERN: publish device: node 0xe219faf8, path power/button/power, module drivers/power/acpi_button/device_v1
KERN: publish device: node 0xe219fb48, path power/button/power_fixed, module drivers/power/acpi_button/device_v1
KERN: REG: Failed to open shadow passwd DB file "/etc/shadow": No such file or directory
KERN: usb hub 48: port 0: new device connected
KERN: package_daemon [79013430:   547] root at "/boot" (device: 3, node: 524288) registered
KERN: package_daemon [79015259:   547] volume at "/boot/system" registered
KERN: package_daemon [79017021:   547] volume at "/boot/home/config" registered
KERN: register_domain(9, unix)
KERN: unregister_domain(0x82805200, 9, unix)
KERN: register_domain(5, internet6)
KERN: unregister_domain(0x82bd4800, 5, internet6)
KERN: register_domain(5, internet6)
KERN: unregister_domain(0x82bd4c80, 5, internet6)
KERN: register_domain(1, internet)
KERN: unregister_domain(0x82bd4c80, 1, internet)
KERN: register_domain(5, internet6)
KERN: unregister_domain(0x828051c0, 5, internet6)
KERN: register_domain(5, internet6)
KERN: unregister_domain(0x828051c0, 5, internet6)

I tried it with just the USB adapter, without the SD card plugged into it, and it only registers the device.... then goes no further, even if you add the SD card, plug and unplug, etc.

KERN: usb hub 48: port 0: new device connected

Nothing after that no matter what. It needs the SD card in the adapter.

I might try it again in a minute with another USB drive as well, but with the SD card adapter I can repeatedly boot now using that method (and am writing this from within the latest nightly, hrev48579 , installed on the hardware.)

Next I'll see if I can track this down to a specific revision.

comment:5 by diver, 10 years ago

Component: SystemDrivers/USB
Owner: changed from nobody to mmlr

There're a lot of Last message repeated x times messages in your syslog. But I couldn't find which message it actually is. In any way, it sounds like a problem with usb drivers.

comment:6 by jstressman, 10 years ago

OK... after doing some "git bisect" investigating, and double checking with extra clean checkouts, we've narrowed down the introduction of the boot problem to Pawel's scheduler changes in hrev46690

justin@kuroiishi:~/Downloads/Work/Haiku-stuff/source/haiku$ git bisect bad
7db89e8dc395db73368479fd9817b2b67899f3f6 is the first bad commit
commit 7db89e8dc395db73368479fd9817b2b67899f3f6
Author: Pawel Dziepak <pdziepak@quarnos.org>
Date:   Mon Nov 25 23:50:27 2013 +0100

    kernel: Rework cpuidle module
    
     * Create new interface for cpuidle modules (similar to the cpufreq
       interface)
     * Generic cpuidle module is no longer needed
     * Fix and update Intel C-State module

:040000 040000 d6d4c434399da0e69156647e39bed1ebdb778a74 fcfcbc1b845bdc8c2e81c74c5273f95a37639f2b M	build
:040000 040000 9c13aa292779e9727b668f115b24964ca901e395 36a062667b6180e3aaf582f26304bf5d714f6b24 M	headers
:040000 040000 67dc2e11188800a618d27dcd181af3f91dd589a9 e1cee28ce54667209e412294131cdcbf61371619 M	src

I have no idea how this has led to the /etc/shadow error, but it appears to be the problematic revision.

On top of this bug, I noticed another bug that I've filed another ticket for (#11685);

The CPU up through hrev46689 would idle between 0 and 2%... almost flatline... but starting with hrev46690, it will spike every few seconds for anywhere from half a second to a few seconds during which the system basically freezes. This makes the system almost unusable for getting any work done.

I'll attach screenshots of ActivityMonitor on a fresh boot of hrev46689 and the latest nightly hrev48579 (which behaves the same as hrev46690). These are both just sitting totally idle with nothing else running.

comment:7 by waddlesplash, 10 years ago

I don't think the "/etc/shadow" error has anything to do with this (it appears in the good syslogs too) it just happens to be the last thing to happen before CPU-idle kicks in.

comment:8 by jstressman, 10 years ago

Ah, that makes sense. Thank you.

comment:9 by mmlr, 10 years ago

Component: Drivers/USB- General
Owner: changed from mmlr to nobody
Status: newassigned

The repeat messages stem from the wireless driver probe requests:

KERN: [net/atheroswifi/0] ieee80211_ref_node (ieee80211_send_probereq:1748) 0xe2136000<1c:bd:b9:d2:eb:c7> refcnt 3
KERN: Last message repeated 20 times.
Last message repeated 2 times

The KERN prefixed ones are from the kernel. The first repeat is the repeat indicator of the last kernel log message (the wifi one). The unprefixed one is the repeat indicator of the syslog daemon. In this case the kernel repeat of 20 is itself repeated twice. In an otherwise idle system this does not look problematic.

As noted above the "/etc/shadow" message is normal. The shadow password file is only created if you actually set a password for your user or create other users.

As a side note: From reading the syslog it looks like you are running your disk controller in IDE/compatibility mode which, except for actual compatibility with already installed other OSes that require it, is rather suboptimal. Using AHCI mode reduces the overhead of disk IO greatly.

The actual problem is unfortunately still hard to diagnose with the given revision. Even though it is only a single hrev, it points to a merge commit that brings in a lot of changes at once. Note also that there were some issues with the merged scheduler branch early on that only got exposed on certain hardware and were subsequently fixed in later hrevs.

Ideally you could bisect the scheduler branch itself, to find the commit within that first caused these symptoms. Unfortunately there are no pre-built images of that branch AFAIR.

comment:10 by waddlesplash, 10 years ago

Actually, look at the commit that Git listed as causing the issue: it's not the scheduler merge itself, it's one of the commits that was included in the merge. Looks like Git was smart enough to figure out there was a merge, and bisected everything in the merge, too.

comment:11 by jstressman, 10 years ago

Almost all of the bisections after the initial one seemed to focus on hrev46690 (unless I was reading the logs wrong?).

For example;

~/source/haiku> git describe --tag
hrev46160-148-g7db89e8
~/source/haiku> git tag --contains hrev46160-148-g7db89e8 | head -n1
hrev46690
~/source/haiku>

I have no problem building images and testing them etc. I just might need help finding my way around git (as folks have helpfully provided thus far).

comment:12 by waddlesplash, 10 years ago

That makes sense -- IIRC, Pawel branched master off for his scheduler work at hrev46160, so -148 means 148 commits into his scheduler branch.

comment:13 by pulkomandy, 10 years ago

Owner: changed from nobody to pdziepak

comment:15 by jstressman, 10 years ago

An update... I blacklisted the cpuidle module in add-ons/kernel/power/cpuidle and both of my problems disappeared. (this one, and #11685)

hrev48579

comment:16 by diver, 10 years ago

Component: - GeneralDrivers/Power

comment:17 by waddlesplash, 10 years ago

Does the CPUIdle module count as the "power" module? AFAIK, it's really part of the scheduler.

in reply to:  17 comment:18 by anevilyak, 10 years ago

Replying to waddlesplash:

Does the CPUIdle module count as the "power" module? AFAIK, it's really part of the scheduler.

http://cgit.haiku-os.org/haiku/tree/src/add-ons/kernel/power/cpuidle

Last edited 10 years ago by anevilyak (previous) (diff)

comment:19 by michaelvoliveira, 9 years ago

The same here

comment:20 by diver, 9 years ago

jstressman, any updates on this one? Have you been able to bisect it any further?

comment:21 by bruno, 8 years ago

I have a similar problem now since 30.06.2016 i cannot boot behond the rocket icon... I had no problems before to update Haiku via pkgman... disabeling and/or use save mode has same result in stopping by the rocket icon... I opened a bug report as well. Maybe it is a different problem...

I think if i do a clean new install it will work... but I have by now some data I do not want to loose... also it might a problem for future Haiku users if the problem cannot be resolved... so I wait until i can get Haiku to update again using pkgman...

comment:22 by waddlesplash, 8 years ago

jstressman, any updates on this one? Have you been able to bisect it any further?

As seen in comment:6, the problem is the new power module.

comment:23 by waddlesplash, 6 years ago

Keywords: boot-failure added

comment:24 by waddlesplash, 6 years ago

Blocking: 7665 removed

comment:25 by waddlesplash, 6 years ago

Milestone: R1/beta1Unscheduled

Since nobody (or at least very few?) others has reported this issue on their hardware, moving out of beta1 milestone as low-priority.

comment:26 by thaflo, 6 years ago

It happened here; I removed all repos and added only the r1beta1 repo. Then I did a pkgman update, which downloaded very few files (I had to update from somehting 51xxx) Now, after reboot, it hangs on the rocket. VmWare (gcc2) - Can't access boot loader menu, as VMWWare boots that fast and SHIFT or SPACE don't call the menu. Update: Could no access the bootloader and got the on screen debug output: https://i.imgur.com/LYR25s6.png

Last edited 6 years ago by thaflo (previous) (diff)

comment:27 by diver, 6 years ago

libz.so is somehow got corrupted which broke loading launch_daemon.

comment:28 by waddlesplash, 2 months ago

Is this still a problem, after the intervening scheduler, power, and kernel synchronization primitives fixes over the years since?

comment:29 by jstressman, 2 months ago

I don't have a way to check as I finally retired that computer a year or two ago. Probably safe to retire this one and if someone actually runs into the problem we can reopen it or open a new one.

comment:30 by waddlesplash, 2 months ago

Resolution: not reproducible
Status: assignedclosed

Thanks for responding!

Note: See TracTickets for help on using tickets.