Opened 10 years ago
Closed 5 weeks ago
#11020 closed bug (not reproducible)
Boot hangs on rocket
Reported by: | jstressman | Owned by: | pdziepak |
---|---|---|---|
Priority: | normal | Milestone: | Unscheduled |
Component: | Drivers/Power | Version: | R1/Development |
Keywords: | boot-failure | Cc: | |
Blocked By: | Blocking: | ||
Platform: | All |
Description
Revisions up to and including hrev46677 boot properly for me. But hrev46732 and above boot as far as the rocket and then just hang there.
I've safe mode booted and tried safe mode video, but that didn't help. I displayed debug output to the screen and disabled pageination and the output stops right after the message about:
REG: failed to open shadow passwd DB file "/etc/shadow": No such file or directory
That's the last line of the debug output. It just sits there after that and doesn't boot (I waited like 20 minutes and nothing had happened yet, multiple boots of that and other recent revisions had the same behavior.)
I'll attach my listdev output and syslog(s) from my working hrev46677 install.
Attachments (3)
Change History (33)
by , 10 years ago
Attachment: | listdev.txt added |
---|
by , 10 years ago
Attachment: | syslog.old added |
---|
by , 10 years ago
Attachment: | syslog.txt added |
---|
comment:1 by , 10 years ago
Blocking: | 7665 added |
---|
comment:2 by , 10 years ago
Component: | - General → System/Kernel |
---|---|
Milestone: | R1 → R1/beta1 |
Owner: | changed from | to
Priority: | normal → high |
Still an issue, according to jstressman on IRC.
Looking at those syslogs, the line mentioned in the description is there:
4921 KERN: REG: Failed to open shadow passwd DB file "/etc/shadow": No such file or directory
But it doesn't stop the boot, oddly enough.
Could you bisect further to find the exact revision that causes the problem? You'll have to build those revisions yourself.
comment:3 by , 10 years ago
Component: | System/Kernel → System |
---|---|
Owner: | changed from | to
Priority: | high → normal |
Without a more specific revision, there's currently no information that specifically implicates the kernel. That stage of boot is when all the userland servers are being loaded anyways.
comment:4 by , 10 years ago
I'll see if I can track down a more specific revision. But in the meantime I accidentally stumbled across a work-around.
I plugged in a USB SD card adapter while it was sitting there frozen at the previously mentioned /etc/shadow line, and suddenly the boot resumed and finished successfully.
KERN: mtrr: 0: base: 0xbf780000, size: 0x80000, type: 0 KERN: mtrr: 1: base: 0xc0000000, size: 0x40000000, type: 0 KERN: add_memory_type_range(-1, 0x100000000, 0x240000000, 6) KERN: set MTRRs to: KERN: mtrr: 0: base: 0xbf780000, size: 0x80000, type: 0 KERN: mtrr: 1: base: 0xc0000000, size: 0x40000000, type: 0 KERN: kernel debugger extension "debugger/invalidate_on_exit/v1": loaded KERN: kernel debugger extension "debugger/hangman/v1": loaded KERN: kernel debugger extension "debugger/usb_keyboard/v1": loaded KERN: kernel debugger extension "debugger/run_on_exit/v1": loaded KERN: kernel debugger extension "debugger/disasm/v1": loaded KERN: add_memory_type_range(93, 0xd0000000, 0x8ca000, 1) KERN: set MTRRs to: KERN: mtrr: 0: base: 0xbf780000, size: 0x80000, type: 0 KERN: mtrr: 1: base: 0xe0000000, size: 0x20000000, type: 0 KERN: mtrr: 2: base: 0xc0000000, size: 0x40000000, type: 1 KERN: publish device: node 0xe219f9b8, path random, module bus_managers/random/device_v1 KERN: publish device: node 0xe219f9b8, path urandom, module bus_managers/random/device_v1 KERN: publish device: node 0xe219faf8, path power/button/power, module drivers/power/acpi_button/device_v1 KERN: publish device: node 0xe219fb48, path power/button/power_fixed, module drivers/power/acpi_button/device_v1 KERN: REG: Failed to open shadow passwd DB file "/etc/shadow": No such file or directory KERN: usb hub 48: port 0: new device connected KERN: package_daemon [79013430: 547] root at "/boot" (device: 3, node: 524288) registered KERN: package_daemon [79015259: 547] volume at "/boot/system" registered KERN: package_daemon [79017021: 547] volume at "/boot/home/config" registered KERN: register_domain(9, unix) KERN: unregister_domain(0x82805200, 9, unix) KERN: register_domain(5, internet6) KERN: unregister_domain(0x82bd4800, 5, internet6) KERN: register_domain(5, internet6) KERN: unregister_domain(0x82bd4c80, 5, internet6) KERN: register_domain(1, internet) KERN: unregister_domain(0x82bd4c80, 1, internet) KERN: register_domain(5, internet6) KERN: unregister_domain(0x828051c0, 5, internet6) KERN: register_domain(5, internet6) KERN: unregister_domain(0x828051c0, 5, internet6)
I tried it with just the USB adapter, without the SD card plugged into it, and it only registers the device.... then goes no further, even if you add the SD card, plug and unplug, etc.
KERN: usb hub 48: port 0: new device connected
Nothing after that no matter what. It needs the SD card in the adapter.
I might try it again in a minute with another USB drive as well, but with the SD card adapter I can repeatedly boot now using that method (and am writing this from within the latest nightly, hrev48579 , installed on the hardware.)
Next I'll see if I can track this down to a specific revision.
comment:5 by , 10 years ago
Component: | System → Drivers/USB |
---|---|
Owner: | changed from | to
There're a lot of Last message repeated x times messages in your syslog. But I couldn't find which message it actually is. In any way, it sounds like a problem with usb drivers.
comment:6 by , 10 years ago
OK... after doing some "git bisect" investigating, and double checking with extra clean checkouts, we've narrowed down the introduction of the boot problem to Pawel's scheduler changes in hrev46690
justin@kuroiishi:~/Downloads/Work/Haiku-stuff/source/haiku$ git bisect bad 7db89e8dc395db73368479fd9817b2b67899f3f6 is the first bad commit commit 7db89e8dc395db73368479fd9817b2b67899f3f6 Author: Pawel Dziepak <pdziepak@quarnos.org> Date: Mon Nov 25 23:50:27 2013 +0100 kernel: Rework cpuidle module * Create new interface for cpuidle modules (similar to the cpufreq interface) * Generic cpuidle module is no longer needed * Fix and update Intel C-State module :040000 040000 d6d4c434399da0e69156647e39bed1ebdb778a74 fcfcbc1b845bdc8c2e81c74c5273f95a37639f2b M build :040000 040000 9c13aa292779e9727b668f115b24964ca901e395 36a062667b6180e3aaf582f26304bf5d714f6b24 M headers :040000 040000 67dc2e11188800a618d27dcd181af3f91dd589a9 e1cee28ce54667209e412294131cdcbf61371619 M src
I have no idea how this has led to the /etc/shadow error, but it appears to be the problematic revision.
On top of this bug, I noticed another bug that I've filed another ticket for (#11685);
The CPU up through hrev46689 would idle between 0 and 2%... almost flatline... but starting with hrev46690, it will spike every few seconds for anywhere from half a second to a few seconds during which the system basically freezes. This makes the system almost unusable for getting any work done.
I'll attach screenshots of ActivityMonitor on a fresh boot of hrev46689 and the latest nightly hrev48579 (which behaves the same as hrev46690). These are both just sitting totally idle with nothing else running.
comment:7 by , 10 years ago
I don't think the "/etc/shadow" error has anything to do with this (it appears in the good syslogs too) it just happens to be the last thing to happen before CPU-idle kicks in.
comment:9 by , 10 years ago
Component: | Drivers/USB → - General |
---|---|
Owner: | changed from | to
Status: | new → assigned |
The repeat messages stem from the wireless driver probe requests:
KERN: [net/atheroswifi/0] ieee80211_ref_node (ieee80211_send_probereq:1748) 0xe2136000<1c:bd:b9:d2:eb:c7> refcnt 3 KERN: Last message repeated 20 times. Last message repeated 2 times
The KERN prefixed ones are from the kernel. The first repeat is the repeat indicator of the last kernel log message (the wifi one). The unprefixed one is the repeat indicator of the syslog daemon. In this case the kernel repeat of 20 is itself repeated twice. In an otherwise idle system this does not look problematic.
As noted above the "/etc/shadow" message is normal. The shadow password file is only created if you actually set a password for your user or create other users.
As a side note: From reading the syslog it looks like you are running your disk controller in IDE/compatibility mode which, except for actual compatibility with already installed other OSes that require it, is rather suboptimal. Using AHCI mode reduces the overhead of disk IO greatly.
The actual problem is unfortunately still hard to diagnose with the given revision. Even though it is only a single hrev, it points to a merge commit that brings in a lot of changes at once. Note also that there were some issues with the merged scheduler branch early on that only got exposed on certain hardware and were subsequently fixed in later hrevs.
Ideally you could bisect the scheduler branch itself, to find the commit within that first caused these symptoms. Unfortunately there are no pre-built images of that branch AFAIR.
comment:10 by , 10 years ago
Actually, look at the commit that Git listed as causing the issue: it's not the scheduler merge itself, it's one of the commits that was included in the merge. Looks like Git was smart enough to figure out there was a merge, and bisected everything in the merge, too.
comment:11 by , 10 years ago
Almost all of the bisections after the initial one seemed to focus on hrev46690 (unless I was reading the logs wrong?).
For example;
~/source/haiku> git describe --tag hrev46160-148-g7db89e8 ~/source/haiku> git tag --contains hrev46160-148-g7db89e8 | head -n1 hrev46690 ~/source/haiku>
I have no problem building images and testing them etc. I just might need help finding my way around git (as folks have helpfully provided thus far).
comment:12 by , 10 years ago
That makes sense -- IIRC, Pawel branched master off for his scheduler work at hrev46160, so -148 means 148 commits into his scheduler branch.
comment:13 by , 10 years ago
Owner: | changed from | to
---|
comment:15 by , 10 years ago
comment:16 by , 10 years ago
Component: | - General → Drivers/Power |
---|
follow-up: 18 comment:17 by , 10 years ago
Does the CPUIdle module count as the "power" module? AFAIK, it's really part of the scheduler.
comment:18 by , 10 years ago
Replying to waddlesplash:
Does the CPUIdle module count as the "power" module? AFAIK, it's really part of the scheduler.
http://cgit.haiku-os.org/haiku/tree/src/add-ons/kernel/power/cpuidle
comment:20 by , 9 years ago
jstressman, any updates on this one? Have you been able to bisect it any further?
comment:21 by , 8 years ago
I have a similar problem now since 30.06.2016 i cannot boot behond the rocket icon... I had no problems before to update Haiku via pkgman... disabeling and/or use save mode has same result in stopping by the rocket icon... I opened a bug report as well. Maybe it is a different problem...
I think if i do a clean new install it will work... but I have by now some data I do not want to loose... also it might a problem for future Haiku users if the problem cannot be resolved... so I wait until i can get Haiku to update again using pkgman...
comment:22 by , 8 years ago
jstressman, any updates on this one? Have you been able to bisect it any further?
As seen in comment:6, the problem is the new power module.
comment:23 by , 6 years ago
Keywords: | boot-failure added |
---|
comment:24 by , 6 years ago
Blocking: | 7665 removed |
---|
comment:25 by , 6 years ago
Milestone: | R1/beta1 → Unscheduled |
---|
Since nobody (or at least very few?) others has reported this issue on their hardware, moving out of beta1 milestone as low-priority.
comment:26 by , 6 years ago
It happened here; I removed all repos and added only the r1beta1 repo. Then I did a pkgman update, which downloaded very few files (I had to update from somehting 51xxx) Now, after reboot, it hangs on the rocket. VmWare (gcc2) - Can't access boot loader menu, as VMWWare boots that fast and SHIFT or SPACE don't call the menu. Update: Could no access the bootloader and got the on screen debug output:
comment:28 by , 5 weeks ago
Is this still a problem, after the intervening scheduler, power, and kernel synchronization primitives fixes over the years since?
comment:29 by , 5 weeks ago
I don't have a way to check as I finally retired that computer a year or two ago. Probably safe to retire this one and if someone actually runs into the problem we can reopen it or open a new one.
comment:30 by , 5 weeks ago
Resolution: | → not reproducible |
---|---|
Status: | assigned → closed |
Thanks for responding!
Second "half" of the syslog. See syslog.old for the first "half".