Opened 10 years ago

Closed 8 years ago

#11280 closed bug (fixed)

Intermittent no USB on boot

Reported by: vidrep Owned by: mmlr
Priority: normal Milestone: R1
Component: Drivers/USB Version: R1/Development
Keywords: Cc:
Blocked By: Blocking: #11234, #11432, #12355
Platform: All

Description

Boot hrev47896 x86_64. USB mouse no functions and process controller showing 100% CPU on one core. USB keyboard works. Sometimes the network config fails and there are no USB devices whatsoever. Must reboot to recover system.

Attachments (50)

syslog (211.3 KB ) - added by vidrep 10 years ago.
previous_syslog (113.4 KB ) - added by vidrep 10 years ago.
syslog.2 (102.1 KB ) - added by vidrep 10 years ago.
syslog.old (512.1 KB ) - added by vidrep 10 years ago.
IMG_0185.JPG (1.2 MB ) - added by vidrep 10 years ago.
IMG_0182.JPG (1.4 MB ) - added by vidrep 10 years ago.
IMG_0184.JPG (1.2 MB ) - added by vidrep 10 years ago.
HP_dc5750_serial.txt (109.1 KB ) - added by vidrep 10 years ago.
previous_syslog.2 (112.0 KB ) - added by vidrep 10 years ago.
syslog.3 (445.8 KB ) - added by vidrep 10 years ago.
PathMonitor_bt.log (2.8 KB ) - added by diver 10 years ago.
syslog_diver (449.5 KB ) - added by diver 10 years ago.
syslog.4 (65.2 KB ) - added by vidrep 9 years ago.
previous_syslog.3 (114.8 KB ) - added by vidrep 9 years ago.
syslog.2.old (512.0 KB ) - added by vidrep 9 years ago.
syslog.5 (328.1 KB ) - added by vidrep 9 years ago.
previous_syslog.4 (116.5 KB ) - added by vidrep 9 years ago.
previous_syslog.5 (119.7 KB ) - added by vidrep 9 years ago.
syslog.6 (394.8 KB ) - added by vidrep 9 years ago.
syslog.7 (237.0 KB ) - added by vidrep 9 years ago.
47479_syslog (256.3 KB ) - added by vidrep 9 years ago.
47479_previous_syslog (111.9 KB ) - added by vidrep 9 years ago.
47479_syslog.old (512.0 KB ) - added by vidrep 9 years ago.
previous_syslog_49413 (117.3 KB ) - added by vidrep 9 years ago.
syslog_49413 (479.7 KB ) - added by vidrep 9 years ago.
IMG_0130.JPG (1.7 MB ) - added by vidrep 9 years ago.
tail_teams.JPG (1.5 MB ) - added by vidrep 9 years ago.
threads.JPG (1.8 MB ) - added by vidrep 9 years ago.
PathMonitor_serial_log.txt (154.3 KB ) - added by vidrep 9 years ago.
syslog.8 (390.8 KB ) - added by vidrep 9 years ago.
Jamfile (1.5 KB ) - added by vidrep 9 years ago.
UserBuildConfig (1.2 KB ) - added by vidrep 9 years ago.
jam_input_server (12.1 KB ) - added by vidrep 9 years ago.
Debugger-607-debug-24-10-2015-08-55-22.report (13.4 KB ) - added by vidrep 8 years ago.
input_server-460-debug-24-10-2015-08-57-27.report (11.2 KB ) - added by vidrep 8 years ago.
10242015.txt (595.1 KB ) - added by vidrep 8 years ago.
input_server-465-debug-24-10-2015-18-36-27.report (10.8 KB ) - added by vidrep 8 years ago.
input_server-466-debug-24-10-2015-20-03-45.report (8.1 KB ) - added by vidrep 8 years ago.
input_server-463-debug-24-10-2015-23-20-36.report (8.1 KB ) - added by vidrep 8 years ago.
input_server-467-debug-24-10-2015-23-25-23.report (11.2 KB ) - added by vidrep 8 years ago.
syslog.9 (219.7 KB ) - added by vidrep 8 years ago.
listimage (36.9 KB ) - added by vidrep 8 years ago.
input_server-461-debug-03-11-2015-21-14-28.report (13.6 KB ) - added by vidrep 8 years ago.
input_server-463-debug-03-11-2015-21-18-05.report (8.2 KB ) - added by vidrep 8 years ago.
input_server-463-debug-04-11-2015-19-35-30.report (13.4 KB ) - added by vidrep 8 years ago.
syslog.10 (23.7 KB ) - added by vidrep 8 years ago.
syslog.3.old (512.0 KB ) - added by vidrep 8 years ago.
previous_syslog.6 (165.6 KB ) - added by vidrep 8 years ago.
11052015.txt (170.5 KB ) - added by vidrep 8 years ago.
input_server-461-debug-05-11-2015-20-01-27.report (13.3 KB ) - added by vidrep 8 years ago.

Change History (143)

by vidrep, 10 years ago

Attachment: syslog added

comment:1 by diver, 10 years ago

Component: - GeneralDrivers/USB
Owner: changed from nobody to mmlr

Changing component to USB but I didn't spot any usb related errors in your syslog. Could you check what your CPU is busy with?

comment:2 by diver, 10 years ago

Does your mouse work in x86 Haiku? Please open separate tickets for network problem and missing USB devices.

comment:3 by vidrep, 10 years ago

May be related to ticket #11234. It happens on both 32 and 64 bit.

comment:4 by vidrep, 10 years ago

Summary: USB intermittent no function on boot 64 bit.Intermittent no USB on boot

comment:5 by vidrep, 10 years ago

Updated to hrev47901 x86_gcc2 using "pkgman update", "shutdown -r". Upon reboot the system is frozen - no USB keyboard or mouse. Process controller is showing high CPU load (see attached photo). Attached syslog(s). I will try to get a serial debug log to post later.

by vidrep, 10 years ago

Attachment: previous_syslog added

by vidrep, 10 years ago

Attachment: syslog.2 added

by vidrep, 10 years ago

Attachment: syslog.old added

by vidrep, 10 years ago

Attachment: IMG_0185.JPG added

comment:6 by ttcoder, 10 years ago

Been having very similar symtoms on this Asus system for a few weeks. Seems to occur on cold boots, not warm boots.

Not easy accessing stuff without a working mouse :-), but I managed to check the following:

  • unlike the above posted syslog, the syslog here shows USB traffic in the last few lines, but not errors, just "device added" type lines.
  • managed to launch a Terminal and type top and I got the following: the offending team is input_server and the thread is something like PathMonitor loop

Is it the same for you vidrep? Otherwise I will file a different ticket.

Could be yet another case of an uninitialized variable in input_server or some such... (I had other boot-dependant trouble with input_server the past couple years).

EDIT: the thread is a BLooper instantiated here http://grok.bikemonkey.org/source/xref/haiku/src/kits/storage/PathMonitor.cpp#2146 , presumably from AddOnManager.cpp:220 ; analyzing further seems quite non-trivial..

Last edited 10 years ago by ttcoder (previous) (diff)

in reply to:  6 ; comment:7 by korli, 10 years ago

Replying to ttcoder:

  • managed to launch a Terminal and type top and I got the following: the offending team is input_server and the thread is something like PathMonitor loop

I got this several times already when installing a package like Freetype. It stops after a while. I'd say it worth a ticket at least.

comment:8 by vidrep, 10 years ago

It happened again this morning on hrev47901 x86_64. This time, the screen "tore" before the desktop appeared (see image0182), only the USB keyboard was non-functional, and one CPU core was 100% (see image0184). I was logging the serial output at the time which I have attached (HP_dc5750_serial. I'll attach the syslog on the next reboot.

Last edited 10 years ago by vidrep (previous) (diff)

by vidrep, 10 years ago

Attachment: IMG_0182.JPG added

by vidrep, 10 years ago

Attachment: IMG_0184.JPG added

by vidrep, 10 years ago

Attachment: HP_dc5750_serial.txt added

by vidrep, 10 years ago

Attachment: previous_syslog.2 added

by vidrep, 10 years ago

Attachment: syslog.3 added

comment:9 by vidrep, 10 years ago

The same symptoms as described earlier also apply to 32 bit build as if hrev47912, including tearing of boot screen and both CPU indicating 100% by process controller.

Last edited 10 years ago by vidrep (previous) (diff)

in reply to:  7 comment:10 by ttcoder, 10 years ago

Replying to korli:

Replying to ttcoder:

  • managed to launch a Terminal and type top and I got the following: the offending team is input_server and the thread is something like PathMonitor loop

I got this several times already when installing a package like Freetype. It stops after a while. I'd say it worth a ticket at least.

Looks like jessica and diver granted that wish with #11049 :-) (if I'm following that ticket correctly)

comment:11 by diver, 10 years ago

kdebug> threads 234
thread         id  state     wait for  object      cpu pri  stack      team  name
0x82a00000    234  waiting   cvar      0xd2dda6a0    -  20  0xd1ff7000  234  input_server
0xd301ed40    240  waiting   cvar      0xd2ffb748    - 103  0x81d2b000  234  _input_server_event_loop_
0xd301e4a0    241  waiting   sem       1351          -  10  0x81d33000  234  AddOnMonitor
0xd301d7b0    244  ready     -                       -   1  0x81d3f000  234  PathMonitor looper
0xd301cf10    246  waiting   cvar      0xd2fffe64    - 104  0x81d65000  234  Tablet Tablet 1 watcher

kdebug> bt 244
stack trace for thread 244 "PathMonitor looper"
    kernel stack: 0x81d3f000 to 0x81d43000
      user stack: 0x7a678000 to 0x7a6b8000
frame               caller     <image>:function + offset
 0 81d42e4c (+ 224) 80094b37   <kernel_x86> reschedule(int32: 2) + 0x1007
 1 81d42f2c (+  48) 80094be1   <kernel_x86> scheduler_reschedule + 0x61
 2 81d42f5c (+  64) 801438f1   <kernel_x86> x86_hardware_interrupt + 0xe1
 3 81d42f9c (+  12) 80136ace   <kernel_x86> int_bottom_user + 0x73
user iframe at 0x81d42fa8 (end = 0x81d43000)
 eax 0x2           ebx 0xa027f4       ecx 0x18417e08  edx 0x7a6b7744
 esi 0x18417de8    edi 0x7a6b7744     ebp 0x7a6b76c8  esp 0x81d42fdc
 eip 0x90f8ce   eflags 0x13246   user esp 0x7a6b76b0
 vector: 0xfb, error code: 0x0
 4 81d42fa8 (+   0) 0090f8ce   <libbe.so> node_ref<0x7a6b7744>::__eq(node_ref: 0x18417e08, node_ref&: 0x183e789e) + 0x16
 5 7a6b76c8 (+  48) 00917be5   <libbe.so> _GLOBAL_.N._home_builder_builds_haiku_src_kits_storage_PathMonitor.cppCMHPCe::PathHandler<0x184434f8>::_GetAncestor(_GLOBAL_.N._home_builder_builds_haiku_src_kits_storage_PathMonitor.cppCMHPCe::PathHandler: 0x7a6b7744, node_ref&: 0x17) + 0x4d
 6 7a6b76f8 (+ 224) 00917079   <libbe.so> _GLOBAL_.N._home_builder_builds_haiku_src_kits_storage_PathMonitor.cppCMHPCe::PathHandler<0x184434f8>::_EntryCreated(BPrivate::NotOwningEntryRef&: 0x7a6b78b8, node_ref&: 0x7a6b78ac, true) + 0x179
 7 7a6b77d8 (+ 240) 009157fa   <libbe.so> _GLOBAL_.N._home_builder_builds_haiku_src_kits_storage_PathMonitor.cppCMHPCe::PathHandler<0x184434f8>::_EntryCreated(BMessage*: 0x183e3b50) + 0x1d2
 8 7a6b78c8 (+  48) 0091497a   <libbe.so> _GLOBAL_.N._home_builder_builds_haiku_src_kits_storage_PathMonitor.cppCMHPCe::PathHandler<0x184434f8>::MessageReceived(BMessage*: 0x183e3b50) + 0x62
 9 7a6b78f8 (+  48) 007e7b23   <libbe.so> BLooper<0x184435c8>::DispatchMessage(BMessage*: 0x183e3b50, BHandler*: 0x184434f8) + 0x5b
10 7a6b7928 (+  64) 007e93cd   <libbe.so> BLooper<0x184435c8>::task_looper(0x0) + 0x211
11 7a6b7968 (+  48) 007e8fbb   <libbe.so> BLooper<0x184435c8>::_task0_(NULL) + 0x3f
12 7a6b7998 (+  48) 01455feb   <libroot.so> _get_next_team_info (nearest) + 0x5f
13 7a6b79c8 (+   0) 613ef250   <commpage> commpage_thread_exit + 0x00

by diver, 10 years ago

Attachment: PathMonitor_bt.log added

by diver, 10 years ago

Attachment: syslog_diver added

comment:12 by vidrep, 9 years ago

This is still an issue with latest revision (hrev48147 x86_64). First boot after updating with "pkgman update", system freeze and 100% on one cpu core. It is also happening with 32 bit builds.

Last edited 9 years ago by vidrep (previous) (diff)

comment:13 by ttcoder, 9 years ago

@vidrep do you have several partitions that get mounted at boot time, or just a unique "/Haiku" volume ? If you have several, what happens if you disable their auto-mounting at boot time, does the bug go away ?

I ask because of a tiney-weeny small-ish chance that I've found a/the trigger of the pathmonitor loop bug: that bug just occured to me on my laptop for the first time (it usually occurs on my desktop).. And it happened as I (also for the first time in weeks) manually mounted an ntfs partition, a few seconds after the desktop appeared. So my theory is that maybe it's the BMessage received due to subscribing withBVolumeRoster::StartWatching() or whatever, that throws pathmonitor a curve, at least part of the time.

in reply to:  13 comment:14 by vidrep, 9 years ago

Replying to ttcoder:

@vidrep do you have several partitions that get mounted at boot time, or just a unique "/Haiku" volume ? If you have several, what happens if you disable their auto-mounting at boot time, does the bug go away ?

I ask because of a tiney-weeny small-ish chance that I've found a/the trigger of the pathmonitor loop bug: that bug just occured to me on my laptop for the first time (it usually occurs on my desktop).. And it happened as I (also for the first time in weeks) manually mounted an ntfs partition, a few seconds after the desktop appeared. So my theory is that maybe it's the BMessage received due to subscribing withBVolumeRoster::StartWatching() or whatever, that throws pathmonitor a curve, at least part of the time.

I have three Haiku partitions on my hard drive - alpha 4.1,dev, and x86_64. Each is booted separately from the bootman menu. These are not mounted at boot time. I only mount the other partitions sometimes when I need to copy a file.

comment:15 by vidrep, 9 years ago

Since updating to hrev48720 x86_gcc2 the problem is much worse than on previous builds. Usually 2 or 3 boot attempts are required. I have attached the latest syslog(s), should they prove to be useful.

by vidrep, 9 years ago

Attachment: syslog.4 added

by vidrep, 9 years ago

Attachment: previous_syslog.3 added

by vidrep, 9 years ago

Attachment: syslog.2.old added

comment:16 by ttcoder, 9 years ago

Still there in hrev48958 ; I take back what I said about cold boots though: this time it occured on a warm boot. The USB mouse still worked, only the USB keyboard refused to work, 100% CPU usage on one core ..etc. They're both connected through a hub, and the hub to the desktop, if that matters.

Infrequent, hard to reproduce bug here.

comment:17 by vidrep, 9 years ago

Still here in 48971; no mouse and one cpu core 100%; had to reboot 5 times before I had both keyboard and mouse. I'll attach a syslog after a few more tries.

comment:18 by vidrep, 9 years ago

I couldn't get x86_gcc2 to do it again, but my x86_64 partition did it right away. Attached are the syslog and previous syslog. hrev48982 x86_64

Last edited 9 years ago by vidrep (previous) (diff)

by vidrep, 9 years ago

Attachment: syslog.5 added

by vidrep, 9 years ago

Attachment: previous_syslog.4 added

comment:19 by pulkomandy, 9 years ago

Blocking: 11432 added

(In #11432) Yes. Let's track the problem there then.

comment:20 by ttcoder, 9 years ago

It's occuring right now on my desktop, keeping it open in debugger in case someone wants me to try out a command?

I tried to step.. step.. step.. for a while in debugger (as well as Run/Debug/Run a dozen times), and we never ever get out of GetAncestor(), I'm always either in node_ref::eq() or in GeAncestor().

So I would theorize that..

Plausible?

EDIT: the registers for the argument to node_ref::eq() have always the same value, so if I read the registers right, it seems the line

slot = _Link(slot);

does not change the value of slot, i.e. slot is linked to itself, so we never reach a slot that would make us break.

Last edited 9 years ago by ttcoder (previous) (diff)

comment:21 by pulkomandy, 9 years ago

A slot being linked to itself shouldn't happen, so if this is the case, it would mean the hash table is corrupt. This part is behaving like a linked list.

Does enabling the tracing in PathMonitor.cpp reveal anything interesting? In particular, is the PathHandler class used properly (no use after deletion, etc)?

comment:22 by pulkomandy, 9 years ago

Blocking: 11234 added

comment:23 by ttcoder, 9 years ago

I'm stuck.

1) First wanted to start nice with a 'live' restart of input_server (instead of rebooting); but when I invoke /system/servers/input_server -q nothing happens whatsoever. Does it work for others ? Here the mouse movement does not get frozen for even a fraction of a second, and nothing gets output to syslog, hence I believe input_server is refusing to restart. (EDIT: also tried in an old 45943 and there it works... half of the time; the other half I get "locked out" of the system as neither keyboard nor mouse work any more; point is, input_server used to kinda sorta restart, but now it does not any more)

2) Next I gave up on the live restart for now and tried to replace/override the system's instance, see what would occur after rebooting, but there is still nothing output to Terminal; here's what I did:

  • black-listed the system instance
  • made a copy of the system instance of input_server, hoping it would get picked up by signature (it does, see below).
  • looked in syslog, no sign whatsoever of input_server logging, even though there is a libbe.so built correctly next to the server (see below)
~/Desktop> ps
Team                                                  Id #Threads  Gid  Uid 
kernel_team                                            1       47    0    0 
 (..)
/boot/system/servers/app_server                      399       48    0    0 
/boot/system/servers/syslog_daemon                   418        2    0    0 
/boot/home/Desktop/hrev_input-server-CUSTDEV/_test_sandbox/inpu   425       10    0    0 

~/Desktop> cd hrev_input-server-CUSTDEV/_test_sandbox
~/Desktop/hrev_input-server-CUSTDEV/_test_sandbox> ll
total 260
-r-xr-xr-x 1 user root 253202 Apr 16 05:31 input_server
drwxr-xr-x 1 user root   2048 Apr 19 18:54 lib
~/Desktop/hrev_input-server-CUSTDEV/_test_sandbox> strings lib/libbe.so | grep B
Path
 (..)
BPathMonitor: BPathMonitor::StartWatching(%s, %lx)
BPathMonitor: BPathMonitor::StopWatching(%s)
BPathMonitor: Create PathMonitor locker
BPathMonitor: Start PathMonitor looper
Q38BPrivate12BPathMonitor18BWatchingInterface
~/Desktop/hrev_input-server-CUSTDEV/_test_sandbox> 

Any idea ?

Note -- the bug seems to no longer occur now that I upgraded to 49041, went back lurking in the shadows :-/ But getting help on the above would still be useful for the day when it comes back.

Last edited 9 years ago by ttcoder (previous) (diff)

comment:24 by mmlr, 9 years ago

This might be resolved by hrev49058. Please retest if you were able to reproduce this in some way.

Regarding restarting of the input_server the code handling "-q" seems very questionable. However you can just use hey to quit the input_server ("hey input_server quit"). It may block waiting for the mouse to move to wake up the listener thread, in that case just move the mouse. Once it quit, the app_server should immediately restart it and you might not even notice that it was gone. You will see the original main thread of the quit input_server in the reply printed by hey and you can check that the running input_server has a different main thread/team id to be sure.

PS: If you edit your comments, no email notification will go out for that change. For people staying in the bugtracker loop purely by reading the mailing list (like me), there's a very high chance of these edits not being noticed. Therefore please avoid adding anything relevant to tickets via comment edits (style cleanup and typos are fine for edits).

comment:25 by diver, 9 years ago

/system/servers/input_server -q is not very reliable, see #10926.

comment:26 by vidrep, 9 years ago

I just had it happen using hrev49058 x86_gcc2. Apparently it is still unresolved.

comment:27 by diver, 9 years ago

Unfortunately, it's still here in hrev49133. When observed I set PathMonitor looper thread to idle priority and then /system/servers/input_server -q. This made system extremely slow and unresponsive. After a few minutes I managed to open ProcessController and that thread priority was set to 1.

comment:28 by diver, 9 years ago

BTW, should we close it as a dupe of #11049?

comment:29 by humdinger, 9 years ago

FWIW, I haven't seen a pegging PathMonitor looper for many weeks now. I used to get it approx. on every 5th bootup...

comment:30 by vidrep, 9 years ago

Just had it happen on first boot after pkgman full-sync to hrev49318. It does not happen as often, but that could also be because I am using Haiku less and less due to all the things that are broken.

comment:31 by vidrep, 9 years ago

Had it happen again yesterday with hrev49321. Attached are the syslogs from the last freeze.

by vidrep, 9 years ago

Attachment: previous_syslog.5 added

by vidrep, 9 years ago

Attachment: syslog.6 added

comment:32 by vidrep, 9 years ago

I have been trying to narrow down the range of nightly builds to when this issue first appeared. Jessica first reported it using hrev47514 (#11049), and my ticket was with hrev47853 (#11234). Unfortunately I did not create a ticket upon seeing it for the first time, but it was certainly within a period of 2 or 3 of weeks. I have installed build hrev47479, which is just prior to Akshay's libUSB port, and so far, I have not seen the problem appear. Where can I obtain a builds of 47483 and 47630?

comment:33 by vidrep, 9 years ago

hrev49337 x86_gcc2 required two reboots today before I got a working mouse. Syslog (attached). hrev47479 x86_gcc2 has been booted several times per day without any appearance of the issue discussed. I would like to obtain a copy of hrev47483 to confirm or disprove my belief that the usb commits introduced in hrev47481 and hrev47483 are in fact the source of the bug.

Last edited 9 years ago by vidrep (previous) (diff)

by vidrep, 9 years ago

Attachment: syslog.7 added

in reply to:  33 comment:34 by mmlr, 9 years ago

Replying to vidrep:

I would like to obtain a copy of hrev47483 to confirm or disprove my belief that the usb commits introduced in hrev47481 and hrev47483 are in fact the source of the bug.

Since these changes only really concern xHCI, which is not part of the image, I rather doubt that they are responsible. The only change to the overall stack is the addition of the controller_cookie argument to the Hub class which defaults to the same default as the one in the Device class and should therefore make no logical difference to the existing code.

comment:35 by humdinger, 9 years ago

I suspect it's totally unrelated, but I used to boot Haiku via grub2 and haven't seen the pegging PathMonitor looper for weeks, as I wrote a few comments above. Now I'm using bootman and have the pegging thread maybe every 3rd bootup... "Conicidinc"? (This is with hrev49292)

comment:36 by vidrep, 9 years ago

This is probably the single most annoying bug in Haiku right now, due to the need to reboot again ( as many as six times on one occasion). I'm trying to nail down a range of when the bug first appeared. It may take a while due to the intermittent nature of the problem. My recollection was that I fist saw it sometime in mid-August 2014, during which time many changes were made to USB and the kernel scheduler.

comment:37 by vidrep, 9 years ago

I have seen the problem on hrev47479 three times while testing today, including on consecutive reboots. So, obviously it predates that build. Is there a repository where older builds can be downloaded? I have attached the syslogs from that testing session in case they may provide useful information.

by vidrep, 9 years ago

Attachment: 47479_syslog added

by vidrep, 9 years ago

Attachment: 47479_previous_syslog added

by vidrep, 9 years ago

Attachment: 47479_syslog.old added

in reply to:  37 comment:38 by humdinger, 9 years ago

Replying to vidrep:

Is there a repository where older builds can be downloaded?

On the page of the nightly build, there's a link to older images at the bottom.

comment:39 by vidrep, 9 years ago

Yes, I am aware of it and have already installed hrev47380. The problem is that it will be difficult to narrow the range any further, so as to determine approximately when the problem first occurred. Right now the range is somewhere between hrev47380 and hrev47479.

comment:40 by vidrep, 9 years ago

I have narrowed it down a little further. The x86_gcc4 hybrid archive allowed me to test hrev47458, which froze on the first boot after installation to my hard drive, and on 2 of 4 boots immediately thereafter. Hrev47380 continues to boot time and again without issue. So, the range is somewhere between hrev47380 and hrev47458.

comment:41 by vidrep, 9 years ago

Happened again...hrev49413 x86_gcc2 Another syslog attached. Syslog says: hda: buffer_exchange: Error waiting for playback buffer to finish (Interrupted system call)! ???

Version 0, edited 9 years ago by vidrep (next)

by vidrep, 9 years ago

Attachment: previous_syslog_49413 added

by vidrep, 9 years ago

Attachment: syslog_49413 added

comment:42 by diver, 9 years ago

Blocking: 12355 added

comment:43 by vidrep, 9 years ago

The problem is still present in hrev49627. Sometimes multiple boot attempts are required to get a functional mouse or keyboard. The problem also appears to have evolved somewhat. Previously it only happened at boot. Now it intermittently happens after the system has been up and running - one CPU core will be pegged at 100%, however keyboard and mouse are still working. Under these conditions I see the CPU cores pegging at 100%, alternating in random fashion in Pulse.

comment:44 by vidrep, 9 years ago

I have noted that this bug more often that not shows up on the reboot after an update. i.e. pkgman -> update or pkgman ->full-sync Yesterday, for example I had to reboot 5 times before I had a mouse. Keyboard always seems to work. If I drop into KDL is there anything you would suggest I try?

comment:45 by vidrep, 9 years ago

I had it happen again today. I dropped into KDL and tried syslog | tail (photo attached). Maybe the output will give a clue, maybe not.

by vidrep, 9 years ago

Attachment: IMG_0130.JPG added

comment:46 by vidrep, 9 years ago

Attached are a couple more photos I took today of KDL output from a pegging CPU on hrev49662 x86_gcc2. This time I noticed that one CPU core pegged at 100% after running a few minutes, but without any effect on the keyboard or mouse. Eventually each core will alternate going to 100%.

by vidrep, 9 years ago

Attachment: tail_teams.JPG added

by vidrep, 9 years ago

Attachment: threads.JPG added

comment:47 by vidrep, 9 years ago

Today when it happened I invoked KDL and ran a few back traces. See attached PathMonitor_serial_log.

by vidrep, 9 years ago

Attachment: PathMonitor_serial_log.txt added

comment:48 by vidrep, 9 years ago

It happened on consecutive boots. I have attached a syslog with KDL from the second session.

by vidrep, 9 years ago

Attachment: syslog.8 added

comment:49 by ttcoder, 9 years ago

PathMonitor_serial_log.txt​ contains the bt of possibly the wrong "path monitor looper" thread (there are several), but syslog.8 has some interesting stuff:

Its backtrace locates the input_server PathMonitorLooper in the same place as with diver in comment:11 and also same as me in comment:20 , that is to say, in GetAncestor()

Also, said "GetAncestor()" call gets interrupted and the last function call in the stack is to process_pending_ici().. Might be the same "ICI" that bugs vidrep in his other ticket, or just a coincidence? (I dunno what ICI is).

Will try a couple more things in off-site emails with vidrep

comment:50 by vidrep, 9 years ago

Would it be possible to have a input_server debug build of Haiku made available for testing? Instructions on how to do it myself would be welcome.

comment:51 by ttcoder, 9 years ago

Here's a tentative outline for others to flesh out if I missed something:

  • Get the source (if not already downloaded previously): create a folder and from within it, run git clone git://git.haiku-os.org/haiku (if on the other hand you already had the source, just run "git pull" to make sure it's up to date).
  • configure the build of the input_server component: locate the file "haiku/build/jam/UserBuildConfig" and add this line inside: SetConfigVar DEBUG : HAIKU_TOP src servers input_server : 3 : global ; , keeping it exactly like this (don't forget the white spaces anywhere).

If that doesn't work for you (didn't for me), instead locate the file "haiku/src/servers/input/Jamfile" and add the DEBUG variable as a line around third position such that it looks like this:

SubDir HAIKU_TOP src servers input ;
SetSubDirSupportedPlatformsBeOSCompatible ;

DEBUG = 1 ;
  • run jam input_server ; check that it produces its output in "haiku/generated/objects/haiku/x86_gcc2/debug_1/...." and NOT in "haiku/generated/objects/haiku/x86_gcc2/release/...".

Once you successfully have a debug build of input_server, we'll look into having it executed in place of the normal one; shouldn't be trouble as that one can be picked by signature, so we probably won't need to set you up with an .hpkg file ..etc.

comment:52 by pulkomandy, 9 years ago

In your UserBuildConfig example you are setting DEBUg to 3, so the file will be in debug_3 (not release or debug_1). I'm not sure if the input_server does things differently with different debug levels, usually I only use the value 1. Maybe that was your problem with UserBuildConfig, because I never had issues with setting DEBUG that way. Also note that you need to create the file, if it doesn't exist yet (in the git repo there is only a sample file named UserBuildConfing.sample).

Using UserBuildConfig is better, because it is a "private" file and you can easily keep your settings there (for testing or customizing your build). If you change the Jamfile directly, you can mistakenly include the change in a patch/commit, and also you can get git merge conflicts if someone else changes the file.

comment:53 by vidrep, 9 years ago

I did everything as per the above instructions. With the "UserBuildConfig" script it creates the input server, but in "haiku/generated/objects/haiku/x86_gcc2/release/..." directory. Doing it the other way with the "Jamfile" results in the following error:

...failed C++ /boot/home/haiku/generated/objects/haiku/x86_gcc2/debug_1/servers/input/BottomlineWindow.o ...

BUILD FAILURE: ...failed updating 4 target(s)... ...skipped 1 target(s)... ~/haiku>

comment:54 by ttcoder, 9 years ago

What is the gcc error (above the "failed C++...." line) exactly; What do the first few lines of the Jamfile look like, quote it in here and we'll check wether the DEBUG = 1 ; line includes white spaces ..etc as needed (if you want you can do "clean formatting" quotting by clicking the 5th icon from the left of the "Bold Italics Anchor ..." row and pasting the quote between the created curly braces).

comment:55 by ttcoder, 9 years ago

Well what do I know! :-) The gcc error (emailed off-site) is not a configuration problem but an actual problem in the code: all the errors occur between #ifdef DEBUG and #endif statement, so it's very probably a case of debug code rot, no mystery. Those tend to get less attention than release for obvious reasons.

I assume vidrep should file another ticket titled "input_server does not build with DEBUG=1" as I've seen that happen before in the ticket timeline for other components. Once that other ticket is resolved we can come back to this one.

EDIT: #12419 has been filed

Last edited 9 years ago by ttcoder (previous) (diff)

in reply to:  52 comment:56 by vidrep, 9 years ago

Replying to pulkomandy:

In your UserBuildConfig example you are setting DEBUg to 3, so the file will be in debug_3 (not release or debug_1). I'm not sure if the input_server does things differently with different debug levels, usually I only use the value 1. Maybe that was your problem with UserBuildConfig, because I never had issues with setting DEBUG that way. Also note that you need to create the file, if it doesn't exist yet (in the git repo there is only a sample file named UserBuildConfing.sample).

Using UserBuildConfig is better, because it is a "private" file and you can easily keep your settings there (for testing or customizing your build). If you change the Jamfile directly, you can mistakenly include the change in a patch/commit, and also you can get git merge conflicts if someone else changes the file.

I did as instructed but the debug build of input_server failed. I have attached a copy of the build log, jamfile and UserBuildConfig files that were used. Perhaps somebody can fix them if they are incorrect, and attach them to the ticket for download. Thanks.

by vidrep, 9 years ago

Attachment: Jamfile added

by vidrep, 9 years ago

Attachment: UserBuildConfig added

by vidrep, 9 years ago

Attachment: jam_input_server added

comment:57 by pulkomandy, 8 years ago

There is an error in your line in UserBuildConfig, you have "src servers input_server" but it is "scr servers input". I just fixed the debug build in hrev49717.

comment:58 by vidrep, 8 years ago

After doing a git pull and modifying that line in the UserBuildConfig, it now creates a debug build if input_server in haiku/generated/objects/x86_gcc2/debug_1/ directory. What is the next step to take? Pulkomandy? ttcoder?

comment:59 by pulkomandy, 8 years ago

It should be possible to put the generated input_server in /boot/system/non-packaged/servers/, it should be picked by the system at the next boot. Then, when you can reproduce the problem, attach Debugger to the input_server and save a debug report (if you can get either keyboard or mouse working, this should be possible).

We can then look at where exactly the input_server is stuck.

If you need someone to connect to the machine to help analyze things, you can set up an ssh server

  • Set a password using the "passwd" command for your user
  • Make sure "ssh server" is set to "on" in the network preferences
  • Configure your network (modem/router) so TCP port 22 is routed from the outside to your machine
  • Now the machine is open (password protected) to the internet, and anyone with the login and password can connect to it and get a terminal session. From there it's possible to inspect the situation, and extract the relevant data.

comment:60 by vidrep, 8 years ago

Just out of curiosity; is there anything in PVS or Coverity that could point to a problem in the input server? When this problem occurs after a boot, the mouse is always disabled. The keyboard usually continues to work. From there I can invoke kernel debugging to execute commands and capture the output from a serial port.

comment:61 by pulkomandy, 8 years ago

You can run a terminal (use the "menu" key to open deskbar, navigate to terminal) and do "Debugger -s --team nnn" where nnn is the team ID of the input server (you can find it using "ps | grep input_server" for example - it is the second column there). This will save a report on the desktop.

You can also use "ps -a | grep PathMonitor" to see the path monitors, most likely one of them will not be in the "wait" state. This would be the one hogging the CPU, so you can use "Debugger -s --thread nnn" where nnn is the thread identifier of that thread (again, second column in ps output).

I had a look at PVS results (http://pulkomandy.github.io/) and a search for input_server only leads to some add-ons (CommandActuators and TabletDevice), not the server itself.

comment:62 by vidrep, 8 years ago

I followed your instructions about moving the debug input_server into a non-packaged servers directory. Getting the problem to manifest itself didn't take long (it happened after the second boot attempt). I ran the commands in your instructions and created a pair of debug reports (attached). In the second instance, it caused a complete lockup of the system after executing the command. If the attached information is insufficient, let me know what further steps to take.

comment:63 by diver, 8 years ago

The first report is a Debugger crash and should be filed as a separate ticket :) The first line of the second debug report tells us that Haiku loaded /system/servers/input_server instead of the one you put into non-packaged directory, so it looks like that trick didn't work.

comment:64 by ttcoder, 8 years ago

I'm guessing you should "blacklist" the system input_server. If that makes the system become unbootable/unusable for some reason you can revert by booting to another partition and restoring things anyway. So create a /system/settings/packages file, with the normal structure (it's listed on haiku-os.org IIRC) and add a line to it that look like this: servers/input_server

comment:65 by vidrep, 8 years ago

I blacklisted the input server send am apparently using the debug build. When attempting a reboot it hangs with an error message, "The application "input_server" might be blocked on a modal panel.

comment:66 by vidrep, 8 years ago

I was logging to a serial port the second time around when the problem manifested itself. I have attached the serial log and a debug report.

by vidrep, 8 years ago

Attachment: 10242015.txt added

comment:67 by ttcoder, 8 years ago

The .report mentions /boot/home/input_server (so you've put it there instead of non-packaged? should work just as well anyway), so if that kind of behavior always occurs at reboot time it'd mean the bug is now reproducible directly at reboot time, which is kinda good news.

No dump of variable states in the .report, but the serial logging does contains output, like

CALLED void InputServer::_DispatchEvents(EventList &)
CALLED status_t InputServer::_DispatchEvent(BMessage *)`

..etc

so it might be possible to find some useful info in there

comment:68 by diver, 8 years ago

I'm not sure but maybe you also need to enable debug for SetConfigVar DEBUG : HAIKU_TOP src kits storage : 1 : global ;" in UserBuildConfig? then you need to jam -q libbe.so and place it into /system/non-packaged/lib

Last edited 8 years ago by diver (previous) (diff)

comment:69 by vidrep, 8 years ago

The strange thing is that I copied the input_server into /boot/system/non-packaged/servers/, but after blacklisting the input_server, it didn't find that one on reboot, but instead found the one in my home directory.

comment:70 by diver, 8 years ago

Was it built with debug info enabled?

comment:71 by vidrep, 8 years ago

Yes, exactly as described above DEBUG=1 The UserBuildConfig is attached to the ticket The only change since it was attached is the error noted in comment 57

comment:72 by diver, 8 years ago

I'd give comment:68 a try :) This would require replacing libbe.so though. Maybe it would be enough to put it in /system/non-packaged/lib.

comment:73 by vidrep, 8 years ago

Attached is another debug report. It appears somewhat different.

comment:74 by diver, 8 years ago

Weird, there is no PathMonitor looper thread which is the one we need.

comment:75 by pulkomandy, 8 years ago

WRT to how the input_server is found at boot: the launch daemon takes care of it, and does a search by application signature. This means the first app matching the signature, anywhere on your boot disk, will be used, in your case it was a copy in the home directory, apparently.

Now that you know how to build haiku, an useful thing to try is using git bisect to further narrow down which commit created the problem.

From your Haiku source directory, run:

git bisect start
git bisect good hrev47380
git bisect bad hrev47458

This will checkout a version of Haiku in between these two. You can then build it (run jam with the usual options) and boot the generated build. If you hit the problem, come back to your source dir and tell git:

git bisect bad

If the tested revision works, enter:

git bisect good

This will allow to pinpoint the exact commit that broke things. Git will even tell you at each step how many builds you still have to try.

---

The debug builds of libbe.so and src/kits/storage are a good idea too (as suggested above). It is important to use the second command (to get a debug report for the path monitor thread specifically). And yes, since doing so stops the input server, it's expected that it appears to freeze the system (keyboard and mouse will stop working). If you have ssh access to the machine from another computer, you could use that to restart the input_server at that point. After a reboot (you can shut down the machine cleanly by pressing the power button), you will still get the debug report on the desktop, even if the system appears frozen.



comment:76 by vidrep, 8 years ago

Attached are a pair of debug reports I generated last night before calling it a day. Whether they're helpful or not is up to you guys. Later today I'll try doing the git bisect as instructed in comment 75. Why did I open my big mouth and say I'd be willing to do the legwork?????

comment:77 by vidrep, 8 years ago

Question: Could the error message on shutdown in comment 65 be at all related to ticket #12306.

comment:78 by vidrep, 8 years ago

I tried the git bisect, but when I try to build the test image it fails. jam -q haiku-anyboot-image Using this same command works fine on the latest nightly. Has there been a change that would cause the older builds to fail? I'll post the build log later today after I get home from work.

comment:79 by diver, 8 years ago

I'd still go with comment:68 first as it sounds way easier.

comment:80 by vidrep, 8 years ago

I created a debug build of libbe.so as per the instructions. The problem manifested itself after the second boot (as usual). Attached is a syslog.

by vidrep, 8 years ago

Attachment: syslog.9 added

comment:81 by vidrep, 8 years ago

Happened again today using the debug build of libbe.so and input_server. Attached is a listimage showing that the debug builds were loaded. Created a pair of debug reports as per comment 61(attached)

by vidrep, 8 years ago

Attachment: listimage added

in reply to:  81 comment:82 by mmlr, 8 years ago

Replying to vidrep:

Created a pair of debug reports as per comment 61(attached)

The second one just has all threads running, so isn't really helpful.

From the first one the cause of the high CPU use becomes obvious: The AncestorMap becomes corrupted as one of the contained Ancestors has a hash next link pointing to itself causing an endless loop in Lookup().

Why this happens is harder to figure out. Since it's only reproducible irregularly in your case and not at all for others it is probably timing or setup dependent. Possibly a race condition due to missing locking. I've looked through the PathMonitor code but nothing obvious jumped out.

Enabling TRACE_PATH_MONITOR in PathMonitor.cpp might give some info (output to the serial log) although the TRACE statements are scarse. Removing the // in front of that line and rebuilding libbe should enable that.

comment:83 by vidrep, 8 years ago

mmlr,

I did as instructed and rebuilt libbe.so a second time.

I have attached a debug report using the new lib. I think you're mistaken in assuming that this issue is reproducible irregularly and that I am the only one experiencing this problem. In my case, I see this issue on at least 50% of boots. Not just on one PC, but several PC's I have tested in the past year.

comment:84 by diver, 8 years ago

TRACE_PATH_MONITOR outputs to the serial log.

comment:85 by vidrep, 8 years ago

Syslog(s) attached. Thanks.

by vidrep, 8 years ago

Attachment: syslog.10 added

by vidrep, 8 years ago

Attachment: syslog.3.old added

by vidrep, 8 years ago

Attachment: previous_syslog.6 added

comment:86 by mmlr, 8 years ago

Well, irregularly in the sense of it not being reproducible 100% of the time, i.e. there's some kind of variable component that decides if it happens or not.

Unfortunately the logs you attached last do not contain any BPathMonitor trace output. Are you sure you rebuilt libbe.so, replaced the previous copy in non-packaged and booted with the one from the Haiku hpkg blacklisted?

There should be output lines starting with "BPathMonitor:" (a lot of them).

comment:87 by vidrep, 8 years ago

So, just to be sure... I have to go to: haiku/src/kits/storage/PathMonitor.cpp and delete the two slashes in line 38 #define TRACE_PATH_MONITOR, add the line "SetConfigVar DEBUG : HAIKU_TOP src kits storage : 1 : global ;" in UserBuildConfig, then build libbe.so using the command jam -q libbe.so., then move the new lib into /system/non-packaged/lib and reboot, making sure to blacklist the lib at the boot menu. Correct? I'll give it another try again tonight. Is there anything I should look for, to be sure the output to the syslog is what we want? I have my PC set up for serial logging on a Windows PC. I'll capture that output as well. Thanks guys for walking me through this whole schmozzle.

comment:88 by mmlr, 8 years ago

Yes, pretty much. Blacklisting the original file of course (system/lib/libbe.so), not the one you put in non-packaged. You don't strictly need to set the debug flag, as this tracing doesn't depend on it when uncommented by removing the slashes, but it doesn't hurt either.

As stated above there should be a lot of lines starting with "BPathMonitor:" when the change took.

comment:89 by vidrep, 8 years ago

I rebuilt libbe.so and it now appears to be working. First, I experienced the CPU pegging issue, and captured the serial output on a Windows PC (11052015.txt attached). Second, I ran the debugger and created a report (attached). Third, the system refused to shutdown (ticket #12306). Maybe we'll get lucky and find the cause of both issues. Let me know if any more steps can be taken at my end.

by vidrep, 8 years ago

Attachment: 11052015.txt added

comment:90 by mmlr, 8 years ago

Thanks for the logs. Unfortunately they were not quite as helpful as I hoped for.

I came up with a synthetic test case for the BPathMonitor which produced other data structure related crashes. These are fixed in hrev49756. The hash table corruption seen here might very well have been caused by the same problem. Please check if said revision fixes the problem for you.

If hrev49756 doesn't fix the problem I'll prepare a patch that adds earlier detection of the corruption of the hash table which would then hopefully shed some light on the producer of the corruption and not just its victim.

comment:91 by vidrep, 8 years ago

So far, so good. 40+ warm/cold boots without a problem. Let's keep the ticket open to see if the issue manifests itself again over the long term.

comment:92 by vidrep, 8 years ago

It's been almost 1 week and no sign of the problem. Lets assume this is fixed. Thanks.

comment:93 by pulkomandy, 8 years ago

Resolution: fixed
Status: newclosed

Thanks for reporting and helping troubleshoot this!

Note: See TracTickets for help on using tickets.