Opened 10 years ago

Closed 10 years ago

Last modified 15 months ago

#10977 closed bug (fixed)

kdl in pthread_cond_wait on netsurf buildslave

Reported by: pulkomandy
Owned by: bonefish
Priority: normal
Milestone: R1
Component: System/Kernel
Version: R1/Development
Keywords:
Cc: hamish
Blocked By:
Blocking: #11000
Platform: All

Description

Another KDL from NetSurf's build slave. We have tried to extract as much info as possible over IRC. This happens somewhere in OpenJDK. It seems the condition variable doesn't exist and two threads are accessing it.

This is on a virtual machine installed as follows: http://wiki.netsurf-browser.org/Continuous_Integration_Manual_Haiku_Slave_Setup

Attachments (5)

1_kdl (11.1 KB ) - added by pulkomandy 10 years ago.
Backtraces of the two threads and KDL message
2_cvar (1.0 KB ) - added by pulkomandy 10 years ago.
cvars list and attempt to access the cvar
3_area (661 bytes ) - added by pulkomandy 10 years ago.
Attempt to list the area.
haiku-kdl-20140701.log (48.0 KB ) - added by pulkomandy 10 years ago.
Same panic occurred again, attaching complete syslog.
10977.cpp (2.1 KB ) - added by bonefish 10 years ago.
Test program to reproduce the issue


Change History (14)

by pulkomandy, 10 years ago

Attachment: 1_kdl added

Backtraces of the two threads and KDL message

by pulkomandy, 10 years ago

Attachment: 2_cvar added

cvars list and attempt to access the cvar

by pulkomandy, 10 years ago

Attachment: 3_area added

Attempt to list the area.

comment:1 by bonefish, 10 years ago

The immediate cause of the panic is just a missing

DEBUG_PAGE_ACCESS_END(context.page);

in vm_soft_fault() before unlocking everything when having to wait for a to-be-unmapped page to become unwired.

Unfortunately that will leave another issue to be resolved: The page we want to unmap isn't actually wired in this case. We have two threads that want to wire the same virtual page for writing. The way wire_page() works, they both first mark the respective address range wired before calling vm_soft_fault() to map a writable page (there's only a readable one from a lower cache). Each thread ignores its own pre-wired range, but not that of the other thread. Hence the read-only page looks wired and cannot be unmapped. Both threads would wait forever.

Not sure how involved a solution would be. It might be possible to mark pre-wired ranges as such and ignore them in vm_soft_fault(), but this needs to be thought through thoroughly (particularly when to unmark the ranges).
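To make the wiring race above easier to follow, here is a minimal user-space sketch of it. It is only a toy model under stated assumptions: WiredRange, ToyAddressSpace, PreWire(), LooksWiredTo() and BothThreadsWouldWait() are hypothetical names that do not exist in the Haiku kernel, and the ignorePreWired flag merely illustrates the idea of marking pre-wired ranges and skipping them. It does not claim to show how the issue was eventually fixed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy model of the race described above. None of these names
// exist in the Haiku kernel.
struct WiredRange {
    int owner;            // thread that registered the range
    std::uintptr_t base;  // first address covered
    std::size_t size;     // length in bytes
    bool preWired;        // registered by a wire operation still in progress
};

struct ToyAddressSpace {
    std::vector<WiredRange> ranges;

    // Models the "mark the range wired first" step that precedes the fault.
    void PreWire(int thread, std::uintptr_t base, std::size_t size) {
        assert(size > 0);
        ranges.push_back({thread, base, size, true});
    }

    // True if the fault path of `thread` would consider `address` wired.
    // A thread always skips its own ranges. With ignorePreWired set, it also
    // skips ranges that other threads have only pre-wired.
    bool LooksWiredTo(int thread, std::uintptr_t address,
                      bool ignorePreWired = false) const {
        for (const WiredRange& range : ranges) {
            if (range.owner == thread)
                continue;
            if (ignorePreWired && range.preWired)
                continue;
            if (address >= range.base && address - range.base < range.size)
                return true;
        }
        return false;
    }
};

// Two threads pre-wire the same page. Returns true if each would then wait
// for the other's range to go away.
bool BothThreadsWouldWait(std::uintptr_t page, bool ignorePreWired) {
    ToyAddressSpace space;
    space.PreWire(1, page, 4096);
    space.PreWire(2, page, 4096);
    return space.LooksWiredTo(1, page, ignorePreWired)
        && space.LooksWiredTo(2, page, ignorePreWired);
}
```

In this model, BothThreadsWouldWait(0x1000, false) is true, which corresponds to the mutual wait described above, while BothThreadsWouldWait(0x1000, true) is false. The model says nothing about when the marks would have to be removed again, which is the open question here.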

by pulkomandy, 10 years ago

Attachment: haiku-kdl-20140701.log added

Same panic occurred again, attaching complete syslog.

comment:2 by pulkomandy, 10 years ago

Attached another backtrace of apparently the same or a similar problem.

comment:3 by diver, 10 years ago

Cc: hamish added

comment:4 by bonefish, 10 years ago

Please don't attach more stuff to this ticket (unless it's a solution or a small test case). The problem itself is understood, as documented in comment:1. Adding more comments/attachments will just bury that comment.

If you run into an issue which you think might be different, please create a new ticket. The pattern here is: one thread panic()s in fault_get_page() while another thread waits in vm_soft_fault().

comment:5 by korli, 10 years ago

For reference, it would be nice to know the actual virtual machine software and version.

comment:6 by bonefish, 10 years ago

Blocking: 11000 added

(In #11000) Closing as duplicate of #10977.

by bonefish, 10 years ago

Attachment: 10977.cpp added

Test program to reproduce the issue

comment:7 by bonefish, 10 years ago

Attached a test program that fairly reliably reproduces the issue on (virtual) hardware with 3 or more CPUs. Tested with qemu.

Compile with:

g++ -Wall -o 10977 10977.cpp

Run with:

while ./10977; do true; done

comment:8 by bonefish, 10 years ago

Status: new → in-progress

comment:9 by bonefish, 10 years ago

Resolution: fixed
Status: in-progress → closed

The ASSERT is no longer triggered with hrev48140, which also fixes a previously unconsidered follow-up issue. The issue mentioned in comment:1 is fixed in hrev48145.
