Opened 5 years ago

Closed 5 years ago

#10977 closed bug (fixed)

kdl in pthread_cond_wait on netsurf buildslave

Reported by: pulkomandy
Owned by: bonefish
Priority: normal
Milestone: R1
Component: System/Kernel
Version: R1/Development
Keywords:
Cc: hamish
Blocked By:
Blocking: #11000
Has a Patch: no
Platform: All

Description

Another KDL from NetSurf's build slave. We have tried to extract as much info as possible over IRC. This happens somewhere in OpenJDK. It seems the condition variable doesn't exist and two threads are accessing it.

This is on a virtual machine installed as follows: http://wiki.netsurf-browser.org/Continuous_Integration_Manual_Haiku_Slave_Setup

Attachments (5)

1_kdl (11.1 KB) - added by pulkomandy 5 years ago.
Backtraces of the two threads and KDL message
2_cvar (1.0 KB) - added by pulkomandy 5 years ago.
cvars list and attempt to access the cvar
3_area (661 bytes) - added by pulkomandy 5 years ago.
Attempt to list the area.
haiku-kdl-20140701.log (48.0 KB) - added by pulkomandy 5 years ago.
Same panic occurred again, attaching complete syslog.
10977.cpp (2.1 KB) - added by bonefish 5 years ago.
Test program to reproduce the issue

Change History (14)

Changed 5 years ago by pulkomandy

Attachment: 1_kdl added

Backtraces of the two threads and KDL message

Changed 5 years ago by pulkomandy

Attachment: 2_cvar added

cvars list and attempt to access the cvar

Changed 5 years ago by pulkomandy

Attachment: 3_area added

Attempt to list the area.

comment:1 Changed 5 years ago by bonefish

The immediate cause of the panic is just a missing

DEBUG_PAGE_ACCESS_END(context.page);

in vm_soft_fault() before unlocking everything when having to wait for a to-be-unmapped page to become unwired.

Unfortunately that will leave another issue to be resolved: The page we want to unmap isn't actually wired in this case. We have two threads that want to wire the same virtual page for writing. The way wire_page() works, they both first mark the respective address range wired before calling vm_soft_fault() to map a writable page (there's only a readable one from a lower cache). Each thread ignores its own pre-wired range, but not that of the other thread. Hence the read-only page looks wired to both of them and cannot be unmapped. Both threads would wait forever.

Not sure how involved a solution would be. It might be possible to mark pre-wired ranges as such and ignore them in vm_soft_fault(), but this needs to be thought through thoroughly (particularly when to unmark the ranges).
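
To make the reasoning above easier to follow, here is a toy user-space model of it. It is only a sketch: WiredRange, LooksWired() and BothThreadsWait() are hypothetical names invented for illustration, not Haiku kernel code, and the ignorePreWired flag models the marking idea from the previous paragraph rather than whatever an actual fix ends up doing.

```cpp
#include <cassert>
#include <vector>

// Toy model only: hypothetical types, not the kernel's data structures.
struct WiredRange {
	int owner;      // ID of the thread that marked the range wired
	bool preWired;  // true if marked only in preparation for a fault
};

// Returns true if the page looks wired from the point of view of `thread`.
// A thread always skips its own range. With `ignorePreWired` set, it also
// skips ranges that other threads have marked as pre-wired.
bool LooksWired(const std::vector<WiredRange>& ranges, int thread,
	bool ignorePreWired)
{
	assert(thread > 0);  // thread IDs in this model are positive
	for (const WiredRange& range : ranges) {
		if (range.owner == thread)
			continue;
		if (ignorePreWired && range.preWired)
			continue;
		return true;
	}
	return false;
}

// Two threads (1 and 2) have both pre-wired the same page. Returns true if
// each of them would end up waiting for the page to become unwired.
bool BothThreadsWait(bool ignorePreWired)
{
	std::vector<WiredRange> ranges = { {1, true}, {2, true} };
	return LooksWired(ranges, 1, ignorePreWired)
		&& LooksWired(ranges, 2, ignorePreWired);
}
```

In this model BothThreadsWait(false) returns true, which corresponds to the mutual wait described above, while BothThreadsWait(true) returns false. A range that is really wired (preWired == false) by a third thread still makes the page look wired in either mode. The model says nothing about when to unmark ranges, which is the hard part.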

Changed 5 years ago by pulkomandy

Attachment: haiku-kdl-20140701.log added

Same panic occurred again, attaching complete syslog.

comment:2 Changed 5 years ago by pulkomandy

Attached another backtrace of apparently the same or a similar problem.

comment:3 Changed 5 years ago by diver

Cc: hamish added

comment:4 Changed 5 years ago by bonefish

Please don't attach more stuff to this ticket (unless it's a solution or a small test case). The problem itself is understood, as documented in comment:1. Adding more comments/attachments will just bury that comment.

If you run into an issue which you think might be different, please create a new ticket. The pattern here is: one thread panic()s in fault_get_page() while another thread waits in vm_soft_fault().

comment:5 Changed 5 years ago by korli

For reference, it would be nice to know the actual virtual machine software and version.

comment:6 Changed 5 years ago by bonefish

Blocking: 11000 added

(In #11000) Closing as duplicate of #10977.

Changed 5 years ago by bonefish

Attachment: 10977.cpp added

Test program to reproduce the issue

comment:7 Changed 5 years ago by bonefish

Attached a test program that fairly reliably reproduces the issue on (virtual) hardware with 3 or more CPUs. Tested with qemu.

Compile with:

g++ -Wall -o 10977 10977.cpp

Run with:

while ./10977; do true; done

comment:8 Changed 5 years ago by bonefish

Status: new → in-progress

comment:9 Changed 5 years ago by bonefish

Resolution: fixed
Status: in-progress → closed

The ASSERT is no longer triggered with hrev48140, which also fixes a follow-up issue that had not been considered before. The issue mentioned in comment:1 is fixed in hrev48145.
