Opened 12 years ago

Closed 10 years ago

Last modified 10 years ago

#8007 closed bug (fixed)

[app_server] fully unresponsive when resizing window of big textfile

Reported by: ttcoder
Owned by: jua
Priority: normal
Milestone: R1
Component: System/Kernel
Version: R1/Development
Keywords:
Cc: pete.goodeve@…, mmlr
Blocked By:
Blocking: #7882, #8136
Platform: All

Description

After opening a 9 MiB textfile, if you resize the StyledEdit window the OS freezes for a while: the mouse pointer becomes stuck and the rest of the display is not refreshed either.

Attachments (7)

sched.rec (1.3 MB ) - added by Pete 12 years ago.
scheduling_recorder output from StyledEdit freeze
Mutex-priority-inheritance_version2.patch (18.7 KB ) - added by jua 12 years ago.
Patch implementing waiters priority sorting and mutex priority inheritance
Mutex-priority-inheritance_version3.patch (18.7 KB ) - added by jua 12 years ago.
Version 3 of patch implementing priority inheritance for mutexes and priority sorting for mutex waiters
Mutex-priority-inheritance_version4.patch (19.3 KB ) - added by jua 12 years ago.
Version 4 of patch implementing priority inheritance for mutexes and priority sorting for mutex waiters
Mutex-priority-inheritance_version5.patch (31.4 KB ) - added by jua 11 years ago.
Patch version 5
0001-Reduce-lock-contention-in-kernel-port-subsystem.patch (38.5 KB ) - added by jua 10 years ago.
Patch 0001: port system lock restructuring
0002-Move-B_MOUSE_IDLE-generation-to-app_server.patch (6.4 KB ) - added by jua 10 years ago.
Move B_MOUSE_IDLE generation to app_server


Change History (122)

comment:1 by ttcoder, 12 years ago

Steps: You may for instance...

Expected behavior:

  • the OS should stay responsive, or at worst sluggish (a la Windows) but not completely frozen, while StyledEdit recalculates its Frame() and layout.

Actual Behavior:

  • the mouse pointer stops moving completely for dozens of seconds.
  • keyboard unresponsive as well.
  • the screen itself stops refreshing.

Basically the whole computer is frozen for up to a minute if the file is big enough.

Hardware: This is an AthlonXP 1800+ with 512 MB of RAM. So only one "core" (or "CPU"), in case it matters (and maybe it does, seeing the experiences in #7882? there seems to be an emerging pattern...)

To be clear, there might be misbehaving (or at least unoptimized) code in StyledEditWindow::FrameResized(), who knows... But the app_server really shouldn't be vulnerable to that, even from B_DISPLAY_PRIORITY threads! So I'm filing this against app_server initially (I have no idea if this problem goes deeper, into the actual management of threads in the kernel?).

To reproduce this bug, those with dual-core systems might want to open Pulse and disable one of the cores I guess.

This is with hrev42760.

P.S. I swear I was not working on extending my torture tests in #7882 :-)

comment:2 by Pete, 12 years ago

(...following ttcoder's link from #7285...) The steps above are exactly reproducible on my machine (single Pentium 4, 2.79 GHz, 512 MB).

As a further datapoint or two, I telnet'ed in from my Linux machine and ran 'top'. This showed StyledEdit using fully 100% of the CPU (105% at a few points!), with nothing else showing any significant usage.

I also did a (remote) 'ps -a' while StyledEdit had the Haiku screen locked. I couldn't see any high priority task associated with StyledEdit. (All its threads were at ~15-18.)

Despite the fact that I could have top et al running remotely while Haiku was locked, trying my test (command-line-only) ToneProducer remotely didn't work any better than running from a Haiku Terminal -- silence with quite a few 'crackles' in both cases (and debug printout showing buffers late by -- eventually -- 24 seconds!!).

Obviously this shouldn't be happening, even if StyledEdit is going wild, because it is low priority, and both display and audio should be higher.

comment:3 by Pete, 12 years ago

Another observation -- perhaps not entirely unexpected:

I opened the big text file in StyledEdit, after removing the StyledEdit-info attribute so it wouldn't immediately freeze. Then, before resizing, I lowered the priority of the resulting window (using the 'prio' command so I could have finer control than with ProcessController).

If I set the priority to 10, resizing the window never froze. OTOH, it wasn't very good at refreshing the contents (not surprisingly). It seemed to, eventually, but I don't know the trigger.

With the priority between 11 and 14, there was freezing once again, but it seemed to vary with the priority level, going back to the usual freeze when it reached 15 again.

I tried looking to see if I could find a lower-priority task that might be affecting it but didn't find one.

Is it possible the scheduler isn't time-slicing properly?

comment:4 by Pete, 12 years ago

Cc: pete.goodeve@… added

in reply to:  3 ; comment:5 by bonefish, 12 years ago

Replying to Pete:

Is it possible the scheduler isn't time-slicing properly?

I would rather suspect a locking related issue.

There are debugging tools to analyze such a situation. The scheduling_recorder command line tool records scheduling data while it runs (the only parameter in this case would be the output file into which it writes the data). It should be started immediately before running the test (i.e. the resizing itself) and stopped (Ctrl-C) when done. The resulting file is then passed as a command line argument to DebugAnalyzer, a GUI app that allows analyzing various aspects of scheduling and locking behavior.

I haven't really used the tools for a long time. I quickly tested them while working on my signals branch (since I had reworked some related kernel code) and they seemed to work fine. Neither tool is in the default image.

in reply to:  5 ; comment:6 by Pete, 12 years ago

Replying to bonefish:

There are debugging tools to analyze such a situation. The scheduling_recorder command line tool records scheduling data while it runs (the only parameter in this case would be the output file into which it writes the data). It should be started immediately before running the test (i.e. the resizing itself) and stopped (Ctrl-C) when done. The resulting file is then passed as a command line argument to DebugAnalyzer, a GUI app that allows analyzing various aspects of scheduling and locking behavior.

I tried this, but I'm afraid DebugAnalyzer just crashed. I first did:

scheduling_recorder sched.rec StyledEdit costume-designers.list

and got a 5 MB file "sched.rec". Then did:

DebugAnalyzer sched.rec

The Terminal just spewed (and overflowed with) a lot of lines like:

Schedule event for unknown thread: 5387
Enqueued in run queue event for unknown thread: 5387
Schedule event for unknown previous thread: 5387
Schedule event for unknown thread: 5387
Enqueued in run queue event for unknown thread: 5387
Schedule event for unknown previous thread: 5387
.....
Enqueued in run queue event for unknown thread: 5384
Schedule event for unknown thread: 5384
Enqueued in run queue event for unknown thread: 5383
Schedule event for unknown thread: 5383
Schedule event for unknown previous thread: 5383
Enqueued in run queue event for unknown thread: 5382
Schedule event for unknown thread: 5382
Schedule event for unknown previous thread: 5382

and went to the debugger. This was the trace:

Thread 5403 caused an exception: Segment violation
[....]
[Switching to team /boot/common/bin/DebugAnalyzer sched.rec (5389) thread model loader (5403)]
0x00266a2d in ModelLoader::_SetThreadEvents ()
(gdb) bt
#0  0x00266a2d in ModelLoader::_SetThreadEvents ()
#1  0x00265bb8 in ModelLoader::_Load ()
#2  0x002654bc in ModelLoader::Load ()
#3  0x00264e71 in AbstractModelLoader::_Loader ()
#4  0x00264e3f in AbstractModelLoader::_LoaderEntry ()
#5  0x005cc327 in thread_entry () from /boot/system/lib/libroot.so
#6  0x78137fec in ?? ()
(gdb) 

Any thoughts?

(Just to be sure, I tried the way you said -- running scheduling_recorder from the Terminal and starting StyledEdit directly (by clicking on the file). This way, I found I couldn't even quit scheduling_recorder with Ctrl-C! Nor did it quit when I closed the terminal. I had to do a quick 'ps -a' to find and kill it, by which time the recording had reached 25 MB... Trying DebugAnalyzer on that file had the same crash as before.)

in reply to:  6 comment:7 by Pete, 12 years ago

Replying to Pete:

Replying to bonefish:

FWIW, I tried a couple of experiments:

scheduling_recorder sched.rec2 echo hello

generated a file that DebugAnalyzer had no trouble with (though I'm not sure I understand all it was displaying :-/)

Recording StyledEdit by itself, without the large file, crashed in the same way as before, as did recording something like DeskCalc.

comment:8 by Pete, 12 years ago

I think I've at least found out how to prevent scheduling_recorder from generating an unhandleable file. It seems to fail if any GUI-dependent (?) app is started after the recorder itself. Command-line apps seem to be OK.

What I found I could do was first start StyledEdit (with the 'StyledEdit-info' attribute removed from the work file to prevent immediate lockup), and in a Terminal, quickly start

scheduling_recorder sched.rec sleep 10

then immediately resize the window.

This generates a readable 10-second file that I could look at with the analyzer. Unfortunately I don't know enough about what it's showing me to tell much, except that the StyledEdit window eventually takes over everything. Even the event loop and the cursor loop get completely frozen out.

If anybody else thinks it's useful to look at this file (1.3MB) I'll attach it here, but I won't waste space if nobody's interested.

comment:9 by ttcoder, 12 years ago

At first I was tempted to try it out on my own, but seeing the trouble you went through and the fact I have no particular skill in kernel-land/locking/etc to give me hope I'll fare better, I'm very tempted to use the "let's not duplicate work!" card ;-)

After you upload maybe we could create a "digest" version of the file with excerpts that seem interesting, for analysis by Ingo ..etc to make their job easier..

I'm guessing one of the Haiku crew will have an "elementary, my dear Watson!" moment eventually, if provided with enough hints to narrow down the list of suspects:

  • if a lock is responsible, it would be one common to app_server and media_server, so maybe that narrows things down to the kernel?
  • there might be something specific to single-core systems (this has yet to be confirmed by someone with a dual/quad core doing the simple experiment of disabling all but one core and replicating the bug though.. or, to the contrary, reporting that the freeze occurs even with all cores enabled.. -- oh, this reminds me, I'll ping Sean and try to bring him into this ticket :-)

If I had more time I'd try to code a minimalist .cpp that reliably reproduces the freeze, maybe a ten-line program that hammers FrameResized()... I could end up realizing this does not freeze anything and that it takes an additional ResizeTo() call to freeze the system, which could hint at the video driver being involved (those have a kernel-land counterpart to the user-land accelerant after all, probably involving locks of some sort?). Or if it didn't, it would hint at the app_server, etc.
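A rough sketch of what such a hammering program might look like (untested, with arbitrary window sizes and iteration count -- the app signature and numbers are made up):

#include <Application.h>
#include <Window.h>

int main()
{
    BApplication app("application/x-vnd.test-resize-hammer");
    BWindow* win = new BWindow(BRect(50, 50, 250, 250), "hammer",
        B_TITLED_WINDOW, 0);
    win->Show();
    for (int i = 0; i < 100000; i++) {
        win->Lock();
        win->ResizeTo(200 + i % 400, 200 + i % 300);  // hammer the resize path
        win->Unlock();
    }
    return 0;
}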


comment:10 by ttcoder, 12 years ago

Just realized while catching up on ticket history that Ingo is debugging DebugAnalyzer.. Things are moving, we just have to be patient 8-)

in reply to:  10 ; comment:11 by bonefish, 12 years ago

Replying to ttcoder:

Just realized while catching up on ticket history that Ingo is debugging DebugAnalyzer.. Things are moving, we just have to be patient 8-)

While the crash in DebugAnalyzer can be fixed -- it iterates twice through the data, but handles the unknown threads only the first time -- that wouldn't really help to get a usable analysis, since thread names and team associations for those threads would be missing. The main issue is that "add" entries for those threads are missing in the data, which is something I apparently broke when last working on the kernel.

At any rate, Pete, please add the data file you've got. It shouldn't be too hard to see why e.g. the mouse cursor thread is blocking (i.e. on which wait object) and who's responsible.

comment:12 by SeanCollins, 12 years ago

I tried some of the experiments mentioned above on both my quad and hexa core CPUs. I definitely see a spike in CPU usage but no GUI blocking. I was able to replicate the symptoms on a single-core machine easily. It appears that on dual/tri/quad/hexa core CPUs the problem is not apparent, as there is enough CPU horsepower to absorb it without serious performance degradation.

by Pete, 12 years ago

Attachment: sched.rec added

scheduling_recorder output from StyledEdit freeze

in reply to:  11 comment:13 by Pete, 12 years ago

Replying to bonefish:

At any rate, Pete, please add the data file you've got. It shouldn't be too hard to see why e.g. the mouse cursor thread is blocking (i.e. on which wait object) and who's responsible.

Attached... For reference, I started resizing the window around the 2.85 sec mark, and the freeze set in at about 5 secs or a little before.

comment:14 by bonefish, 12 years ago

Cc: mmlr added

Thanks, Pete! Looking at the analysis a few things are interesting:

  • The "cursor loop" thread, which I believe extracts the mouse events from the input server messages and updates the cursor position, shows no activity at all between 5.15 s and 6.95 s and only little activity afterwards (compared to its activity before 5.15 s). I guess the activity depends on the actual mouse movement, so it's hard to say whether the activity matches your mouse movement at that time. Its wait behavior looks pretty much like one would expect, I suppose: It waited 294 times on the "cursor semaphore" for a total of 9.74 s and 4 times on the "floating overlays lock" for a total of 0.004 s. So the frozen cursor would either point to the event or the drawing side of things.
  • I noticed that there's a total wait time of 1 min 1 s (1159 times) on the "ports list" mutex, which is a serious problem. I'm not sure if I messed that up when reworking the kernel locking or if it was already broken before. From a quick scan of the ports code one might think that locking the port's lock while holding sPortsLock (particularly in get_locked_port() and set_port_owner()) could be the reason, but none of the port locks actually shows up with a significant amount of wait time ("system:roster" with a total of 0.0001 s, four other locks with even less). So something else must be the cause. At any rate, I believe the enormous contention on the "ports list" lock is the cause for the visual freeze, since port messages cannot be sent and threads cannot start waiting on a port without first acquiring this lock. And port messaging is kind of the backbone of communication with and within the app_server.
  • I also noticed that there are 397 waits and a total wait time of 7.44 s on the "USB Stack Allocator" mutex. That sounds like rather a lot for a mutex and might be something that should be optimized. CCing mmlr for good measure.

comment:15 by bonefish, 12 years ago

I examined the analysis a bit further and it actually doesn't seem to be a bug in the ports code as such. It looks more like a bad case of priority inversion. Since threads with all kinds of different priorities use port messaging there's a good chance for this to happen for the involved kernel locks. Looking at the scheduling data at around 5.53 s one can see that the "Tracker" thread (priority 10) -- which has been waiting for around 62 ms -- acquires the "ports list" mutex. Since there's a higher priority thread running (or ready) at the time it doesn't run for another 0.22 s, thus blocking higher priority threads waiting on the same lock during that time, most notably the "_input_server_event_loop_" thread (priority 103).

There's quite a bit of locking ping pong going on afterwards, but I believe the core of the issue is that we have a FIFO policy for mutexes and other locking primitives but no strategy (e.g. priority boosting) to address the priority inversion issue. That also explains why multi-CPU machines don't have the problem (the high-priority thread runs on one CPU, while the other threads can still be served by another CPU) and why decreasing the thread priority to 10 helps as well (all threads using port messaging probably have a priority >= 10 and thus still get a fair share of the CPU).

I guess the only good solution is to implement some strategy to counter priority inversion.
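For illustration, the pattern can be reproduced in user-land with something like the following sketch (untested; the names, priorities, and loop counts are made up, and a semaphore stands in for the contended kernel mutex):

#include <OS.h>
#include <stdio.h>

static sem_id sLock;  // stands in for the contended "ports list" mutex

static int32 low_prio(void*)  // priority 5: takes the lock, then works a while
{
    acquire_sem(sLock);
    for (volatile int64 i = 0; i < 500000000; i++) ;  // long work, lock held
    release_sem(sLock);
    return 0;
}

static int32 medium_prio(void*)  // priority 15: ignores the lock, burns CPU
{
    for (volatile int64 i = 0; i < 500000000; i++) ;
    return 0;
}

static int32 high_prio(void*)  // priority 103: blocks on low_prio's lock
{
    bigtime_t start = system_time();
    acquire_sem(sLock);
    printf("high-priority thread waited %lld us\n", system_time() - start);
    release_sem(sLock);
    return 0;
}

int main()
{
    sLock = create_sem(1, "demo lock");
    thread_id low = spawn_thread(low_prio, "low", 5, NULL);
    resume_thread(low);
    snooze(10000);  // let "low" grab the lock first
    thread_id medium = spawn_thread(medium_prio, "medium", 15, NULL);
    thread_id high = spawn_thread(high_prio, "high", 103, NULL);
    resume_thread(medium);
    resume_thread(high);
    // On a single CPU "medium" starves "low", so "high" ends up waiting for
    // most of "medium"'s run time despite its much higher priority.
    status_t result;
    wait_for_thread(high, &result);
    wait_for_thread(medium, &result);
    wait_for_thread(low, &result);
    delete_sem(sLock);
    return 0;
}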

comment:16 by SeanCollins, 12 years ago

I wonder if your thoughts on scheduling and locking ping pong would have any bearing on the performance disparity seen when building Haiku.

I get this for a compile time building Haiku on Haiku. I understand the Haiku kernel design does take an efficiency drop-off in comparison to more throughput-favoring designs, but this seems a bit atrocious to me.

I got a system time of 15 min and a kernel time of 50 min to build Haiku. The disparity is much smaller on a Linux system. I just wouldn't expect such a drastic difference in compile time. Maybe this is all pointing to the same symptom.

comment:17 by ttcoder, 12 years ago

@bonefish if you suspect a particular rev. to be the trigger of the regression (or if a range of revs are candidates for bisecting) I can test that here if you need. But in your latest comment it seems the suspicion is not of a recent regression, but of mutex acquisition and thread scheduling being "out of phase" with each other? So a thread can make a system call to acquire_sem(), and the kernel may grant that semaphore but not schedule the thread to work (and eventually release the sem) immediately, hmmm. The kernel neophyte in me wonders: why not do it the other way around -- when the kernel schedules a given thread for running (according to priority etc.), it also looks at what system calls are pending for that thread and grants it the resource (if available; otherwise it goes on to the next schedulable thread).

In parallel I'm digging a little to find what kind of system resources (ports..) are involved in this bug and thus could be the culprit. I've been thinking of StyledEdit's BTextView (which is more in my area of "expertise")... The hammering could be caused by the long lines "Soft Wrap'ping" calculations. Except the bug does not occur on the initial file opening with default window size, only when resizing the window (and soft wrapping) to a larger size... Note to self: try to uncheck the "wrap long lines" menu item and see if StyledEdit still freezes the system.

More notes to self: when you have a million+ lines to soft-wrap in a textfile, that must entail an awful lot of BView|BFont::StringWidth() calls. These probably involve a synchronous round-trip to the app_server, with a write_port() call, from within the B_DISPLAY_PRIORITY window thread. Unless StyledEdit has a mechanism in place for caching string lengths and avoiding such performance hits... But given how its performance degrades I guess it uses normal code. If this explanation holds water, it'd mean a minimal test case to replicate the freeze would look like (a more compilable version of) this:

#include <Application.h>
#include <Window.h>
#include <View.h>

int main()
{
    BApplication app("application/x-vnd.test-stringwidth");  // create link to app_server
    BWindow* win = new BWindow(BRect(50, 50, 450, 350), "test", B_TITLED_WINDOW, 0);
    BView* view = new BView(win->Bounds(), "view", B_FOLLOW_ALL, 0);
    win->AddChild(view);  // make sure StringWidth() will round-trip to app_server
    win->Show();  // (maybe showing the window is a prerequisite first?)

    for (int i = 0; i < 1000000; i++)
        view->StringWidth("is this string more than 80 columns and to be wrapped");

    return 0;
}

In BTextView it seems that BFont::GetHeight() is called from within a very inner loop; I should check if that implies a round-trip to app_server... And here is an uncached StringWidth(), together with an ifdef'ed cached one, hmmm.

comment:18 by axeld, 12 years ago

While this doesn't solve the general priority inversion issue, we could at least improve the ports problem by having 'real time' ports that go into another port list. The port lock wait time sounds pretty high in any case.

I could imagine that not using strict FIFO order in mutexes would get quite complex.

in reply to:  17 comment:19 by bonefish, 12 years ago

Replying to axeld:

While this doesn't solve the general priority inversion issue, we could at least improve the ports problem by having 'real time' ports that go into another port list.

I don't think this would help much in this case. It might get the mouse cursor going again, but pretty much everything else would still be toast.

I could imagine that not using strict FIFO order in mutexes would get quite complex.

The implementation would be rather simple, I think. Instead of queuing new threads at the end of the waiting threads list, they would be inserted according to their priority. That might improve the situation, but still, once a low priority thread gets the lock and is prevented from running by a busy higher priority thread, any other thread that tries to acquire the lock will be blocked, resulting in very high latencies for high priority threads (like the 0.22 s observed in the example).
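Roughly, such an insertion could look like this (a sketch only; the stand-in types and field names are assumptions loosely modeled on the mutex_waiter list in src/system/kernel/locks/lock.cpp, not the actual code):

#include <SupportDefs.h>

// Minimal stand-ins so the sketch is self-contained:
struct Thread { int32 priority; };
struct mutex_waiter { Thread* thread; mutex_waiter* next; };
struct mutex { mutex_waiter* waiters; };

static void
insert_waiter_sorted(mutex* lock, mutex_waiter* waiter)
{
    // Walk past waiters of equal or higher priority, so that threads of the
    // same priority keep their FIFO order among themselves.
    mutex_waiter** link = &lock->waiters;
    while (*link != NULL
            && (*link)->thread->priority >= waiter->thread->priority) {
        link = &(*link)->next;
    }
    waiter->next = *link;
    *link = waiter;
}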

A fairer scheduler -- that allots lower priority threads more run time -- could improve the situation as well. That would also increase latencies for higher priority threads, though.

Implementing priority inheritance is the only strategy that seems to solve this kind of issue in general. It's unfortunately also not completely trivial to do.

The problem at hand is actually caused by a relatively high priority (15) thread doing a lot of work, which conflicts with the general policy "the higher the priority the less work should be done". So, regardless of whether or not we address the priority inversion issue at kernel level, BTextView or StyledEdit (whoever is to blame) should also be fixed to offload the work to a normal priority thread.

Replying to SeanCollins:

I got a system time of 15min and a kernel time of 50min to build Haiku. The disparity is much smaller on Linux system. I just wouldn't expect such a drastic difference in compile time. Maybe this is all pointing to the same symptom.

I don't think this is related at all. I particularly analyzed the Haiku building test case a while back and didn't notice any problem of that kind. I believe the main difference between Haiku's kernel and the kernel of Linux or FreeBSD is that they are a lot better optimized. E.g. we don't do any pre-faulting. IIRC FreeBSD maps up to 7 or 8 additional pages on a page fault. Thus they decrease the number of page faults significantly and from what I recall from profiling page faults make up a significant chunk of the kernel work that happens while compiling. Furthermore Linux and FreeBSD have a lot more optimized common functions (like those in <string.h>) than Haiku. This reduces kernel times and user times (which on Haiku are also much higher than on the other systems). So, I believe the performance gap is mostly caused by missing optimization in Haiku. Also note that in case you tested with the default kernel debug level (2) that adds significantly to the kernel time.

Replying to ttcoder:

@bonefish if you suspect a particular rev. to be the trigger of the regression (or if a range of revs are candidates for bisecting) I can test that here if you need. But in your latest comment it seems the suspicion is not of a recent regression, but of mutex acquisition and thread scheduling being "out of phase" with each other? So a thread can make a system call to acquire_sem(), and the kernel may grant that semaphore but not schedule the thread to work (and eventually release the sem) immediately, hmmm. The kernel neophyte in me wonders: why not do it the other way around -- when the kernel schedules a given thread for running (according to priority etc.), it also looks at what system calls are pending for that thread and grants it the resource (if available; otherwise it goes on to the next schedulable thread).

ATM the scheduler only considers threads that are ready to run. It doesn't even know about waiting threads. When a thread releases a mutex (similar for other locking primitives) it picks a thread currently waiting on that mutex (ATM strictly FIFO) as the next lock owner and puts it in the ready-to-run scheduling queue. This makes the thread, which now holds the lock, eligible to run. When it will actually run is a decision of the scheduler. As written above, changing the FIFO strategy to a priority based one (which goes in the direction you suggest) wouldn't be that hard, but it wouldn't fully solve the issue. Moreover, since higher priority threads would be able to overtake lower priority ones, the latter can now starve while waiting on a lock, which is an undesirable property as well.

Postponing the decision which of the waiting threads to make the new lock owner even further, into the scheduler (which is what you suggest), would make the scheduler quite a bit more complicated. It would no longer only need to consider threads that are ready to run, but also sets of threads from which to maybe choose one. Possible, but not even a full solution to the problem either, since once the scheduler has chosen a thread, that thread might be preempted while still holding the lock, which would again set the stage for priority inversion.

comment:20 by axeld, 12 years ago

Is priority inversion really so hard to circumvent, though? If we only put this functionality into mutex, it would just need to adjust the current lock holder's thread priority if a higher priority thread enters its wait queue, right?

in reply to:  20 comment:21 by anevilyak, 12 years ago

Replying to axeld:

Is priority inversion really so hard to circumvent, though? If we only put this functionality into mutex, it would just need to adjust the current lock holder's thread priority if a higher priority thread enters its wait queue, right?

Correct me if I'm wrong, but I thought the issue isn't threads that're also waiting on the mutex (since those by definition aren't going to be scheduled), but rather, other high priority threads that actually are ready and being scheduled, thus preventing the lower priority thread that holds the mutex from doing its work and releasing it again? A simple strategy could indeed be to boost the priority of the mutex owner until they unlock it again, but then the question would be, how much do you boost it by?

comment:22 by axeld, 12 years ago

I am not really a fan of an unconditional thread priority boost after acquiring a lock - we actually had such a system in place, once, but it didn't really work with our threading model. Maybe it works better when we restrict it to mutexes, but I somehow doubt it; why should a low priority thread be able to preempt a high priority thread just because it acquired a mutex?

Priority inversion means that low priority threads are able to starve high priority threads by sharing the same lock - due to other system activity, the low priority thread doesn't get through with its work, so the high priority thread effectively inherits the low priority, thus it becomes a victim of priority inversion.


in reply to:  20 comment:23 by bonefish, 12 years ago

Component: Servers/app_server → System/Kernel

Replying to axeld:

Is priority inversion really so hard to circumvent, though? If we only put this functionality into mutex, it would just need to adjust the current lock holder's thread priority if a higher priority thread enters its wait queue, right?

That's the basic idea. There are a few details that complicate things, though. When a thread unlocks a mutex (or transfers ownership) its priority possibly needs to be adjusted again. Since it can also hold other mutexes at that time, the thread needs to maintain a list of mutexes it holds, so the correct new priority can be determined. That also requires us to check whether all code that uses mutexes explicitly destroys them. Timeouts on mutexes add to the fun (as the holder's priority might need to be adjusted).

That's all doable, but not completely trivial either.
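A bare-bones sketch of the boost half (hypothetical code, not the patch that later appears on this ticket; it assumes the holder is directly reachable from the lock and that the scheduler exposes a priority setter, and it omits all the unboost bookkeeping just described):

#include <SupportDefs.h>

// Stand-ins for the kernel's types/functions (assumptions):
struct Thread { int32 priority; };
struct mutex { Thread* holder; /* waiter list, ... */ };
void scheduler_set_thread_priority(Thread* thread, int32 priority);  // assumed
void thread_block();  // assumed

static void
mutex_lock_contended(mutex* lock, Thread* waiter)
{
    Thread* holder = lock->holder;
    if (holder != NULL && holder->priority < waiter->priority) {
        // Priority inheritance: lift the holder so a busy medium-priority
        // thread can no longer keep it -- and therefore us -- off the CPU.
        scheduler_set_thread_priority(holder, waiter->priority);
    }
    // ...queue 'waiter' (FIFO or priority-sorted) and block until the lock
    // is handed over; on unlock the holder must recompute its priority from
    // the contended locks it still holds.
    thread_block();
}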

comment:24 by axeld, 12 years ago

I would implement an easier approach at first, and see if that already serves us well enough, namely just boosting the next quantum of that thread. But you've probably removed the thread::next_priority field already in the meantime, I guess? That would make this change really simple.

Implementing the real thing unfortunately also makes mutexes a bit more expensive.

comment:25 by SeanCollins, 12 years ago

@bonefish

I do not have any changes made to the kernel debug levels. I can reduce those settings and retest to see if the build time is dramatically affected? Thank you for the explanation.

@ Ticket

If you have some ideas you want tested, let me know and I will test any patches you come up with.

in reply to:  25 comment:26 by bonefish, 12 years ago

Replying to axeld:

I would implement an easier approach at first, and see if that already serves us well enough, namely just boosting the next quantum of that thread. But you've probably removed the thread::next_priority field already in the meantime, I guess? That would make this change really simple.

Thread::next_priority is still there. Since it is only used (and immediately reset) in enqueue_in_run_queue() that implementation might not be that simple either.

Implementing the real thing unfortunately also makes mutexes a bit more expensive.

Memory-wise it should be three pointers (list link and thread pointer) plus maybe a priority field and minus the counter field. Performance-wise it shouldn't affect uncontended locks at all. It should still be possible to implement mutex_{lock,unlock}() inline as an atomic operation for a mutex that isn't locked, respectively on which no other thread is waiting. Only the less likely operations will become more complex.

I have to amend what I wrote in my previous comment: There's no need to maintain a list of all locks a thread holds. It is sufficient to include only the locks that other threads are waiting on. That also relieves us from reviewing all code that currently uses the locks we would change with respect to explicit lock destruction, since a lock a thread is still waiting on mustn't be destroyed anyway.
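Roughly like this (a sketch, modeled on the current count-based mutex; the count encoding and the slow-path helpers are assumptions, not the eventual implementation):

#include <SupportDefs.h>

struct mutex { int32 count; /* ... */ };  // assumption: 0 unlocked, -1 locked

void _mutex_lock_contended(mutex* lock);    // hypothetical slow path
void _mutex_unlock_contended(mutex* lock);  // hypothetical slow path

static inline void
mutex_lock_inline(mutex* lock)
{
    // Uncontended fast path: claim the lock with a single atomic operation,
    // no waiter list or priority bookkeeping involved.
    if (atomic_test_and_set(&lock->count, -1, 0) != 0)
        _mutex_lock_contended(lock);
}

static inline void
mutex_unlock_inline(mutex* lock)
{
    // -1 means locked with no waiters; anything else means waiters exist and
    // the slow path must transfer ownership and drop any inherited priority.
    if (atomic_test_and_set(&lock->count, 0, -1) != -1)
        _mutex_unlock_contended(lock);
}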

Replying to SeanCollins:

@bonefish

I do not have any changes made to the kernel debug levels. I can reduce those settings and retest to see if the build time is dramatically affected? Thank you for the explanation.

It's a compile time option. Have a look at the build/config_headers directory, particularly the ReadMe and kernel_debug_config.h (the KDEBUG_LEVEL definition). I probably wouldn't call the effect on the build time dramatic, but it's definitely significant. The effect on the kernel time alone is even more significant.

comment:27 by ttcoder, 12 years ago

Blocking: 7285, 7882 added

comment:28 by ttcoder, 12 years ago

Blocking: 8136 added

(In #8136) You probably have a single-CPU machine, like I have?

If yes, then duplicate of #7882 in symptoms and #8007 for root cause..

comment:29 by ttcoder, 12 years ago

@bonefish To keep the ball rolling on this I'm willing to work on a wannabe patch.. It won't be the Real Thing of course since I don't know what I'm doing in kernel-land, but it could end up being better than nothing for making the actual patch happen by someone who does ;-)

I've found private/shared/lock.h and private/kernel/lock.h ; the latter seems to be the right one. And struct mutex_waiter is defined here.

I'll have to re-read this thread several more times, as I'm still not sure if the patch needs to modify thread.cpp too, or just lock.cpp? I mean, in order to implement "priority boosting" -- the strategy which is non-trivial but has the best potential to solve this ticket -- as opposed to "insert waiting threads in the queue according to their priority", which is easier to implement but not the best strategy. (Did I get this right?)

Hmmm maybe I should wait until the Svn/Git transition is finished before posting the links supporting this discussion, source code is annoying to read on the temporary server...

comment:30 by bonefish, 12 years ago

Blocking: 7285 removed

I don't see how #7285 is related. AFAICT it's about problems in the chain of media nodes involved and issues in the media kit. The other two tickets may or may not be related. They could as well just be caused by latencies not being guessed/computed generously enough. Without analyzing the scheduling behavior that's impossible to say.

@ttcoder: You found the right lock.h/cpp. If you're referring to the priority inheritance approach, there are a few things that need to be modified: In the Thread structure the contended locks the thread holds need to be tracked, and accordingly the lock structures need to be adjusted. Obviously the locking functions themselves need to be modified. Moreover, changes in the schedulers are needed as well. I haven't really thought about whether another priority field is needed in the Thread structure, or anything else.

I don't want to discourage you working on that, but please note that there's no documentation (other than what you find in the sources) how things work together and I don't really have time and motivation for introductory explanations (I suspect the same holds for others). If you have questions about particulars, feel free to ask.

in reply to:  17 ; comment:31 by leavengood, 12 years ago

Replying to ttcoder:

In parallel I'm digging a little to find what kind of system resources (ports..) are involved in this bug and thus could be the culprit. I've been thinking of StyledEdit's BTextView (which is more in my area of "expertise")... The hammering could be caused by the long lines "Soft Wrap'ping" calculations. Except the bug does not occur on the initial file opening with default window size, only when resizing the window (and soft wrapping) to a larger size... Note to self: try to uncheck the "wrap long lines" menu item and see if StyledEdit still freezes the system.

I'm sure it is the Soft Wrapping which is the main cause of this problem from the StyledEdit side. As you probably know the whole layout needs to be recalculated whenever the window is resized, and even if the length of each line was cached (which might still be worthwhile) each line still needs to be looked at to find a space to wrap on. And since each line affects the one below it, all lines need to be considered when the window is resized.

While again the core problem in this issue should be fixed, I would think it should be possible to optimize the StyledEdit resizing somewhat. For example if only the height changes, there is no need to recalculate the layout, just show more text in the view. And in the case where the width changes, to update what is shown it only really needs to recalculate the layout from the start of the document to the view position (though I guess the total height needs to be known to properly size the scroll thumb.) But the whole thing should probably be done in a lower priority background thread either way.

And while it will consume memory (and even more for a large document), caching the line lengths would certainly save a lot of messaging with app_server on resize. Of course then there are the line heights. But if the document isn't styled it should be possible to just cache the line height of the font being used and use that for all lines.

Anyhow these might be some nice improvements, since I think even for small files the StyledEdit resizing isn't a great performer. But in general for large file editing the editor needs to be specifically coded for that use case: to not load the whole file in memory, not wrap lines, etc.
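Something along these lines (illustrative only; BTextView has no such class and the names are invented):

#include <Font.h>
#include <String.h>
#include <map>

class LineWidthCache {
public:
    float WidthFor(const BFont& font, const BString& line)
    {
        std::map<BString, float>::iterator it = fWidths.find(line);
        if (it != fWidths.end())
            return it->second;  // cache hit: no app_server round-trip

        float width = font.StringWidth(line.String());  // one round-trip
        fWidths[line] = width;
        return width;
    }

    void Invalidate() { fWidths.clear(); }  // call on font or style changes

private:
    std::map<BString, float> fWidths;
};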

in reply to:  31 comment:32 by anevilyak, 12 years ago

Replying to leavengood:

While again the core problem in this issue should be fixed, I would think it should be possible to optimize the StyledEdit resizing somewhat. For example if only the height changes, there is no need to recalculate the layout, just show more text in the view. And in the case where the width changes, to update what is shown it only really needs to recalculate the layout from the start of the document to the view position (though I guess the total height needs to be known to properly size the scroll thumb.) But the whole thing should probably be done in a lower priority background thread either way.

FWIW, none of this is StyledEdit-specific. That app is more or less a thin wrapper around BTextView, which is where all of the aforementioned optimization issues lie.

comment:33 by ttcoder, 12 years ago

Regarding optimizing the BTextView inside StyledEdit,

The curious thing is, the layout is quite fast in the "standard window width" case: the first time you open the big textfile, it opens very quickly. If there is a way to adapt the piece of code handling standard width to the other case (or, if the same code handles both cases, find out why it over-performs so much in the initial case, compared to its poor performance in the resized-window case) then that would be promising: one could optimize the general resizing without embarking into big class architectural changes.

On the priority inversion front,

This is yet another case where I decided to start something and didn't go far -- never went beyond downloading the sources and marking with a /*XXXX*/ some of the functions that I understood were relevant from bonefish's pointers. I got myself a Core 2 duo 'puter now, and understand what all the rage is about in Haiku circles ;-) -- this thing indeed feels fast; got all the responsiveness I had in BeOS in the olden days and more!

And TuneTracker-Haiku will ship on dual-CPU systems, or so I hear from Dane, so little incentive there either to make it single-CPU friendly. I feel for those using Haiku on older machines though. I guess there's less and less of those anyway. [@Pete: hope you can find a used Core duo whose owner will let you boot a Haiku CD to check compatibility before selling it to you, that's how I got myself a cheap speed demon..]


comment:34 by jua, 12 years ago

Attaching a patch which implements:

  • Sorting mutex waiter lists by priority
  • Priority inheritance for kernel mutexes

Solves the problem on my netbook. On my desktop there is still slight freezing but that seems to be another issue unrelated to this ticket.

StyledEdit itself still needs to be updated too: by doing lots of work in a GUI thread it of course still freezes on large textfiles for a while and causes the rest of the system GUI to slow down (Tracker/Deskbar refresh very slowly, but the mouse and windows can still be moved around).

This is the second, fixed version of the patch, different from the one I originally posted on the mailing list. Had some slight problems with git; hopefully I didn't mess up the patch (but it seems to me everything is in there that should be in it)...

by jua, 12 years ago

Patch implementing waiters priority sorting and mutex priority inheritance

comment:35 by jua, 12 years ago

patch: 0 → 1

comment:36 by SeanCollins, 12 years ago

I get build errors on gcc2, with the latest nightly 43812. Patch applied.

C++ /boot/home/haiku/generated.x86gcc4/objects/haiku/x86/release/system/kernel/lock.o 
/boot/home/haiku/src/system/kernel/locks/lock.cpp: In function 'void _mutex_transfer_lock(mutex*, thread_id)':
/boot/home/haiku/src/system/kernel/locks/lock.cpp:1039:51: error: 'void _mutex_transfer_lock(mutex*, thread_id)' was declared 'extern' and later 'static' [-fpermissive]
/boot/home/haiku/headers/private/kernel/lock.h:157:13: error: previous declaration of 'void _mutex_transfer_lock(mutex*, thread_id)' [-fpermissive]

comment:37 by jua, 12 years ago

I think you mean gcc4, because it compiles fine on gcc2 here, but I get the same error as you on gcc4 (which I hadn't tried out before, oops).

I removed the erroneous static qualifier from the function definition in lock.cpp and verified that it works on gcc2 and gcc4. Weird that gcc2 didn't issue a warning...

Attaching version 3 of the patch, which now also compiles with gcc4.

comment:38 by jua, 12 years ago

patch: 1 → 0

by jua, 12 years ago

Version 3 of patch implementing priority inheritance for mutexes and priority sorting for mutex waiters

comment:39 by jua, 12 years ago

patch: 0 → 1

comment:40 by SeanCollins, 12 years ago

This patch works very well, until you get a high-use thread with a priority greater than 15, and then things get kind of ugly. Instead of just sporadic UI blocking, you get very consistent UI blocking. It is certainly a trade-off of sorts.

Maybe a preempt scheme every so many msecs for threads over X priority might be a good thing? Or it could be awful. Maybe forcing the scheduler to group threads might make more sense?

comment:41 by tqh, 12 years ago

What happens if you have an event that is signalled very seldom, let's say once a year? The thread that holds the mutex during this time could run boosted for a full year, which doesn't seem fair to me.

Edit: Ooops. Axel already suggested this: Perhaps just delegating one scheduling quota from a thread that gets blocked to the blocking thread would be better?


comment:42 by jua, 12 years ago

Quoting (slightly changed) what I wrote on the mailing list:

[priority boosting] I think one also has to see the other side: it already happens implicitly. When a low-priority thread holds a mutex needed by a high-priority thread it effectively blocks the high-priority thread from working... all it can do is stall and waste CPU cycles. Even though there is no explicit priority change, it happens implicitly and causes freezing.

So in the end, even if the boosting goes on for a year, it would be justified IMO.

(btw, would kernel mutexes really be used to signal rare events anyway, instead of, say, semaphores?)

in reply to:  42 comment:43 by tqh, 12 years ago

Replying to jua:

(btw, would kernel mutexes really be used to signal rare events anyway, instead of, say, semaphores?)

Hopefully not, but any lengthy work in a mutex would do for that example. I don't have a good answer what should be done. Just wanted to hear what the opinions are.

comment:44 by jua, 12 years ago

Another small update to the patch, two changes:

  • Use scheduler's priority change function in unboost, instead of changing it directly.
  • Clear boost flag when setting priority in the scheduler... so if a thread's priority gets changed while it is boosted, it will not get unboosted to its old priority later.

comment:45 by jua, 12 years ago

patch: 1 → 0

by jua, 12 years ago

Version 4 of patch implementing priority inheritance for mutexes and priority sorting for mutex waiters

comment:46 by jua, 12 years ago

patch: 0 → 1

comment:47 by SeanCollins, 12 years ago

The more I think about this, the more I think boosting/unboosting are really overly complex solutions to a potentially simple problem. Why not create an array with thread arrival/departure, then order the threads by using a 2-axis matrix, priority vs. arrival; this way the scheduler always balances out. Then use a lambda value of 1, and give the scheduler a range of selection adjustment between .9 and 1.1 to allow for the changes in processing.

If you think this is a valid idea, I can work out the algorithm to implement it. It'd take me a few days, but it would work with the Haiku priority model, while hopefully reducing complexity and increasing reliability.

in reply to:  42 ; comment:48 by axeld, 12 years ago

Replying to jua:

(btw, would kernel mutexes really be used to signal rare events anyway, instead of, say, semaphores?)

Not sure what you mean by signaling, but this sounds like a misconception about what a mutex is. Unlike a semaphore, a mutex is a simple lock. There is nothing to signal, it's only used to protect critical sections, ie. to make sure only one thread runs at a time with that particular lock held.

For signaling, condition variables can be used in the kernel, and even though they are connected to a mutex, you don't hold the lock while you're waiting for the condition to arise.

Anyway, I had a look at the patch, and apart from like hundreds of coding style violations, I'm not sure the way it's implemented is how it should look; while it should be rare or never happen that a thread acquires 32 different mutexes, imposing such a limit (with no noticeable overflow handling) is troublesome. An alternative would be to use the same approach as with the waiters list - that would also speed up the insert, at least. I would have hoped that bonefish would comment on the patch, but apparently he hasn't found the time to do this yet.

A few words on the coding style issues, and a bit more:

  • two blank lines between functions.
  • we use camelCaseForVariableNames.
  • '{' goes to the same line as if/while/etc.
  • You mix asterisk style.
  • In general, having the code look like the one that is already there is a good thing to try.
  • Things like ASSERT(thread) make no sense when the function would immediately crash in that case anyway (besides, we would write ASSERT(thread != NULL)).
  • Same for asserting that thread_get_current_thread() didn't return NULL. It can't.
  • Comments like "Add the lock we just acquired to the array of held locks for this thread." right before you call add_lock_to_held_locks() are superfluous.
  • Instead, important comments like "guarded by the scheduler lock" are missing.
  • You sometimes mix KDEBUG code with non-KDEBUG code (like _mutex_transfer_lock()).
  • There is no reason for the static inline mutex_transfer_lock() anymore, if all it does is calling _mutex_transfer_lock().
  • mutex::holder becomes superfluous as well, AFAICT.

in reply to:  48 ; comment:49 by jua, 12 years ago

Replying to axeld:

Anyway, I had a look at the patch

Thanks a lot for taking the time!

apart from like hundreds of coding style violations

Hmh, I'm surprised how many style violations I still made even after trying to ensure I didn't. Looks like the "regular" coding style I've used for such a long time still crept through, which explains the mix of styles... sorry. I will fix that, thanks for pointing out the details.

Not sure what you mean by signaling, but this sounds like a misconception about what a mutex is.

I'm well aware what a mutex is, however, even a mutex can be (mis)used to "signal" another thread (i.e. by releasing it when knowing the other thread is waiting for it). I probably misunderstood tqh's comment, it sounded to me as if mutexes in the kernel would possibly be used in that way; it's good that they are not.

while it should be rare or never happen that a thread acquires 32 different mutexes, imposing such a limit (with no noticeable overflow handling) is troublesome

This limitation was born out of necessity. I first wanted to implement it by using a list, like you propose, but then quickly ran into a simple problem: I could not allocate (heap) memory for new list items from within mutex_lock. Maybe I overlooked something, but all flavours of malloc I found need to lock a mutex themselves... and then mutex_lock calls malloc which calls mutex_lock which calls malloc etc... (and before it goes towards infinity, it quickly crashes with a double lock acquire). It would need a special malloc that can work without a normal kernel mutex. If there is anything like that, I didn't find it. And without using malloc, a fixed size array seemed to be the only sane option.

About overflow handling: it is there, however possibly too simple (?). If more than 32 mutexes are acquired, they simply don't get added to the array anymore. I.e. for priority boosting they will be ignored. It will not crash or fail in any way, the additional mutexes will simply not be regarded when selecting a potential new boost priority. However, this of course imposes the restriction that the array of held locks is not used for anything else in the kernel that might require a complete list.

in reply to:  49 ; comment:50 by bonefish, 12 years ago

Replying to jua:

Replying to axeld:

while it should be rare or never happen that a thread acquires 32 different mutexes, imposing such a limit (with no noticeable overflow handling) is troublesome

This limitation was born out of necessity. I first wanted to implement it by using a list, like you propose, but then quickly ran into a simple problem: I could not allocate (heap) memory for new list items from within mutex_lock. Maybe I overlooked something, but all flavours of malloc I found need to lock a mutex themselves... and then mutex_lock calls malloc which calls mutex_lock which calls malloc etc... (and before it goes towards infinity, it quickly crashes with a double lock acquire). It would need a special malloc that can work without a normal kernel mutex. If there is anything like that, I didn't find it. And without using malloc, a fixed size array seemed to be the only sane option.

I haven't looked at the code yet. However, as far as I can tell from the comments, I suppose a simple alternative here would be to add a list link to the mutex structure.
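I.e. something like this (a sketch with hypothetical field names; the real mutex structure differs):

struct mutex {
    const char*    name;
    mutex_waiter*  waiters;
    // ... existing fields ...
    mutex*         next_held;  // hypothetical: intrusive link chaining the
                               // contended locks the holder currently owns
};

// The contended acquisition path can then do its bookkeeping with no
// allocation at all:
//     lock->next_held = thread->held_locks;
//     thread->held_locks = lock;
// and release unlinks the mutex and recomputes the holder's priority from
// whatever remains in the chain -- no malloc() anywhere on the path.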

in reply to:  50 ; comment:51 by jua, 12 years ago

Replying to bonefish:

I haven't looked at the code yet. However, as far as I can tell from the comments, I suppose a simple alternative here would be to add a list link to the mutex structure.

Oh yes, great idea, that would work :)

So... the question is now: is there interest from the kernel devs in this patch at all? If yes, I'd put some more work into it to fix the style issues, change the implementation and fix a bug that I've discovered recently. However, if the consensus is that the problem in this ticket should be solved in a completely different way anyway, then it doesn't make sense to put more time into this patch. Any opinions?

in reply to:  51 comment:52 by Pete, 12 years ago

Replying to jua:

So... the question is now: is there interest from the kernel devs in this patch at all? If yes, I'd put some more work into it to fix the style issues, change the implementation and fix a bug that I've discovered recently. However, if the consensus is that the problem in this ticket should be solved in a completely different way anyway, then it doesn't make sense to put more time into this patch. Any opinions?

I'm not a kernel dev, nor have I had a chance yet to try out the patch (aside from my abortive quickie attempt) but from what I can see this looks like essentially the correct approach -- with maybe the switch to a linked list to remove the 32 limit -- so I would not lose interest! I for one am desperately awaiting a solution, as I think it will fix the general 'blackout' problem I experience with audio, as well as actually making Web+ usable!

in reply to:  51 comment:53 by axeld, 12 years ago

Replying to jua:

Replying to bonefish:

I haven't looked at the code yet. However, as far as I can tell from the comments, I suppose a simple alternative here would be to add a list link to the mutex structure.

Oh yes, great idea, that would work :)

That was what I was thinking of. But since your code handles overflows well anyway (I missed that before), making the array smaller to fit into a cache line might be another approach. I don't know which would turn out to be better, but I think it should be rare to hold more than 4 mutexes at a time.

So... the question is now: is there interest from the kernel devs in this patch at all?

I'm not much of a kernel dev these days, and I can't really comment on the way you tackled the problem (ie. the scheduler interaction) without reading the code in question again, but priority inversion is a real problem, and solving it would definitely be a good idea, so I'd say go for it :-)

comment:54 by Pete, 12 years ago

I'm afraid it's not really working for me. I finally got to do a complete system build with the patch (number 4) applied, and have it as an alternate 'system' folder.

When I first boot up with it, I notice improvement in things like the StyledEdit problem that started this ticket. I can resize the window at will without locking the system up. (The contents of the window still take a very long time to get redrawn, but the cursor remains responsive.) However, at some point something happens -- not quite sure what... I opened a terminal window to run 'top' amongst other things -- and the system locks solid again. Not sure if it would have freed up eventually; I gave up and rebooted.

Similarly, Web+ initially doesn't seem to freeze the cursor, but eventually it goes back to its old ways. Maybe rather worse than with the standard build -- not sure. (It looks as if the freezes are associated with W+'s "offscreen" window, which seemed to show activity peaks in top's display at the same time. Bezilla typically does not freeze, though there is slight hesitation in the cursor when it is doing heavy window drawing.)

Finally, I didn't find the hoped-for improvement in audio glitching. Moving a window still causes just as many 'crackles' as before, despite the supposed higher priority of audio. (This is again real-time audio. The media player with its large buffer doesn't normally glitch.)

in reply to:  54 comment:55 by Pete, 12 years ago

Replying to Pete:

I've done a little more investigation, and I probably shouldn't have been quite so pessimistic. In particular:

Finally, I didn't find the hoped-for improvement in audio glitching. Moving a window still causes just as many 'crackles' as before, despite the supposed higher priority of audio.

I was using Csound for this, which was not the best choice. I have to use a very small audio buffer (64 bytes) to keep the latency low. (Not the audio-chain latency -- this is Csound's fault, in the way it generates the audio.) My MusicWeaver/fluidsynth stuff can use a more sensible buffer size (960 bytes), and has much better behaviour. Most of the time...

I find that I can fire up a live music configuration, and play fluidsynth with *no* audible glitches, even when I move windows around. This is vastly different from my experience without the patch, where window movement causes continual crackles.

In this condition, it's the app_server (I guess) that slows down! Only when the screen contains windows belonging to the audio app, though; other workspaces seem normal. Window movement starts lagging behind the cursor. I can understand this if the audio threads are getting preference, except that the total load on the CPU (shown by 'top') doesn't seem excessive -- about 7-10%. And, as I say, it's only when related windows are visible. In any case it's much preferable to ugly audio!

However... things don't always seem to work this way. I seem to still be able to get into a situation where window movement interferes with audio, and it then continues this way, much like without the patch. I haven't been able to determine what causes the switch, but I'm going to run the patched system as standard for a while, to see if I can figure out what's happening.

comment:56 by Pete, 12 years ago

After a little more work, I've tracked down one problem with the current patch: thread priorities can be left permanently boosted!

So far I've been totally unable to replicate the failures I thought I was seeing in audio, but I was finding Web+ unusable. I then noticed that this only happened when I had my fluidsynth stuff running. I did a 'ps -a', and noticed that W+'s main thread had priority 110 ('top' sometimes showed it using 100% CPU...)! And the MusicWeaver's main thread had the same unnatural priority! This would only happen when I started up the synth thread (which does have normal priority 110).

Going back to the normal Haiku build, none of this happens, so I assume the effect is due to a thread getting boosted to overcome a deadlock, but not getting reset when the contention is over.

comment:57 by jua, 12 years ago

After a little more work, I've tracked down one problem with the current patch: thread priorities can be left permanently boosted!

Yep, I'm aware of that, that's the bug I meant when I wrote "find a bug that I've discovered recently"... I will fix that in the next patch version. I hope to find some time to work on it again soon.

comment:58 by jua, 12 years ago

patch: 1 → 0

comment:59 by jua, 12 years ago

After finally finding time again to work on this, I recently mostly rewrote the whole thing taking into account the various comments given here and also made many other enhancements, fixes, etc. But now I'm stuck and unsure if I can finish it.

The problem is the following:

  • To implement the priority inheritance properly while keeping the mutex functions inline, I need access to functions/types from thread.h and thread_types.h within lock.h. This created a cyclic include file dependency. As advised by Ingo on the mailing list, I could resolve the problem by splitting some include files, moving the inline functions to a separate file. Most notably, I split lock.h into lock.h / lock_inline.h and thread_types.h into thread_types.h / thread_types_inline.h. This solved the dependency issue and let me finish the work inside the kernel.
  • The kernel mutexes are used in many places, and in all these places I need to add "#include <lock_inline.h>" now. This mostly works fine, but in a few places it causes big trouble... those places are outside the kernel itself, i.e. in kernel add-ons. Including lock_inline.h leads to the following include-file chain: lock_inline.h -> thread.h -> thread_types.h -> ksignal.h
  • ksignal.h defines struct QueuedSignalsCounter which inherits from class BReferenceable. The fact that the latter is a class is the problem: some kernel add-ons (such as the fat file system add-on or locked_pool) are C and fail to compile when they see the class definition. Trying to tell gcc to just compile them as C++ leads to various ANSI C++ violations, and even setting -fpermissive to demote those errors to warnings leaves linker problems with things defined as extern "C".
  • So in the end, my changes force everything that uses kernel mutexes to be compiled as C++ because thread.h requires that. Several add-ons don't compile as C++.

What could the solutions be? I currently see:

  • Change add-ons to be C++-compilable. I'm afraid I couldn't do that, or at least that would take me a long time because I don't want to do changes without really understanding the sources I edit, and it might be many add-ons... Also, maybe there are add-ons where this is not possible at all?
  • Make mutex functions non-inline. This would have a performance impact on the whole system.

The priority inheritance simply creates a closer tie between mutexes and threads, and thus using thread functionality inside the mutex_* inline functions becomes necessary. It's a little sad to have to stop here after putting a lot of work into it, but it doesn't seem to work out without deeper changes in many parts.
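Schematically, the split looks something like this (a rough sketch only -- the real headers contain much more; the thread_get_current_thread() usage is just to show where the cycle used to bite):

/* lock.h -- types and prototypes only; includes nothing thread-related,
   so it no longer takes part in the include cycle. */
typedef struct mutex {
    int state;                        /* simplified */
} mutex;
void mutex_lock_slow(mutex* lock);    /* out of line */

/* lock_inline.h -- only the inline bodies live here; this header may
   include thread.h freely, because thread.h never includes it back. */
#include <lock.h>
#include <thread.h>   /* Thread, thread_get_current_thread(); C++-only */

static inline void mutex_lock(mutex* lock)
{
    Thread* thread = thread_get_current_thread();
    (void)thread;   /* would be used for priority-inheritance bookkeeping */
    if (__sync_lock_test_and_set(&lock->state, 1) != 0)
        mutex_lock_slow(lock);
}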

comment:60 by bonefish, 12 years ago

jua, thanks for continuing to work on this. The C++ dependency is something I didn't think of. So far our general philosophy has been that public kernel API should be C while private kernel API can be C++ and code that wants to use it must live with the fact (usually by switching to C++, too). In case of the locking primitives the intention is, however, to make the API public eventually (maybe not in R1 yet), so introducing C++ dependencies is probably not the best move.

I think the way to go is to make the locking functions out of line after all. That is a bit of a pity, since currently, due to the atomic functions being built-ins (with gcc4 at least), the inline functions can be reduced to just a few machine instructions in the common locking cases (no one else has the lock / no one is waiting on the lock). Well, I guess it can't be helped. It would be great if you tried to keep the code paths for those cases as short as possible anyway.
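For illustration, a minimal sketch of the kind of fast path meant here -- not Haiku's actual mutex code; the names, the state encoding and the slow-path split are all made up:

#include <atomic>
#include <cstdint>

struct my_mutex {
    std::atomic<int32_t> state{0};   // 0 = free, 1 = held, 2 = held + waiters
};

void my_mutex_lock_slow(my_mutex* lock);     // out of line: mark waiter, block
void my_mutex_unlock_slow(my_mutex* lock);   // out of line: wake one waiter

inline void my_mutex_lock(my_mutex* lock)
{
    int32_t expected = 0;
    // Common case (lock is free): a single compare-and-swap, which the
    // compiler's atomic built-ins reduce to a few machine instructions.
    if (!lock->state.compare_exchange_strong(expected, 1,
            std::memory_order_acquire))
        my_mutex_lock_slow(lock);            // contended: take the long path
}

inline void my_mutex_unlock(my_mutex* lock)
{
    int32_t expected = 1;
    // Common case (no waiters): one compare-and-swap releases the lock.
    if (!lock->state.compare_exchange_strong(expected, 0,
            std::memory_order_release))
        my_mutex_unlock_slow(lock);          // waiters exist: wake one
}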

by jua, 11 years ago

Patch version 5

comment:61 by jua, 11 years ago

patch: 0 → 1

comment:62 by jua, 11 years ago

Sooo, after a few months' break, here is a new version of the patch including the discussed changes.

This time I did a bit more testing to evaluate performance. I used SPLASH-2, a pthreads-based suite of concurrency benchmarks, to test how the un-inlining of locking and the additional code affect performance. The suite needs only minimal changes to compile and work on Haiku. Here are some results of original Haiku vs. Haiku with this patch, once with KDEBUG and once without:

Note: all values are time, i.e. lower is better! http://www.orangejua.de/haiku/original_vs_patched_kdebug_v2.png http://www.orangejua.de/haiku/original_vs_patched_v2.png

All in all, the performance hit in these benchmarks is not that big, even without KDEBUG.

And it does pay off... the SPLASH-2 benchmarks also turn out to be a great testcase for the freezing issues described in this ticket, even on multi-processor machines. Simply set them to use the number of CPUs/cores your machine has, and most of them will cause intense freezing on an unpatched Haiku: no more mouse movement, playing video just freezes, etc. With this patch applied, however, there is no freezing anymore. All cores go to 100% CPU usage -- but the cursor still moves and the movie doesn't drop any frames (or at least none that I could see :) -- just as it should be.

Hope I got the coding style right this time!

comment:63 by SeanCollins, 11 years ago

Using all the test cases provided in this ticket to date, as well as my ticket about app_server responsiveness when using media apps and WebPositive, the patch jua has provided seems to have handled those issues well. If there are any pitfalls or obvious regressions I am not observing them.

This patch appears to resolve the issue.

comment:64 by bonefish, 11 years ago

jua, thanks for the new patch. Regarding the performance tests, not knowing what they test exactly I cannot say how meaningful the results are. From a quick look at this paper the focus is definitely not on testing locking primitives. Testing userland locking primitives wouldn't be that interesting anyway, since the syscall overhead outweighs lengthened code paths in the kernel by far.

A helpful performance test would cause a lot of locking in the kernel. Instead of devising a test specifically for that purpose, I'd first try something simple like building a Haiku image with multiple jobs on an SMP machine (ideally 8 jobs on 8 logical CPUs). From my experience this causes quite a bit of lock contention in the VFS and VM subsystems already. It's also a real world test with practical relevance.

comment:65 by jua, 11 years ago

Hey bonefish! Yeah you are right that the SPLASH-2 suite(*) is not meant for testing locking primitives themselves, it's for testing concurrent calculation performance. They however do make extensive use of various locking primitives, although in the version I'm using it's pthreads, not the kernel locks. (If the kernel locks were public, it would actually be possible to use them directly: the modified splash-2 uses generalized macros for synchronization primitives which are replaced by the actual synchronization API calls by an M4 script before compiling. They supply an M4 script for pthreads, but it would be possible to create a new one to make them use the kernel locks directly...)

The reason why I thought the SPLASH-2 benchmarks would still be relevant is that many of them cause intense freezing, and thus I figured there is some connection to the kernel lock contention too, just like when building a Haiku image. I'm now preparing two more images (patched/unpatched) to test compiling Haiku with both... but that will take some hours. I will report back when I have results.

(*) I'm using the modified version found here: http://www.capsl.udel.edu/splash/

comment:66 by jua, 11 years ago

patch: 1 → 0

Ok, here are the results for building Haiku images.

  • The machine has 2 cores; I made test builds with and without "-j2".
  • The first builds I made were accidentally with KDEBUG enabled, so I have results with and without that too...
(1) KDEBUG, jam @anyboot-image

Original            | Patched
--------------------+-------------------
real    95m5.039s   | real    95m47.065s
user    68m27.017s  | user    68m14.016s
sys     19m30.045s  | sys     19m58.322s


(2) KDEBUG, jam -j2 @anyboot-image

Original            | Patched
--------------------+-------------------
real    52m47.789s  | real    53m3.095s
user    69m24.631s  | user    69m9.291s
sys     22m11.243s  | sys     22m36.019s


(3) jam @anyboot-image

Original            | Patched
--------------------+-------------------
real    88m13.260s  | real    88m10.470s
user    67m59.852s  | user    68m0.633s
sys     13m5.890s   | sys     14m7.008s


(4) jam -j2 @anyboot-image

Original            | Patched
--------------------+-------------------
real    47m46.024s  | real    48m51.425s
user    68m50.666s  | user    68m55.215s
sys     13m44.617s  | sys     15m33.951s

Most relevant are cases (3) and (4): the sys time increases ~8% in (3) and ~10% in (4), but the real time looks much less affected. Note that I only ran these builds one time each, so there is certainly an error margin, especially because the build process downloads files... I didn't set up a caching proxy or anything, so internet download speed fluctuation is in there too. That probably decreases the significance of the real time values.

comment:67 by jua, 11 years ago

P.S.: I don't know why it unset "has a patch" with my last posting -- I didn't select that; probably something to do with the new Trac version?

in reply to:  64 comment:68 by SeanCollins, 11 years ago

Replying to bonefish:

jua, thanks for the new patch. Regarding the performance tests, not knowing what they test exactly I cannot say how meaningful the results are. From a quick look at this paper the focus is definitely not on testing locking primitives. Testing userland locking primitives wouldn't be that interesting anyway, since the syscall overhead outweighs lengthened code paths in the kernel by far.

A helpful performance test would cause a lot of locking in the kernel. Instead of devising a test specifically for that purpose, I'd first try something simple like building a Haiku image with multiple jobs on an SMP machine (ideally 8 jobs on 8 logical CPUs). From my experience this causes quite a bit of lock contention in the VFS and VM subsystems already. It's also a real world test with practical relevance.

I got a sys time of 15 min out of a total (real) time of around 50 min to build Haiku. The disparity is much smaller on a Linux system; I just wouldn't expect such a drastic difference in compile time. Maybe this is all pointing to the same symptom.

As noted above, this time has not changed in any significant way for me, so the patch does not appear to impact performance in a statistically meaningful way.

in reply to:  66 comment:69 by bonefish, 11 years ago

Replying to jua: [...]

Most relevant are cases (3) and (4): the sys time increases ~8% in (3) and ~10% in (4), but the real time looks much less affected. Note that I only ran these builds one time each, so there is certainly an error margin, especially because the build process downloads files... I didn't set up a caching proxy or anything, so internet download speed fluctuation is in there too. That probably decreases the significance of the real time values.

The test procedure I used to use when testing optimizations is the following:

  1. Build an image.
  2. Remove the image and the objects, but keep the downloads, so they won't be downloaded again during the test. I don't think that the download part is significant for the test. I think it mostly adds useless wait times.
  3. Reboot.
  4. First test (cold): Build the image.
  5. Remove image and objects.
  6. Second test (warm): Build the image.

The second test is the more interesting one, since it reduces I/O wait times, thus increasing the pressure on the relevant code paths. A short version of the test could be to just build the kernel, not the complete image.

I'd like to test on my Core i7 with 8 jobs -- which should be even more interesting -- but I don't know when I'll get to do it.

in reply to:  69 comment:70 by anevilyak, 11 years ago

Replying to bonefish:

I'd like to test on my Core i7 with 8 jobs -- which should be even more interesting -- but I don't know when I'll get to do it.

I just attempted to do so on my i7 here; however, with the patch applied I reliably get a panic very early in the boot process. The situation appears to be that the main thread makes a call into the device manager, which in turn attempts to enlarge the locked pool. However, this panics with the claim that the main thread is attempting to release the locked_pool mutex which is currently held by thread 204, which doesn't exist. Will see if I can spot anything obviously wrong in the patch, as this is 100% reproducible here.

comment:71 by SeanCollins, 11 years ago

Can you try disabling hyperthreading? I've seen similar panics with it on; maybe the patch is uncovering a similar issue. I haven't had a single panic with it and I am going on almost a week of solid uptime and heavy machine use.

comment:72 by anevilyak, 11 years ago

False alarm, needed a full rebuild rather than just the kernel. In any case, with kdebug level 2 there's pretty much no statistically measurable performance difference for a jam -qj8 kernel with or without the patch. Running a second set of tests with kdebug 0 now.

comment:73 by anevilyak, 11 years ago

Without patch, jam -qj8 kernel, average across 5 runs, kdebug 0, hrev44691:

real: 0m17.702s
user: 01m44.679s
sys: 0m8.755s

With patch:

real: 0m18.195s
user: 01m44.860s
sys: 0m10.988s

comment:74 by bonefish, 11 years ago

Rene, thanks for testing! The real time increase is fine IMO. I'm a bit worried about the kernel time, though: a 25% increase is rather a lot.

Rene, could you get a measurement on FreeBSD/Linux for comparison? I'd love to see where we stand particularly with respect to kernel time.

comment:75 by anevilyak, 11 years ago

Certainly, FreeBSD 9.0-RELEASE-P0 averaged across 5 runs for the same test (gcc4 cross compiler, as all tests on Haiku itself were done with 4.6.2):

real: 0m10.095s
user: 0m57.407s
sys: 0m5.720s

comment:76 by jua, 11 years ago

So the question is (once again): is the priority inheritance the way to go (accepting the overhead) or is it better to scrap this patch altogether and try to solve the problem at the points where it occurs, i.e. identify the locks and try to reduce their usage? (maybe the patch code can be further optimized to reduce the overhead, but some of it will always be there)

As one of the Be Newsletters indicates, original BeOS didn't have priority inheritance either -- and it worked without freezing after all.

in reply to:  76 comment:77 by bonefish, 11 years ago

Replying to jua:

So the question is (once again): is the priority inheritance the way to go (accepting the overhead) or is it better to scrap this patch altogether and try to solve the problem at the points where it occurs, i.e. identify the locks and try to reduce their usage?

IMO this isn't an "either... or...". Reducing lock contention is a good thing in any case. I guess that priority inversion cases will remain whatever we do, though. Hence I think priority inheritance is needed anyway.

(maybe the patch code can be further optimized to reduce the overhead, but some of it will always be there)

Sure, some overhead will remain. We should definitely look for possible optimizations, though. Comparing the absolute kernel time the patch adds with FreeBSD's total kernel time suggests that there must be potential for optimization.

As one of the Be Newsletters indicates, original BeOS didn't have priority inheritance either -- and it worked without freezing after all.

I believe BeOS used a lot more spinlocks in the kernel. AFAIK the ports mutex -- which is the culprit in this case -- was a spinlock in BeOS. Spinlocks don't suffer from the priority inversion issue, since they require interrupts to be disabled (which essentially boosts the thread's priority to infinity). The downsides are that they increase latencies of threads with higher priority (also those that don't contend for the same lock) and that they cannot be used in all cases (when there's a possible wait while holding the lock).
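For illustration, a minimal uniprocessor-style sketch of why a spinlock sidesteps the inversion (the interrupt primitives are assumed stand-ins; this is not Haiku's real spinlock code):

#include <atomic>

struct my_spinlock {
    std::atomic_flag locked = ATOMIC_FLAG_INIT;
};

using cpu_status = int;
cpu_status disable_interrupts();           // assumed platform primitives,
void restore_interrupts(cpu_status state); // declared here for the sketch

cpu_status my_acquire_spinlock(my_spinlock* lock)
{
    // Disabling interrupts prevents preemption: the holder always runs
    // until release, so a low-priority holder can never be scheduled away
    // while a higher-priority thread waits on the lock.
    cpu_status state = disable_interrupts();
    while (lock->locked.test_and_set(std::memory_order_acquire)) {
        // busy-wait; on SMP another CPU holds the lock only very briefly
    }
    return state;
}

void my_release_spinlock(my_spinlock* lock, cpu_status state)
{
    lock->locked.clear(std::memory_order_release);
    restore_interrupts(state);
}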

jua, I'll try to find some time to review your patch thoroughly, but I can't make any promises when that will be.

comment:78 by umccullough, 11 years ago

patch: 0 → 1

reset "patch" flag that got turned off by accident

comment:79 by Premislaus, 11 years ago

Blocking: 9028 added

comment:80 by SeanCollins, 11 years ago

Ok, so I have had a few weeks to use this patch, and fundamentally it does correct many of the issues observed with priority inversion. However, it seems to have a fairly nasty impact on network performance, and the network interface seems to have a fairly nasty impact on system performance. I/O tasks also seem to harm system responsiveness more heavily with this patch. I did a lot of testing and observation; those issues may be unrelated to this patch in totality, and may point to other bottlenecks or inefficiencies elsewhere in the core of the OS, but they do require mention IMHO.

comment:81 by pulkomandy, 11 years ago

About the inlining stuff, what if we used the inlined version when compiling as C++ and the out-of-line version when compiling as C? That could solve the performance problem in some cases, while still allowing use of the API from C for public exposure.
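A sketch of what that header could look like (hypothetical -- this is not the actual lock.h; the gcc atomic built-in stands in for whatever the fast path really uses):

/* lock.h -- hypothetical sketch of this idea */

typedef struct mutex {
    volatile int state;              /* simplified: 0 = free, 1 = held */
} mutex;

void _mutex_lock_slow(mutex* lock);  /* always out of line, in the kernel */

#ifdef __cplusplus
/* C++ translation units get the inline fast path; it is free to pull in
   C++-only headers (thread_types.h etc.) for priority inheritance. */
static inline void mutex_lock(mutex* lock)
{
    if (__sync_lock_test_and_set(&lock->state, 1) != 0)
        _mutex_lock_slow(lock);      /* contended: slow path */
}
#else
/* C translation units (e.g. a C-only file system add-on) see only a
   declaration and link against an out-of-line copy built as C++. */
void mutex_lock(mutex* lock);
#endif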

comment:82 by Premislaus, 11 years ago

Any news? This is an important ticket.

comment:83 by diver, 11 years ago

Blocking: 9028 removed

comment:84 by jua, 11 years ago

Recently I have found some time to look into this problem again and I have a few new findings and a new possible/partial solution.

As seen from the earlier scheduler recordings, the issue mostly revolves around the ports list lock. It's a single mutex which needs to be briefly acquired every time a port is read, and as such there is a fair amount of contention on it, since all user interface interaction involves ports in one way or another.

While reading a great book on concurrent programming a while ago, I stumbled upon its descriptions of lock-free concurrent data structures, including a lock-free hash set. Somehow, that made me think about this ticket again... wouldn't it be nice to eliminate the ports list lock entirely?

I've gone and implemented that, i.e. changed the ports list to be a lock-free hash set. The results are fine: there is no overall system performance degradation (not really an increase either, unfortunately) and the "freezing"-issues are mostly gone. By "mostly" I mean that when you e.g. play a music file in MediaPlayer and then do one of the actions which can reproduce the freezing... in the "normal" Haiku version (with ports list lock) everything grinds to a halt and the music stops playing entirely (seems to be an unrelated MediaPlayer or MediaKit bug, when the buffer lateness gets too high, it just freezes and won't even quit anymore). With the lock-free patch, there is a short "click" or two in the playback, but otherwise it keeps going. Why that still happens is another question that needs investigation... but whatever causes it, in the patched version it's definitely not the ports list lock! :)
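To make the idea concrete, here is a heavily reduced sketch of the lock-free lookup and insert paths -- this is not the actual patch: removal and the safe memory reclamation the real version needs are omitted, and all names are invented:

#include <atomic>
#include <cstdint>

typedef int32_t port_id;            // Haiku's port_id is a 32-bit ID

struct PortNode {
    port_id   id;
    PortNode* next;                 // immutable once the node is published
};

static const uint32_t kBucketCount = 256;
static std::atomic<PortNode*> sBuckets[kBucketCount];

PortNode* lookup_port(port_id id)
{
    // Readers take no lock at all: the acquire load pairs with the
    // release CAS in insert_port(), making the node's fields visible.
    PortNode* node = sBuckets[uint32_t(id) % kBucketCount]
        .load(std::memory_order_acquire);
    while (node != nullptr && node->id != id)
        node = node->next;
    return node;
}

void insert_port(PortNode* node)
{
    std::atomic<PortNode*>& bucket
        = sBuckets[uint32_t(node->id) % kBucketCount];
    PortNode* head = bucket.load(std::memory_order_relaxed);
    do {
        node->next = head;          // link in front of the current head
    } while (!bucket.compare_exchange_weak(head, node,
            std::memory_order_release, std::memory_order_relaxed));
}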

Then there was another thing I noticed while doing testing and some benchmarks of the unpatched/patched versions (only using the lock-free patch on an otherwise unmodified Haiku; none of the previous prio-inheritance patches attached to this ticket)... on a freshly installed system, I couldn't reproduce the heavy freezing problems in an unpatched vanilla Haiku. Only the light "maybe a short hiccup" kind of problem, not the heavy one where everything can stop for seconds. That was weird -- had the problem suddenly solved itself? Since my main Haiku installation has the freezing problems (and is roughly the same hrev as the testing installation), I started to copy configuration files from /boot/home/config from the main installation to the testing installation... and at some point, I could reproduce the heavy freezing.

Turns out the heavy freezing is caused by... Deskbar. Yup, I was surprised as well :) To be precise, it is the Deskbar Expander/Auto-Expander feature. When the Deskbar is in a screen corner and Auto-Expander is on (which is my preferred configuration but not the Haiku default), it creates heavy traffic on the ports list mutex, causing the heavy freezing. Turn Expanders off (or even just collapse all expanded entries by hand)... and the "heavy" freezing is gone and only slight freezing happens. I had a brief look at the Deskbar source code and the problem comes from the "monitor_team_windows()" thread in ExpandoMenuBar.cpp. With expanded items, it calls into the app_server to get lists of windows, and that seems to create quite a number of port list mutex accesses per second.

As a simple workaround/relief to those who have the problem heavily, I can recommend turning off the expanders, that should make it better already. Maybe the problem can be solved within Deskbar or app_server... but at least as of now I know too little about app_server internals to comment on that.

Last but not least I wonder if the lock-free ports list is interesting for Haiku regardless. Without Deskbar causing havoc, there is little use for it. But if the system should always behave nicely even when a program like Deskbar does that, it could be good to have. If there is interest I will clean up the patch a little and attach it here.

For reference, here are some scheduler recordings from me, they were made when running a little benchmark tool ("FMM" from Splash-2 suite) which is capable of producing the "heavy" kind of freezing:

(1) Vanilla Haiku, Deskbar with Expanders/Auto-Expand enabled http://www.orangejua.de/moo/temp/fmm-expand.sched

--> Heavy traffic on the ports list mutex, heavy freezing.

(2) Vanilla Haiku, Deskbar with Expanders/Auto-Expand disabled: http://www.orangejua.de/moo/temp/fmm-no-expand.sched

--> Small waiting time on the ports list mutex, only little freezing.

(3) Haiku patched for lock-free ports list, Deskbar with Expanders/Auto-Expand enabled: http://www.orangejua.de/moo/temp/fmm-lockfree-portslist-expand.sched

--> Although Expanders are on, only little freezing, ports list mutex doesn't exist anymore.

comment:85 by bonefish, 11 years ago

Since sPortsLock doesn't only protect the ports hash table but also Team::port_list and Port::owner, only using a lock-free data structure won't suffice. Maybe you addressed that as well?

Anyway, regarding optimizing the ports code, the following improvements come to mind:

  • Being a global lock, sPortsLock shouldn't be used as an outer lock. The locking order sPortsLock -> Port::lock should be reversed. That requires making Port a BReferenceable, so get_locked_port() needs to hold sPortsLock only while incrementing the reference. Acquiring the port's lock would be done after already having unlocked sPortsLock. There are more changes necessary all over the ports code to make this work correctly, but it shouldn't be too hard. (See the sketch after this list.)
  • sPortsLock could be made a R/W lock. In the most common use cases (reading from/writing to a port) there aren't any changes done to the data the lock guards.
  • A port-by-name hash table should be added, so that find_port() doesn't have to iterate through all ports. The function is e.g. used whenever a BRoster is created.
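A sketch of the reference-then-lock pattern from the first point (all shapes here are simplified stand-ins, not the real ports code):

struct Port {
    mutex lock;                        // per-port lock (now the inner lock)
    bool  deleted;                     // set once the port has been removed
    void  AcquireReference();          // refcounting as in BReferenceable
    void  ReleaseReference();
};

Port* get_locked_port(port_id id)
{
    mutex_lock(&sPortsLock);           // global lock: held only briefly,
    Port* port = sPortsTable.Lookup(id);   // no other locking inside
    if (port != NULL)
        port->AcquireReference();      // the reference keeps the port alive
    mutex_unlock(&sPortsLock);         // released *before* anything can block

    if (port == NULL)
        return NULL;

    mutex_lock(&port->lock);           // may block, but sPortsLock is free
    if (port->deleted) {               // re-check: port may have died meanwhile
        mutex_unlock(&port->lock);
        port->ReleaseReference();
        return NULL;
    }
    return port;
}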

in reply to:  85 comment:86 by jua, 11 years ago

Replying to bonefish:

Since sPortsLock doesn't only protect the ports hash table but also Team::port_list and Port::owner only using a lock-free data structure won't suffice. Maybe you addressed that as well?

In the lockfree port list version, I left those protected by a mutex currently... so this mutex is still locked on e.g. port creation or deletion. By far most operations on the port list are lookups though, and those need no lock anymore.

That requires making Port a BReferenceable, so get_locked_port() needs to hold sPortsLock only while incrementing the reference.

Heh, in the lockfree version that is already the case :) Since any number of threads can concurrently lookup/insert/delete on it, and when removing an element you can never know who still has a reference in their hands, the ports had to become refcounted (using KernelReferenceable though, to be safe -- would BReferenceable be enough?). So in parts I can reuse that...

The other suggestions I will look into. Would be interesting to compare a version with RW-locks versus the lockfree one (although lock-free has no lock wait times, it comes with the price of a higher overhead per operation, i.e. it is only an advantage when concurrency is high).

comment:87 by jua, 11 years ago

Some status update...

  • I created another branch in which I'm experimenting with the RW-lock guarding. At first it was not better than an unmodified Haiku; it still had the heavy freezing. I suspected classic writer starvation and wanted to see where those writers came from. When I added some tracing dprintf()s I learned another thing...
  • ... In my previous comment I wrote that I left the team-port-list locked by a regular mutex even in the lockfree-hash branch -- I had assumed that accessing it would be rare... oh how wrong I was!
  • Was surprised to find that just moving the mouse cursor causes torrents of calls to set_port_owner(), constantly changing the ownership of a port back and forth between the registrar and whatever application is responsible for the area below the mouse pointer. So far I know only a little about app_server/registrar internals, so I just assume that this ownership ping-pong is intentional and not a bug? Anyway, set_port_owner() is not as lightweight as the function name makes it seem: it has to look up a team ID, look up a port and then atomically move the port from one list to another... and all that protected by a single mutex (and in the RW-lock-branch version it always acquired the lock for writing).
  • So another way to handle the team-port-list was needed. Two solutions came to mind: a) a separate lock per team as part of the team data structure, or b) lock striping. I've gone with b) for now and it works well enough (see the sketch after this list). Instead of a single mutex guarding all team-port-lists there is now an array of 8 mutexes. When code wants to access the list of team x it has to lock mutex number (x % 8). The choice of '8' is somewhat arbitrary, maybe 16 would be better, or maybe not... guess I need to benchmark there.
  • With the lock-striping on the team-port-lists, the situation is much better for the RW-lock version. It now has as little freezing as the lockfree-hash version -- i.e. the freezing is not totally gone, but much better. I've then also integrated the striping into the lock-free branch; there it doesn't seem to make much difference with regard to freezing, though.
  • I've also implemented the ports-by-name hash and am using that in the RW-lock version for find_port() lookups. It doesn't influence the freezing though, because find_port() is rarely called (at least in my current test cases). The remaining suggestion about the locking order is still to be tried out.
  • I'm curious why there is still a little freezing in both branches. These remaining issues are most probably unrelated to the ports list, maybe contention on another kernel structure or something else. This is the next thing I want to find out about.
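A minimal sketch of the lock striping described above (Haiku-style type names; the Team/Port shapes and the list operation are invented):

static const uint32 kNumStripes = 8;        // '8' is arbitrary; to benchmark
static mutex sTeamPortLocks[kNumStripes];

static inline mutex* team_ports_lock(team_id team)
{
    // Two teams only contend if they happen to map to the same stripe.
    return &sTeamPortLocks[uint32(team) % kNumStripes];
}

void add_port_to_team(Team* team, Port* port)
{
    mutex* lock = team_ports_lock(team->id);
    mutex_lock(lock);
    team->portList.Add(port);               // hypothetical list operation
    mutex_unlock(lock);
}

// Note: set_port_owner() moves a port between two teams, so it must lock
// the two stripes in a fixed global order (e.g. lower array index first)
// to keep two concurrent moves from deadlocking.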

in reply to:  87 comment:88 by bonefish, 11 years ago

Replying to jua:

  • Was surprised to find that just moving the mouse cursor causes torrents of calls to set_port_owner(), constantly changing the ownership of a port back and forth between the registrar and whatever application is responsible for the area below the mouse pointer. So far I know only a little about app_server/registrar internals, so I just assume that this ownership ping-pong is intentional and not a bug?

That is due to how synchronous messaging is implemented (cf. the BMessage implementation). A temporary reply port is transferred to the target team before sending the message. This way, if the target team dies or sending the reply fails for some reason, the reply port is deleted and the thread waiting for the reply on that port wakes up. After retrieving the reply the port is reclaimed. So for a successful synchronous message exchange one sees two set_port_owner()s, besides the less problematic two write_port()s, two read_port()s, and two get_port_info()s.
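In code, the exchange described above looks roughly like this (a hedged sketch using only the public port API; the function itself, kRequestCode and the omitted error handling are made up, and the real BMessage code is more involved):

#include <OS.h>   // create_port(), set_port_owner(), write_port(), read_port()

static const int32 kRequestCode = 'rqst';   // made-up message code

status_t send_synchronous_request(port_id targetPort, team_id targetTeam,
    team_id myTeam, const void* request, size_t requestSize,
    void* reply, size_t replySize)
{
    // (Error checking omitted for brevity.)
    port_id replyPort = create_port(1, "tmp reply port");

    // Park the reply port in the target team: if that team dies, the
    // kernel deletes the port and the read_port() below wakes with an error.
    set_port_owner(replyPort, targetTeam);                // set_port_owner #1

    write_port(targetPort, kRequestCode, request, requestSize);  // write #1
    // The target now does read_port() #1, handles the request, and sends
    // the reply with write_port(replyPort, ...) -- write #2:

    int32 code;
    ssize_t bytes = read_port(replyPort, &code, reply, replySize);  // read #2

    set_port_owner(replyPort, myTeam);        // set_port_owner #2: reclaim it
    delete_port(replyPort);
    return bytes < 0 ? (status_t)bytes : B_OK;
}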

TBH, I don't find this strategy particularly elegant and while it works ATM, it will fail once we go multi-user and restrict port operations as part of the security concept. So at some point we'll have to extend the API (or use a completely new mechanism).

It's probably a good idea to have a look what kind of messages the registrar is sending and whether it would be possible to avoid synchronous messaging in this case.

Anyway, set_port_owner() is not as lightweight as the function name makes it seem: it has to look up a team ID, look up a port and then atomically move the port from one list to another... and all that protected by a single mutex (and in the RW-lock-branch version it always acquired the lock for writing).

  • So another way to handle the team-port-list was needed. Two solutions came to mind: a) a separate lock per team as part of the team data structure or b) lock striping. I've gone with b) for now and it works well enough. Instead of a single mutex guarding all team-port-lists there is now an array of 8 mutexes. When code wants to access the list of team x it has to lock mutex number (x % 8). The choice of '8' is somewhat arbitrary, maybe 16 would be better, or maybe not... guess I need to benchmark there.

More is better. :-) Mutexes are really cheap, no need to be skimpy. Alternatively it might be possible to use the Team lock (needs checking) or even introduce a new per-Team mutex for ports (and maybe other resources).

  • I've also implemented the ports-by-name hash and am using that in the RW-lock version for find_port() lookups. It doesn't influence the freezing though, because find_port() is rarely called (at least in my current test cases). The remaining suggestion about the locking order is still to be tried out.

I listed the locking order change first (before switching to a R/W lock), since I think that is a big deal. After that change there should be very little work done while the global lock is being held -- most importantly no other locking! -- so it should be a lot less likely that a thread is preempted while holding the lock. So with that change alone I'd expect the lock contention to drop dramatically.

comment:89 by jua, 11 years ago

Another little update:

  • Searched for the next cause of the remaining freezing and it was the port quota lock; it was acquired as well on every message send and receive. By reordering the control flow a little and using atomic_add() in some places I could reduce its usage: it now only needs to be locked when new areas are added to the port message heap.
  • Freezing was now even rarer and getting harder to reproduce, but it was still possible. Found the next obstacle to be the heap's page lock. I didn't really want to dive into the heap implementation so I wondered whether there's a way to reduce its usage in general. Added some code to gather statistics on message sizes. In some general system usage, it turned out that
    • >50% of messages are 1-64 bytes in size
    • >35% are 65-256 bytes.

So if we stop using the heap for messages <= 256 bytes, it will affect 80-90% of them. I added some code to use slab object caches for these cases instead of the heap allocator. There are two caches: one for messages of 1-64 bytes, another for 65-256 bytes. Everything larger still uses the heap. (A rough sketch follows at the end of this list.)

  • With that, I now have a test version where I can't reproduce freezing at all anymore. I could still reproduce occasional audio clicking under worst conditions, but those happen when it's waiting on some media-kit-internal semaphore and not a kernel lock. The mouse cursor always keeps moving smoothly.
  • Next up: further optimization, e.g. the change in locking order. Maybe it could also be beneficial to not use just one object cache for both "size classes" (1-64, 65-256), but several, and use them round-robin (since the object cache has a lock as well, although it doesn't cause problems in my tests so far).
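As a rough illustration of both changes (not the actual patch: object_cache_alloc() is declared here to mirror Haiku's slab interface as I understand it, and heap_allocate() is a stand-in for the old port heap path):

#include <OS.h>            // int32, status_t, atomic_add(), B_NO_MEMORY

struct object_cache;       // Haiku's slab cache, opaque here
extern "C" void* object_cache_alloc(object_cache* cache, uint32 flags);
void* heap_allocate(size_t size);           // stand-in for the old port heap

static object_cache* sSmallMessageCache;    // 64-byte objects
static object_cache* sMediumMessageCache;   // 256-byte objects

// Pick the allocator by size class: >50% of messages are <= 64 bytes and
// >35% are 65-256 bytes, so 80-90% of allocations bypass the heap lock.
void* alloc_port_message(size_t size)
{
    if (size <= 64)
        return object_cache_alloc(sSmallMessageCache, 0);
    if (size <= 256)
        return object_cache_alloc(sMediumMessageCache, 0);
    return heap_allocate(size);             // large messages: old heap path
}

// Quota accounting without the quota mutex: optimistic atomic_add() with
// rollback; the mutex is then only needed when growing the message heap.
static int32 sUsedQuota;

status_t reserve_message_quota(int32 size, int32 limit)
{
    // atomic_add() returns the previous value, so old + size is the new one.
    if (atomic_add(&sUsedQuota, size) + size > limit) {
        atomic_add(&sUsedQuota, -size);     // over quota: undo and fail
        return B_NO_MEMORY;
    }
    return B_OK;
}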

comment:90 by bonefish, 11 years ago

Hey jua. Good to see you progressing. Regarding the heap the port code uses, it is not *the* kernel heap allocator. It's the old heap allocator code that is no longer used elsewhere. The kernel heap uses the slab allocator. The port heap still uses the old heap allocator, since it supports B_NO_LOCK areas (i.e. swappable areas), which the slab allocator doesn't.

If switching to the slab, it would probably be better to just switch to the actual kernel heap instead of using object caches manually (the kernel heap uses more fine grained size classes).

comment:91 by axeld, 11 years ago

jua, do you happen to have your changes on a github repository? It would be nice to be able to test them, and also to have them posted to the commit list if you like.

comment:92 by jua, 11 years ago

do you happen to have your changes on a github repository?

Not yet, but it's a good idea. Created a github account now, forked the haiku repo and I will start putting the changes into it tonight: https://github.com/orangejua/haiku I'm not sure how the commit list sync is done -- is it something I have to set up in my github account or is it done by mailing list admins?

comment:93 by anevilyak, 11 years ago

Oliver can handle that, will let him know.

comment:94 by jua, 10 years ago

patch: 1 → 0

comment:95 by jua, 10 years ago

The code is now in a state where I consider it "finished" in terms of the things I planned to implement. I will attach two patch files to this ticket:

  • Patch 0001 containing the work towards restructuring the port subsystem locking
  • Patch 0002 with a change to the interface kit / app_server to no longer create those torrents of calls to set_port_owner() on mouse moves.

The patches can be applied independently and should apply cleanly on the current Haiku master branch. Their message headers contain a summary of the changes.

You can also find these changes as 8 individual commits in the "ticket8007"-branch of my github repo (see above)... however, please note that some of these intermediate steps were not entirely correct (the final lock-restructuring commit solved some atomicity violations I had unknowingly introduced in a previous commit).

Would be nice if some people could test these changes, especially those who suffer from these "freezing" problems. Please remember though that these changes have seen only a little testing so far and are thus *experimental* -- be careful, don't use them on your main installation or anything, and don't let them eat your data!

by jua, 10 years ago

Patch 0001: port system lock restructuring

comment:96 by jua, 10 years ago

patch: 0 → 1

by jua, 10 years ago

Move B_MOUSE_IDLE generation to app_server

comment:97 by anevilyak, 10 years ago

Owner: changed from axeld to jua
Status: new → assigned

comment:98 by anevilyak, 10 years ago

They seem to work fine on my system over here at least.

comment:99 by diver, 10 years ago

Seems to work fine here (in vbox) too.

comment:100 by jua, 10 years ago

Thanks for the feedback :)

If no one objects, I'd like to push these two changesets to Haiku master by the end of next week. We could of course delay it if someone still wants to review the code first.

To anyone else seeing the problems discussed herein, maybe you can find the time for testing until then.

comment:101 by beos_zealot, 10 years ago

Today is the 4th day of testing a Haiku hrev46240 build with the 0001 and 0002 patches on real hardware -- I am impressed and pleased with the increased system performance.

I didn't do any special tests, just used Haiku for my spare-time daily routines (compiling, downloading, listening to music, watching movies, etc.).

Before the patch it was impossible to listen to music in MediaPlayer while building Haiku (jam -j2) -- the sound was crackling and breaking up. After the patch, MediaPlayer plays music smoothly during a Haiku compilation (jam -j2) and I can even download torrents at the same time now! (http://www.flickr.com/photos/37064202@N05/10365441333/)

The patchset is obviously recommended and I hope it will be pushed to Haiku master ASAP.

Thanks Julian, your work is appreciated.

Last edited 10 years ago by mmadia

comment:102 by SeanCollins, 10 years ago

Due to this being a behavioral bug, and the fact that these changes need widespread testing, shouldn't these patches be applied before the beta freeze cycle? This problem is a really annoying pest and it makes Haiku review poorly.

Last edited 10 years ago by SeanCollins

comment:103 by diver, 10 years ago

@beos_zealot: Out of curiosity, what font do you use on this screenshot?

comment:104 by beos_zealot, 10 years ago

@diver: it's the Droid Sans family of fonts (DroidSans.ttf, DroidSans-Bold.ttf and DroidSansMono.ttf). So far the best font in combination with subpixel font rendering, to my taste.

comment:105 by jua, 10 years ago

Resolution: fixed
Status: assigned → closed

Since no one vetoed, I have committed the patches now (hrev46290).

That should close this ticket. If anyone still experiences these problems, please reopen.

comment:106 by AlienSoldier, 10 years ago

Hi jua, sorry for the delay in testing this. I still experience tiny sound gaps (currently testing hrev46301). Perhaps once the work on the scheduler is done it will be better.

I can still reproduce it on my Acer Aspire One and also on my main Haiku testing machine (a P4 with hyper-threading, 2.99 GHz). If one of the HT cores is disabled it is a lot easier to do.

The simple way I have to reproduce this is to launch a video and just open the Haiku About window, and it does it. I assume starvation happens, as I see a lot more artifacts when dragging windows than when using R5. It is at least usable right now, but not perfect.

comment:107 by jua, 10 years ago

With recent nightlies, I cannot reproduce it anymore, even on an old machine (from 2006, Athlon64 1.8GHz single core).

So the question is: is it the same problem, or something else entirely? Maybe these small dropouts are simply caused by syslog output to the serial port. Could you try to disable it?

To do that, open the file /boot/home/config/settings/kernel/drivers/kernel. In the first few lines you should see

#serial_debug_output false

Remove the # in front of it, i.e. change the line to

serial_debug_output false

Then reboot. Is it still reproducible then?

If yes: are these small dropouts or larger periods of freezing? Does e.g. the mouse cursor freeze too?

comment:108 by AlienSoldier, 10 years ago

"Then reboot. Is it still reproducible then?" Yes, note that the gap i mention are never much than 1 sec now (still annoying that said).

"Does e.g. the mouse cursor freeze too" Not that i feel (and it should be apparent because i use a fast mouse accelerator setting).

Things that may slow things down: I use USB sticks (for the media played and the booting OS) because I still don't have HDs for that machine, and the laptop HD is a slow one (5200 rpm). Those USB sticks are not super fast, but not slow either. The P4's video card uses a native driver, but because it is a server board it has no AGP, and the on-board video card is PCI-based. The sound card is an SB Live (emuxki). Both keyboard and mouse are PS/2. Just moving a window around fast can send the CPU meter through the roof (I always found that terribly inefficient, especially since it should not require more than moving a pointer and doing some clipping).

Don't get me wrong, it is a LOT better than last time I tested it, but I still get audio gaps. I get fewer on my R5 machine with smaller specs and no HT; the only thing that can get that one to skip sound is SeaMonkey/Firefox when those hammer the CPU or the virtual memory hard.

I did not use VLC (because I have an overlay stretching problem with the video chip on that PC), so I use MediaPlayer for video. I don't have overlay at all in it, just "drawbitmap". I notice that if I lower the color depth from 32-bit to 16-bit I "seem" to get better results with 1 HT core, and it becomes really, really hard to reproduce with 2 HT cores, so it may be related to the app_server and video bandwidth. I plan to test it tonight on my 1080p main monitor and see if it skips more (that PCI card can do 1080p if I keep it in 16-bit). Right now I am testing at 1024x768. The CRT monitor can do 1280x1024 and it seems to create more sound skipping at that resolution.

comment:109 by diver, 10 years ago

Is this with the VESA driver? If not, can you check again in VESA mode?

comment:110 by jua, 10 years ago

Since it's only the audio skipping and the mouse still moves fine, it seems to be a different issue to me, see ticket #7882. I plan to post some instructions on debugging it there as soon as some MediaKit changes are in place.

comment:111 by AlienSoldier, 10 years ago

The VESA test was interesting: it seems a bit faster than the native driver :) -- that was unexpected to me (but I doubt I can do 16:9 with it, and overlay is needed for a large screen, as the zooming is too taxing otherwise). With VESA it is even harder to make it skip with 2 HT cores, but still possible with 1 HT core.

As for #7882, it might very well be something similar; I need to set Firefox and SeaMonkey to low priority on R5 to avoid sound skips (only those 2 apps). "MediaKit fixes which I will commit soon (next 1-2 weeks maybe)" -- talk about fast service! :)

comment:112 by diver, 10 years ago

You haven't mentioned your graphics card/driver ;-)

comment:113 by AlienSoldier, 10 years ago

To give some publicity to one of my other tickets, it is mentioned here: https://dev.haiku-os.org/ticket/9917

comment:114 by diver, 10 years ago

I wonder if it's related to #2769? Could you try this patch: ticket:2769#comment:11?

comment:115 by AlienSoldier, 10 years ago

Made some tests on a hard drive I found (wanted to be sure it was not caused by lack of DMA or something) and on my main PC (a P4 without HT) with an NVIDIA GeForce4 MX 440 AGP 4X 64 MB, to have results from an AGP card. That one is kept at 1800 MHz for longevity reasons (it has not-so-great capacitors).

On the 1080p monitor, as expected (I tested it with my P4 HT also), since I can move larger windows around, I can make the sound skip more easily. HT being enabled still plays a big part in removing the skipping, but with a large enough resolution I can still get it to skip. I did not even try my P3, as it will surely be easier to replicate there.

The problem for me really seems to only appear at 100% CPU usage, so it does not seem related to a lock of any kind anymore (at least from what I see).

I will retry the whole set of tests with the MediaKit changes once I am notified about their integration.
