Opened 6 years ago

Closed 5 years ago

Last modified 5 years ago

#14755 closed bug (fixed)

Haiku Grinds to a Halt when Dribbling

Reported by: AGMS Owned by: leavengood
Priority: normal Milestone: R1/beta2
Component: Servers/media_server Version: R1/Development
Keywords: Cc: agmsmith@…, ttcoder
Blocked By: #4954 Blocking:
Platform: All

Description

Haiku fails after a few days in radio station use. I was able to recreate a similar problem by dribble reading files (filling the cache) while playing sounds and appending to other files. The cache ate up all memory and then all sorts of things started failing. Would be nice if the cache wasn't so aggressive in using all memory.

To recreate it, I was using a 640MB VirtualBox running Haiku R1B1+100, running with virtual memory turned off to save time and keep it simpler. I was playing a sound every 10 seconds (using the "flite" voice synthesis to read a fortune cookie), appending to several files every 10 seconds (see attached program), and reading a file every second from a 50GB collection (using find | xargs md5sum ; sleep 1s).

It seems that the sound playback leaks memory areas and the cache uses up other memory. The combination leads to fragmented memory, which the cache doesn't recognise as being a shortage condition, so it keeps on allocating. Then all sorts of things fail when they are unable to get a block large enough, from forks to disk writes to GUI. Eventually the OS gets stuck (windows frozen).

Ideally, the cache would be using memory areas in an address independent way and you could unmap areas and then remap them in a contiguous address range. Same for the sound buffers. If that's too difficult, the cache low memory detection could also look for fragmentation. Or the media system could stop leaking areas.

The quick fix is to restart the media system, so that it frees its areas. But memory will still be kind of fragmented. Would be nice to similarly restart the cache, or get it to drop all its memory areas somehow.

Attachments (5)

AreaListing.txt (239.1 KB ) - added by AGMS 6 years ago.
Memory areas in use listing.
ForkedErrors.txt (346 bytes ) - added by AGMS 6 years ago.
Forking and other errors when memory is fragmented.
syslog.txt (294.6 KB ) - added by AGMS 6 years ago.
Syslog at the time.
TestFileAppend.cpp (8.2 KB ) - added by AGMS 6 years ago.
TestFileAppend program for appending to a file slowly, then reading it back to see if it got corrupted.
SayFortunes.sh (193 bytes ) - added by AGMS 5 years ago.
Bash script that uses the Flite (pkgman install flite) to read aloud fortunes. Each fortune runs media player to say a generated wave file, testing media player cleanup code.

Download all attachments as: .zip

Change History (30)

by AGMS, 6 years ago

Attachment: AreaListing.txt added

Memory areas in use listing.

by AGMS, 6 years ago

Attachment: ForkedErrors.txt added

Forking and other errors when memory is fragmented.

by AGMS, 6 years ago

Attachment: syslog.txt added

Syslog at the time.

by AGMS, 6 years ago

Attachment: TestFileAppend.cpp added

TestFileAppend program for appending to a file slowly, then reading it back to see if it got corrupted.

comment:1 by waddlesplash, 6 years ago

Ideally, the cache would be using memory areas in an address independent way and you could unmap areas and then remap them in a contiguous address range.

This is only a problem on 32-bit systems where the address space is easily exhausted. On 64-bit systems there is so much spare address space that this is never an issue. So, spending however much time to implement an "address space defragmenter" is probably not very high on any priority list; just use a 64-bit build if you need to.

If that's too difficult, the cache low memory detection could also look for fragmentation.

Failure to insert things into the kernel address space should issue a low-resource notification; but it didn't because the code was if 0'd out. hrev52637 (& hrev52639 which amends it) should improve things here significantly.

Or the media system could stop leaking areas.

Obviously this is a problem. If killing the media_server fixes it, then that narrows down the possibilities. Though I'm a bit confused here: is it the kernel address space or the userland address space that gets fragmented? No matter how fragmented physical memory gets, it should be possible to allocate (non-contiguous) virtual memory as long as there is some still free.

The kernel has its own 2GB address space, and each application has its own 2GB; and so even if the user application's address space gets especially fragmented, the kernel should be fine. (Or am I missing something here?)

Or, are you just plain out of memory altogether?

in reply to:  1 comment:2 by cb88, 6 years ago

Or, are you just plain out of memory altogether?

This from what he said if you note that he turned off the swap file to make it occur quicker.

Sounds similar to why booting on low memory systems doesn't work... where there it's just read a bunch of hpkg files from USB/CD etc, they end up cached, until you run out of ram and it grinds to a halt. My Tyan PII box can't boot haiku anymore with 512MB last I checked.

comment:3 by ttcoder, 6 years ago

Cc: ttcoder added

comment:4 by AGMS, 6 years ago

Good point, a lot of those buffers are in the media server process. So it's actually running out of memory system-wide, rather than kernel address space.

I'll see how it fares with more memory in some new tests.

Like you say, the longer term workaround is to use 64 bits, and hope that the page tables can grow.

Still, it would be nice to have an OS that can play sound and read files over a longer period of time, like BeOS used to do.

comment:5 by waddlesplash, 6 years ago

If it's running out of memory then changing to 64-bit won't help.

Obviously we should be able to play audio for any length of time and not run out of memory. So this is a media server bug then.

in reply to:  5 comment:6 by cb88, 6 years ago

Replying to waddlesplash:

So this is a media server bug then.

Not necessarily, could be a filesystem / cache bug etc...

comment:7 by waddlesplash, 6 years ago

Please reread the ticket history. Cache memory increasing is a problem but this should have been solved at least partially by my recent changes. The ticket itself notes that restarting media_server fixes the other issues.

comment:8 by waddlesplash, 6 years ago

Component: System/KernelServers/media_server
Keywords: memory fragmentation long duration removed
Owner: changed from nobody to Barrett
Status: newassigned

comment:9 by AGMS, 6 years ago

With more memory, it runs better (not running out of virtual address space either). Though after an overnight run, there are 32000+ areas used by the media system, sound doesn't work, and opening ProcessController takes about 4 seconds and sometimes doesn't draw the whole graph display (guess it's adding up those areas). Restarting the media server got sound working.

Overnight the logs just show kernel memory being slowly used up, and then some funny stuff when I started using the GUI at 11:00 (oddly exactly on the hour):

2018-12-12 22:27:08 KERN: slab memory manager: created area 0x91001000 (10389598)
2018-12-13 00:08:54 KERN: slab memory manager: created area 0x91801000 (13347565)
2018-12-13 02:01:12 KERN: slab memory manager: created area 0x92001000 (16589412)
2018-12-13 03:49:52 KERN: slab memory manager: created area 0x92801000 (19776951)
2018-12-13 05:42:01 KERN: slab memory manager: created area 0x93001000 (23021653)
2018-12-13 07:31:59 KERN: slab memory manager: created area 0x93801000 (26240792)
2018-12-13 11:00:00 KERN: add_memory_type_range(29929999, 0x90000, 0x70000, 0)
2018-12-13 11:00:00 KERN: remove_memory_type_range(29929999, 0x90000, 0x70000, 0)
2018-12-13 11:00:00 DAEMON 'app_server': Application for user 0 does not support the current server protocol (0).

comment:10 by Barrett, 6 years ago

Blocking: 4954 added

This is a longstanding bug. If you want to prove me wrong bisect ;)

The buffer management is completely flawed unfortunately.

comment:11 by korli, 6 years ago

Owner: changed from Barrett to nobody

comment:12 by waddlesplash, 5 years ago

So, media_server didn't actually clean up buffer resources on crashes/non-clean quits due to a typo. This was fixed in hrev53368. Depending on how flite is exiting, it may trigger that code; so please retest this and see if it's any better.

comment:13 by leavengood, 5 years ago

Owner: changed from nobody to leavengood

Buffers are still leaked even after the recent change. This seems to essentially be the same bug as in #4954. I will keep both bugs open for now.

As Barrett says this is a longstanding bug. While the buffer management is indeed very complicated and could be improved, I don't think the core buffer management is the problem. Certainly for the buffers leaked in media_addon_server there seems to be a media add-on which creates a BBufferGroup that it never destroys. Since this occurs every time you try to play audio from a file, I suspect BMediaFile or the media_reader add-on or something like that.

I also want to run the sample program from #4920 to see what buffer behavior it exhibits, since it is not loading audio from a file, but generating it directly. If it also has the problem then BSoundPlayer (or something it uses) may be the culprit.

A lot of this code is very old, crufty and full of XXX or TODO comments (media_reader add on is quite bad in this respect.) Now maybe most of those things are not issues but it might be good to give much of the media related code an audit. The fact that it is spread among two servers, a kit, two header locations and a ton of addons does not help. Plus MediaPlayer or media_client, etc.

Even if this buffer leak is fixed, I think MediaPlayer also leaks memory in other ways, but I believe that is another issue.

comment:14 by Barrett, 5 years ago

I don't agree. If the media_server is not able to detect when a process exit without releasing stuff (and consequently kick out those resources), then the whole system is just doomed.

in reply to:  14 comment:15 by leavengood, 5 years ago

Replying to Barrett:

I don't agree. If the media_server is not able to detect when a process exit without releasing stuff (and consequently kick out those resources), then the whole system is just doomed.

It really is just this one case of the buffers in the cache in BBufferConsumer, as far as I can see. The normal case now does not leak buffers, and I think that was 99% of the problem. Since we can associate buffers in the cache with port_ids now, we should also be able to detect when a port goes away and then clean up. Well that is at least what I plan to look at next.

Seriously though, now that I understand this better, I am curious how you would change it? I know you are annoyed with Haiku and probably me, but I would actually like your opinion. Because too many things are definitely involved here and trying to share the same buffers everywhere is the core issue. If you don't think it can be fixed because of the crazy Media Kit design, I am starting to agree but unfortunately for the moment we cannot fix that. Even if it seems dumb to support a 20 year old design there are people still using it, in fact, one pretty important one: TuneTracker.

comment:16 by cb88, 5 years ago

@leavengood you may have missed but but it bears mentioning, TuneTracker has mentioned plans of moving away from Haiku due to lack of stability. So if things get fixed we may retain them otherwise they are potentially going away. I think stability seems to be on an uptick though at least...

It's unfortunate that most all of this is C++ as this sounds like an ideal situation for Rust to step in and solve the buffer sharing problems... It makes you wish there were a C++ subset/Extension that was smart enough to do the same things. When such check do exist in C++ the developer can often accidentally just walk right around them or botch it up in some way or other.

comment:17 by ttcoder, 5 years ago

They're very aware, I've been reminding them often enough *grin*. Collaboration is going great in off-ticket emails, we're watching developments like hawks, no worry. As soon as the fixes are commited to trunk we'll start testing (/me knocks on wood).

in reply to:  16 comment:18 by leavengood, 5 years ago

Replying to cb88:

@leavengood you may have missed but but it bears mentioning, TuneTracker has mentioned plans of moving away from Haiku due to lack of stability. So if things get fixed we may retain them otherwise they are potentially going away. I think stability seems to be on an uptick though at least...

Yeah, waddleplash kind of brought me in on this when I expressed interest in fixing the Media Kit problems.

It's unfortunate that most all of this is C++ as this sounds like an ideal situation for Rust to step in and solve the buffer sharing problems... It makes you wish there were a C++ subset/Extension that was smart enough to do the same things. When such check do exist in C++ the developer can often accidentally just walk right around them or botch it up in some way or other.

To be honest, the bug I fixed could have occurred in Rust too. It was not a memory leak but more of a cache invalidation bug. But Rust would probably improve the situation in other areas.

Rust can sometimes be extremely and annoyingly constraining but usually for a good reason. As I have said before on the forums I want to make Rust a first class citizen in Haiku, but have quite a lot of other things I want or need to do before I worry about that. I also don't know how other Haiku developers would feel about using Rust in any core Haiku components. I will saying writing Rust and really thinking about ownership has made be a better C++ programmer. The freedom you get in C and C++ is nice but of course freedom comes at a cost and it is easy to make serious mistakes. But this argument has been made 1000 times already by Rust and C++ evangelists :D

comment:19 by waddlesplash, 5 years ago

Blocked By: 4954 added
Blocking: 4954 removed

comment:20 by waddlesplash, 5 years ago

This should be fixed by hrev53379. Please retest.

by AGMS, 5 years ago

Attachment: SayFortunes.sh added

Bash script that uses the Flite (pkgman install flite) to read aloud fortunes. Each fortune runs media player to say a generated wave file, testing media player cleanup code.

comment:21 by leavengood, 5 years ago

I also wrote a script which prints out the count of cloned buffers in media_server and media_addon_server. I am not on that machine now but it basically is:

echo media_server:
listarea media_server | grep "buffer clone" | wc -l
echo media_addon_server:
listarea media_addon_server | grep "a BBuffer clone" | wc -l

I might have the first grep wrong, you can do a full listarea and the buffer clones are obvious (to get the right string to grep for.) If you run this before and then after a bunch of runs of flite or whatever, the numbers will now be the same.

One caveat: crashing apps don't send the message I use to clean up the buffer cache, so we will need another solution there. I strongly suspect I can figure out when a particular port_id is gone though and use that for cleanup.

Version 0, edited 5 years ago by leavengood (next)

comment:22 by AGMS, 5 years ago

After several hours of intensive testing on hrev53379/gcc2 (VirtualBox with 2GB RAM, swap file 981MB, 1 CPU core, AC97 audio), it seems to be working much better.

The test simultaneously: dribble appended to a text file, md5summed all the files on a 120GB disk volume (reading lots of data, stressing the cache), and read aloud a fortune cookie (media player called once for each fortune). The MediaServer cloned buffers (listarea | grep cloned | wc) didn't grow, stopping at 15 memory areas while running (previously it grew by 2 for each fortune played). The system total number of areas went up a bit due to disk caching, but miraculously shrank back after the system was idle for a few minutes!

The next thing to do is to test it with real radio station software, for a week.

Last edited 5 years ago by AGMS (previous) (diff)

comment:23 by leavengood, 5 years ago

Good news!!!

I still worry about the same application running for long periods using more memory as it runs, so your second test might be a bust. But I will be digging into that bug more over this next week.

comment:24 by waddlesplash, 5 years ago

Resolution: fixed
Status: assignedclosed

Well, we can consider this fixed, and leave just #9458 then.

comment:25 by nielx, 5 years ago

Milestone: UnscheduledR1/beta2

Assign tickets with status=closed and resolution=fixed within the R1/beta2 development window to the R1/beta2 Milestone

Note: See TracTickets for help on using tickets.