Opened 10 years ago

Closed 9 years ago

Last modified 9 years ago

#11497 closed bug (fixed)

Media/mixer crash, either buffer "oops" or NULL indirection in float_to_float()

Reported by: ttcoder Owned by: Barrett
Priority: critical Milestone: R1/beta1
Component: Add-Ons/Media/Mixer Version: R1/Development
Keywords: Cc:
Blocked By: Blocking: #11584
Platform: All

Description

Heavily related to #7180, in that media_addon_server crashes at boot-up in two different ways depending on the phase of the moon (sometime I get the SetBuffer() "oops" failed assertion message, sometimes I get a NULL src pointer in float_to_float() -- see below for details). Maybe also related to #5339

The "news" is that it occurs 100% of the time at boot-up here, so long as I wait a few seconds before opening MediaPlayer.. And luckily it still occurs even after I black-listed the stock mixer binary and started using a binary built from source here.. So I was able to add tracing and start digging.

Info coming up.

Attachments (1)

AAC-recording-that-breaks-media.wav (427.1 KB ) - added by ttcoder 10 years ago.
Opening this in MediaPlayer makes media_addon_server go into a tight loop, on 2 different setups.. Anyone else seeing this ?

Download all attachments as: .zip

Change History (23)

comment:1 by ttcoder, 10 years ago

So for a few weeks now I've been hanging tight to hrev48168, not upgrading it, as it reproduces the bug reliably at boot-up (contrarily to previous revs which were irritatingly erratic in triggering the crash).

Here's my findings, and why I'll need help to further track this down..

After some groping the dark, I ended up with this mod of Resampler::float_to_float():

void
Resampler::float_to_float(const void *_src, int32 srcSampleOffset,
	int32 srcSampleCount, void *_dest, int32 destSampleOffset,
	int32 destSampleCount, float _gain)
{
	register const char * src = (const char *)_src;
TRACE("src=%p \n", src);
*dest = *src;
  ...

result: When crashing, debugger shows it crash after a handful of assembly lines, instead of further down... And of course, the TRACE() (enabled with -DDEBUG=2) shows this in syslog, which also confirms that the src pointer is NULL:

KERN: ### ERROR: BAD data! from buffer %ld
KERN: src=(nil)
KERN: vm_soft_fault: va 0x0 not covered by area in address space
KERN: vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x0, ip 0x172bca6, write 0, user 1, thread 0x250
KERN: vm_page_fault: thread "Audio Mixer control" (592) in team "media_addon_server" (577) tried to read address 0x0, ip 0x172bca6 ("mixer.media_addon_seg0ro" +0x1aca6)
KERN: debug_server: Thread 592 entered the debugger: Segment violation
KERN: stack trace, current PC 0x172bca6  float_to_float__9ResamplerPCvllPvllf + 0x3a:

So I added an error statement in MixerInput::BufferReceived(), conditional on buffer->Data() == NULL and it gets triggered as well (shows up above too).

Then I ended up in AudioMixer::HandleEvent() which has this dangerous bit of code, doing an unconditional C-style unchecked cast (instead of a dynamic_cast ..etc):

AudioMixer::HandleEvent(const media_timed_event *event, bigtime_t lateness,
	bool realTimeEvent)
{
	switch (event->type) {
		case BTimedEventQueue::B_HANDLE_BUFFER:
		{
			HandleInputBuffer((BBuffer *)event->pointer, lateness);
			((BBuffer *)event->pointer)->Recycle();
			break;
		}

And there I am stuck, as the rest of the call chain is no longer in the mixer code, it gets into media server land ..etc.

Any idea how to proceed next?

*EDIT*: I'm not completely making sense of the tickets lineage but it seems...

  • #6188 is the master ticket for SetBuffer() "oops" failed assertion, with lots of duplicates
  • #7180 is the master ticket for float_to_float() crashes on NULL pointers (though #10492 has a discussion on which parameter is NULL, src or dest, noting that in the context of this ticket it is src`which is NULL)

P.S. I'm not suggesting to 'fix' this bug by simply doing if if buffer->Data() == NULL return as we don't know the underlying cause yet.. and there are hints that the root cause can trigger completley different crashes than this one, depending on chance.

Last edited 10 years ago by ttcoder (previous) (diff)

comment:2 by ttcoder, 10 years ago

Added a bit of instrumenting on the suspect BBuffer. First, here's the output for a normal (or at least, non-crashing :-) run:

KERN: Data()=0x20f000 Flags=0x0 ID=4 Type=1 SizeAvailable=14112 SizeUsed=14112 Size=14112

Here's what I see when it crashes:

KERN: AudioMixer::GetNextInput
KERN: Last message repeated 2 times.
..
KERN: Data()=(nil) Flags=0x4 ID=4 Type=1 SizeAvailable=378 SizeUsed=14112 Size=378
..
KERN: ### ERROR: BAD data! from buffer 0x1845ddb0
KERN: vm_soft_fault: va 0x0 not covered by area in address space

So presumably no clear indication that it is a mis-cast; the flags especially, differ (but neither make sense compared to the allowed values anyway ?? ).. So next I'll assume this is indeed a BBuffer, and try to find who is responsible for setting its fData field..

EDIT: hmm actually SizeAvailable() and SizeUsed() are suspect as all heck: this page says that SizeAvailable() is the size of the allocated memory, implying that SizeUsed() should be smaller or equal to it.. So writing 14112 bytes into a 378 memory alloc is a massive heap corruption, or am I missing something?

Other than that, this line implies that BBuffer::Data() is a common occurence error condition and should be handled properly..

Last edited 10 years ago by ttcoder (previous) (diff)

comment:3 by vidrep, 10 years ago

I wish you good luck in trying to nail the cause of the various media crashes in Haiku. They're probably the single most obvious and annoying bugs that Haiku has right now.

comment:4 by ttcoder, 10 years ago

Note to self: the intruiguing fFlags value must originally come from the buffer_clone_info here called here..

==

I guess it's indeed possible to find a code path not related to out-of-bounds access that still explains what I'm seeing, for instance this sequence of events...

... followed by a SetHeader() somewhere that would set "wild" AvailableSize ..etc values.

That path includes errors/asserts but they are invisible because of the release builds; It seems a good idea to run MediaPlayer next to a lib/ folder containing a debug build of libmedia.so, and/or run media_server in debug mode, to see if those asserts and errors get tripped. Will try that in a few days. For now I'll try to see if an if Data()==NULL return makes me fall back to the "oops" assert, out of curiosity.

@vidrep TBH I'm narrowly focusing on "oops" and float_to_float as those are the ones I have a reliable reproducible case for, and even then I'm trying to collect data for those who will actually know how to fix it ;-) Well I might get lucky and stumble on a fix, but that has seldom happened in the past for this humble app developer.. Saw your tickets about failed asserts in ffmpeg though, seems to me they would be alleviated if it was possible to revert to ffmpeg 0.10 ; pulkomandy announced he'll work on ffmpeg again soon

EDIT: yea if I ignore/skip the the NULL Data BBuffer, media_addon_server quikcly goes "oops", interesting. The two tickets clearly seem to be linked. Syslog:

KERN: Data()=(nil) Flags=0x4 ID=4 Type=1 SizeAvailable=425 SizeUsed=14112 Size=425
KERN: ### ERROR: MixerInput::GetMixerChannelInfo: reading wrong data, have -160001 to -1, reading from 19306666 to 19359999
..
KERN: debug_server: Thread 592 entered the debugger: Debugger call: `oops'
..
KERN:   (0x79d63ab8)  0xfbfdca  SetHeader__7BBufferPC12media_header + 0x32
KERN:   (0x79d63ae8)  0xfc0e30  HandleMessage__15BBufferConsumerlPCvUl + 0x19c

EDIT2: Looks like I'm gonna have to retool to debug the "oops" tripped assert from now on: now that I'm running with a debug build of libmedia.so I can't get the NULL pointer in float_to_float() any more, I always get the "oops", even though not much has changed, I've just compiled it with DEBUG=3 to get the debug output, sigh.

Last edited 10 years ago by ttcoder (previous) (diff)

comment:5 by ttcoder, 10 years ago

More data points..

I have a loaned old laptop with a Pentium-M and auich audio, which worked ~98% well with an older 477xx rev; upgraded it this morning to hrev48455 and now it behaves like my thinkpad: it trips the "oops" assert after startup when playing anything. So it seems..

  • the bug occurs with both HDA and Auich ICH, so probably not driver dependant.
  • things got worse in last month's revs, even though no relevant changes occured.

So those all hint at something timing/race-condition dependant.

Meanwhile I'm struggling to run media_server/media_addon_server in debug mode to collect more info.. Will get there eventually..

in reply to:  5 comment:6 by anevilyak, 10 years ago

Replying to ttcoder:

Meanwhile I'm struggling to run media_server/media_addon_server in debug mode to collect more info.. Will get there eventually..

Your simplest bet for that would be to build your own image with SetConfigVar DEBUG : HAIKU_TOP src servers media : 1 : global ; in build/jam/UserBuildConfig (insert more corresponding lines for media_addon_server and/or libmedia as needed).

comment:7 by pulkomandy, 10 years ago

Hi, The mixer code was refactored a bit today (in hrev48460). I doubt this changes anything, but mentionning it just in case.

comment:8 by anevilyak, 10 years ago

Blocking: 11584 added

(In #11584) Duplicate of #11497.

comment:9 by pulkomandy, 10 years ago

I can reproduce the "oops" crash quite consistently when playing Worms Armageddon (with the patch for the thread stack problem and the fixed game kit). However the issue goes away as soon as I compile the media kit with debug info (just the -ggdb, not even setting DEBUG=1), so I'm not able to debug it.

comment:10 by ttcoder, 10 years ago

I think I'm seeing something similar here depending on which combination of components I run, but the very main problem I'm running into is newbie confusion: I run a custom build of media_server, does not work (won't find media_addon_server by signature).. I make it launch by folder/path instead.. won't work (the resultant addon_server apparently does not communicate with the media_server, having failed to find add-ons and physical audio input/output).. So I dig into the code, e.g. DefaultManager::LoadState(), I add tracing, look for who's supposed to fill out the fMsgList member variable (apparently config file MDefaultManager), look for any kinda of reference on that file or a raw copy of it (opengrok comes up empty) and so on.. Never ending stuff.. :-o

So I'm taking extended breaks from debugging this every few days to keep my sanity ;-) from reviewing that mountain of (new to me) code and scanning for significant errors in the metric tons of warnings and errors that occur in a.. normally functionning server :-b.. and secretely hoping that somebody else more qualified runs into that same reproducible case *g*..

And indeef if/when I finally manage to run the whole suite in debug mode it might not reproduce the crash or the oops any more, so I might have to backtrack to my "last crashy state". I'll try to post useful progress reports at that time...

Anyway yes this seems to be load-dependant in relation to the available CPU power: powerful machines get the "oops" only at runtime if under heavy load, whereas older 2GHz laptops like this one even get it at the very beginning of a session, as the CPU Load that occurs at boot time is enough for them to crash already I guess (correction: "is enough to do something wrong with the initialization" -- the crash occurs when opening mediaplayer after the desktop appears).

comment:11 by pulkomandy, 10 years ago

So, to get debug infos enabled without making the crash go away:

  • Edit src/kits/media/Jamfile and add "SubDirC++Flags -ggdb ;"
  • Remove all files in generated/objects/haiku/x86_gcc2/release/
  • Build the lib again

With this I could reproduce the crash and have a look with Debugger. The crashed thread is "Audio Mixer control". It crashes in CONSUMER_BUFFER_RECEIVED (that was easy to guess even without the debug info).

This part of the code is about receiving media buffers, which comes from some producer. The buffers are recycled, after they are sent to the audio output, so they can be used by the producer again. This way, after an initial phase of creating buffers, things should stabilize and the buffers are reused in a loop.

To make this fast, the buffers are stored in a cache (this is fBufferCache). What is stored here is a small object with a pointer to the data. The actual data for the buffer is in shared memory so it can be shared with other applications.

The first time we get a buffer, it is not in the buffer cache yet and we create a new BBuffer object for it (this happens in GetBuffer). Then we fill that buffer with the relevant data and pass it to BufferReceived.

There is a check there (this is our "oops") that the buffer we get from the cache comes from the right producer, and not from another one. And, this check fails. This means our buffers are mixed up. Either the buffer cache got somehow corrupted, or maybe it still references buffers that were deleted.

I'm surprised by the complete lack of locking in that part of the code, which is probably on purpose to have good performance and low latency. I hope everything is carefully single threaded, but even then, it looks like things like a dropped message could put things out of sync here. How safe are ports with regard to this?

comment:12 by ttcoder, 10 years ago

IIRC think ports are created with a certain capacity, and <strike>start dropping new incomings if write_port incomings come fast enough to override that capacity before read_port removes them (my experience on ports is almost only with BLoopers: their port is created with something like 200 slots, and the thread tries to run as fast as possible to move them safely to a BMessageList where capacity is not limited... yet I know from trying unreasonable stuff with them that even that can be overwhelmed, I could e.g. send 201+ BMessages before the thread would wake up or be available).

==

If the above lead does not give easy results, also appreciated if you could try this tracing (which I cannot do here):

BBuffer::SetHeader(const media_header* header)
  ..
+		//or debug_printf() if TRACE() does not end up into syslog?
+		TRACE("fMediaHeader.buffer: %ld (wanted: %ld)  perf_time: %Ld (wanted: %Ld) Data(): %p\n", fMediaHeader.buffer, header->buffer, fMediaHeader.start_time, header->start_time, Data());
		debugger("oops");

and

BufferCache::GetBuffer(media_buffer_id id)
  ..
	if (buffer == NULL)
		return NULL;
+	if (buffer->Data() == NULL)
+		debugger("wow, this might be the reason we later hit 'oops' indeed!");

EDIT: reason being, we're attacking the problem from two opposite ends (under heavy load froma CPU-intensive app, and me it's right at the start when the BBuffers start being created at all), and maybe on this end we can find good hints too .. What I'm seeing here is that a NULL Data() is correlated to, or even a precursor (so maybe also the cause?) of the "oops" assert/debugger drop.

Still, I'd like to /also/ follow your lead too so I'm going to study the code some more :-p.. So far for me, who is sending what buffer in what thread to whom is still foggy.. I see that BufferCache is only referenced by BufferConsumer (and it also owns a BBufferGroup).. and there are at least a couple threads based on BufferConsumer in media_addon_server: AudioMixer, MultiAudioNode ..

Last edited 10 years ago by ttcoder (previous) (diff)

in reply to:  12 comment:13 by bonefish, 10 years ago

Replying to ttcoder:

IIRC think ports are created with a certain capacity, and start *dropping* new incomings if write_port incomings come fast enough to override that capacity before read_port removes them

The behavior of ports is well documented in the BeBook. Ports themselves never drop messages. When a port's capacity is exhausted subsequent write_port() calls will simply block. By using write_port_etc() the caller can decide not to wait indefinitely, in which case the message will not be sent. Which is the caller's decision and thus the caller has to deal with the consequences. I'm not familiar enough with the media kit code to say whether it uses timeouts and, if so, whether it handles those correctly.

(my experience on ports is almost only with BLoopers: their port is created with something like 200 slots, and the thread tries to run as fast as possible to move them safely to a BMessageList where capacity is not limited... yet I know from trying unreasonable stuff with them that even that can be overwhelmed, I could e.g. send 201+ BMessages before the thread would wake up or be available).

The C++ messaging API allows the same control. When sending messages you can decide whether you want to potentially block indefinitely or set a timeout.

by ttcoder, 10 years ago

Opening this in MediaPlayer makes media_addon_server go into a tight loop, on 2 different setups.. Anyone else seeing this ?

comment:14 by ttcoder, 10 years ago

Not working specifically on this ticket this week, but while testing new AAC/AC-3 support added in hrev48554 changeset (thanks Adrien!, AC-3 works great), I came up with the attached file, that reliably makes the system go into the weeds :

It freezes Deskbar and MediaPlayer, /bin/top shows 100% CPU being shared between media_addon_server and MediaPlayer, and sometimes after a couple minutes the tight loop concludes with a crash in the resampler or by the "oops" assert; this is on both my laptop with an old 46168 and the desktop with hrev48562, so I guess other people will be able to reproduce this ?

Maybe this is another data point to more easily debug the "oops" problem.

I'm not ruling out the file itself being corrupt in some way though, since it was not generated by MediaConverter but by my own app; but that would then show a vulnerability if nothing else..

EDIT: the file is indeed corrupt, won't play in windows7 in VLC 2.x or mediaplayer; I was experimenting with a custom-made resampler that downsamples and that converts from B_AUDIO_INT to B_AUDIO_FLOAT and it evidently took me a few iterations to get it right ;-) So it seems to the media_addon_server is vulnerable to float sample values getting out of their legal [0.0 .. 1.0] range..?

Version 1, edited 10 years ago by ttcoder (previous) (next) (diff)

comment:15 by Barrett, 9 years ago

Owner: changed from nobody to Barrett
Status: newassigned

comment:16 by ttcoder, 9 years ago

Had some time this morning to do a fresh install of 49477 on the old PentiumM laptop. No change, same symptoms.

Then I pkgman up'ed to today's hrev49559 and... All of a sudden it is rock solid! No matter how many tracks I throw at Media Player (and media addon server), it keeps working.

There's been a half dozen changes in that range. Maybe you fixed it for good Dario ? I gotta test on my main laptop (the one this ticket originates from) next to be sure but it looks like it might get it back on its feet :-)

comment:17 by Barrett, 9 years ago

The issue showed some signs of buffers not being handled synchronously, in the logs there were some mess in the order they were instantiated. I was working on the BBufferGroup to fix this only with partial success as you know, but there's a strong possibility that my ControlLoop() rewrite restored the supposed way to work. Basically the old algorithm wasn't linear and didn't ensure we have a 1:1 ratio with port reads / events handled, I am waiting so that you cant test also the other system :-), and i will be happy to close this ticket and related ones.

comment:18 by Barrett, 9 years ago

Any chance to test it on the other system? :-)

comment:19 by ttcoder, 9 years ago

Waiting for a fix to #12298 :-/ That's my "safe queen" laptop, I want to keep installations to a minimum on that one.. Especially now that its ethernet port is fried (gotta get it back online with Wi-Fi, wish me luck).

comment:20 by ttcoder, 9 years ago

Ticket should be closed -- Finally got the resolve to do the jump to hrev49627 on my thinkpad and not regretting it, lots of things work better :-) Including media, which has been 100% stable for the past hour or so and through a couple reboots. Crossing fingers, looks like you fixed it for me. And we're not hearing either from others on the "oops" assert. Some months ago new tickets (dupes) were created monthly about that assert, but not any more. Good sign.

/me licks his wounds, recovers from a few months of PTSD and takes advantage of his newfound stability *G*.

Rock on !

EDIT: update a week later: this is probably not worth delaying the beta for, so just posting this silent update that it's improved but not completely fixed -- I still see the "oops" assert sometimes; but the media_server can be restarted successfully when that happens, contrarily to the previous situation where it was about impossible to recover without a reboot, so no contest, big improvement here :-)

Last edited 9 years ago by ttcoder (previous) (diff)

comment:21 by Barrett, 9 years ago

Nice :-), since it was a beta blocker it looks like a good step on. Fixed in hrev49503.

comment:22 by Barrett, 9 years ago

Resolution: fixed
Status: assignedclosed
Note: See TracTickets for help on using tickets.