Opened 5 years ago

Closed 4 years ago

#15728 closed bug (not reproducible)

Crash on free in PainterAggInterface

Reported by: humdinger Owned by: axeld
Priority: blocker Milestone: R1/beta3
Component: Servers/app_server Version: R1/Development
Keywords: Cc: ttcoder
Blocked By: Blocking: #16246, #16367
Platform: All

Description

This is hrev53888, 32bit (VESA)

Got this crash (Web+ may have something to do with it, it was loading a page):

thread 50918: w:50824:offscreen 
	state: Exception (Segment violation)

	Frame		IP			Function Name
	-----------------------------------------------
	0x70799498	0x18b5024	BPrivate::processHeap::free(void*) + 0x64 
[...]
		Frame memory:
			[0x70799470]  .I...... ....k..   9c 49 8e 01 04 00 00 00 20 16 ea 19 09 6b 8b 01
			[0x70799480]  .I....yp..yp....   9c 49 8e 01 94 94 79 70 90 94 79 70 03 00 00 00
			[0x70799490]  ....;...           08 08 08 02 3b 00 00 00
	0x707994c8	0x18b6ba5	free + 0xa9 
	0x70799500	0x1826187	operator delete [](void) + 0x1f 
	0x70799530	0x1bcc2aa	_._19PainterAggInterface + 0x14e 
	0x70799570	0x1ba400b	_._7Painter + 0x63 
	0x707995a0	0x1b96a75	_._13DrawingEngine + 0x49 
	0x70799720	0x1b941ef	_RenderSource() + 0x3ff 
	0x707997b0	0x1b9286c	AlphaMask::_Generate() + 0x80 
	0x70799820	0x1b925c6	AlphaMask::SetCanvasGeometry(IntPoint, IntRect) + 0x1c2 
	0x707998b0	0x1b588e2	ServerWindow::_UpdateDrawState(View*) + 0x102 
	0x70799bf0	0x1b4f745	ServerWindow::_DispatchViewMessage(int32, BPrivate::LinkReceiver&) + 0x2ebd 
	0x70799d20	0x1b4c7c9	ServerWindow::_DispatchMessage(int32, BPrivate::LinkReceiver&) + 0x1251 
	0x70799da0	0x1b5830e	ServerWindow::_MessageLooper() + 0x256 
	0x70799dd0	0x1b2b0a6	MessageLooper::_message_thread(void*) + 0x26 
	0x70799df8	0x182fccb	thread_entry + 0x27 
	00000000	0x600aa258	commpage_thread_exit + 0 

Not sure the ticket's summary makes sense, please correct.

Attachments (5)

app_server-685-debug-19-02-2020-16-44-07.report (432.3 KB ) - added by humdinger 5 years ago.
complete debug report
app_server-844-debug-05-07-2020-00-24-43.report (50.9 KB ) - added by waddlesplash 4 years ago.
app_server-564-debug-13-07-2020-03-40-57.report (944.5 KB ) - added by Pete 4 years ago.
app_server-561-debug-15-07-2020-05-21-32.report (331.6 KB ) - added by Pete 4 years ago.
report from app-server crash when switching workspaces
app_server-561-debug-22-07-2020-23-12-58.report (467.7 KB ) - added by Pete 4 years ago.
crash on trying to play YouTube video

Change History (39)

by humdinger, 5 years ago

complete debug report

comment:1 by waddlesplash, 5 years ago

Highly likely it's yet another heap corruption problem in app_server. We really could stand to investigate those...

comment:2 by waddlesplash, 4 years ago

Blocking: 16246 added

comment:3 by ttcoder, 4 years ago

Landing here by way of #16246. Just occurred to me in beta2+111, with Web+ starting to play from YouTube. Edit: I could not use the Ctrl-Alt-Del combo, even when held for several seconds; Alt-SysRq-D took me to KDL though, allowing me to reboot.


in reply to:  3 comment:4 by X512, 4 years ago

Replying to ttcoder:

Landing here by way of #16246. Just occurred to me in beta2+111

Have you applied https://git.haiku-os.org/haiku/commit/?h=r1beta2&id=02b948fda7ac7463d57b2bbeda7913ef4f9c72cc?

comment:5 by ttcoder, 4 years ago

Cc: ttcoder added

Great insight X512. I pkgman updated to apply your patch (beta2/115) and now the video plays to the end, no crash at all.

Will report if I have a change of heart, but my working assumption now is that the bug is fixed in the latest commit of beta2 branch.

/me kinda hopes that X512 will get interested in the media_server some day and kick ass there too *g*

comment:6 by waddlesplash, 4 years ago

Unfortunately that patch is not sufficient to fix the crash entirely; it does seem to correct some kind of problem, but I got a crash with a very similar stacktrace yesterday. Once I manage to get internet back on this machine, I'll upload it...

comment:7 by waddlesplash, 4 years ago

Here's a crash I got yesterday, on (as you can see) hrev54390.

I read through the code again and I am pretty baffled as to how this is occurring. Perhaps the "shape" pointer is garbage because this is a use-after-free (UaF) somehow? But then the ReleaseReference should have crashed. I also looked through all other users of AlphaMask and all of them appear to be doing ref-counting correctly, or using BReference...
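To illustrate the kind of ref-counting mistake being ruled out here, the following is a minimal, self-contained sketch. It is not app_server code: `RefCounted`, `Ref`, and `AliveWhileHeld` are hypothetical names that only loosely mirror the acquire/release pattern mentioned above. It shows how a single unbalanced release destroys an object while another holder still expects it to be alive, which in real code would lead to a use-after-free or a later crash inside `free()`.

```cpp
#include <atomic>

// Hypothetical reference-counted object (an assumption for illustration,
// not Haiku's BReferenceable). It starts with one reference owned by the creator.
class RefCounted {
public:
    explicit RefCounted(bool* destroyedFlag)
        : fCount(1), fDestroyed(destroyedFlag) {}

    void Acquire() { fCount.fetch_add(1); }

    // Drops one reference; deletes the object when the last one is gone.
    void Release()
    {
        if (fCount.fetch_sub(1) == 1)
            delete this;
    }

private:
    ~RefCounted() { *fDestroyed = true; }

    std::atomic<int> fCount;
    bool* fDestroyed;
};

// Hypothetical RAII holder: acquires on construction, releases on destruction.
class Ref {
public:
    explicit Ref(RefCounted* object) : fObject(object)
    {
        if (fObject != nullptr)
            fObject->Acquire();
    }
    ~Ref()
    {
        if (fObject != nullptr)
            fObject->Release();
    }
    Ref(const Ref&) = delete;
    Ref& operator=(const Ref&) = delete;

    // Forget the pointer without releasing it. Used below only so the demo
    // never touches freed memory.
    void Detach() { fObject = nullptr; }

private:
    RefCounted* fObject;
};

// Returns true if the object is still alive while a holder is in scope.
bool AliveWhileHeld(bool extraRelease)
{
    bool destroyed = false;
    RefCounted* object = new RefCounted(&destroyed);  // count = 1 (creator)
    bool alive;
    {
        Ref holder(object);      // count = 2
        object->Release();       // creator drops its reference, count = 1
        if (extraRelease)
            object->Release();   // BUG: unbalanced release, object is deleted
        alive = !destroyed;
        if (extraRelease)
            holder.Detach();     // avoid a real use-after-free in this demo
    }
    return alive;
}
```

With balanced calls, `AliveWhileHeld(false)` reports the object as alive for as long as the holder exists; with the extra release, `AliveWhileHeld(true)` reports that it was destroyed underneath the holder.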

comment:8 by waddlesplash, 4 years ago

Blocking: 16367 added

comment:9 by Pete, 4 years ago

I'll report my experiences, in case it adds anything (I originally appended to #15178, which looked similar, but apparently is not.)

Since installing beta 2 (hrev54154-110) I've had -- I think -- three app-server crashes. Two were as above -- trying to play a Youtube video. The other was after I had just unmounted a BFS USB stick; it crashed when I pulled the stick out. No debug reports -- sorry.

As it was white-screen, but not KDL, I tried to resume, but all that happened was that when I did finally hit the off button and rebooted, all my open folders were in the first Workspace! The third time I went straight to the off button, and folders remained in place.

I haven't managed to update, so I can't report on 115, but Jason Dodd on the mailing list says it still happens to him after updating.

comment:10 by Pete, 4 years ago

Had another crash, and this time got to save a report. Can't make sense of it myself, but I hope someone can. For a start the time doesn't make sense. I suppose it's UTC, but the crash was at ~7:15pm not at 40 minutes past the hour. And I'm pretty sure there were other windows open that don't appear. The WebPositive access was yesterday --on a previous boot!

Anyway, what happened this time was that I was playing an ogg file in another workspace. The music finished, so I wanted to go back to that workspace to close the player. I clicked in the Workspaces window and the crash was immediate.

by Pete, 4 years ago

report from app-server crash when switching workspaces

comment:11 by Pete, 4 years ago

I've attached one more report. Again, it was when trying to switch Workspaces, but this report seems to match the events better. The time is correct, and the stack shows that Workspaces was involved.

comment:12 by diver, 4 years ago

This is a completely different crash in Desktop::SetFocusWindow. Have you looked if it was already reported?

in reply to:  12 comment:13 by X512, 4 years ago

Replying to diver:

This is a completely different crash in Desktop::SetFocusWindow.

It is similar to #6484.

in reply to:  12 comment:14 by Pete, 4 years ago

Replying to diver:

This is a completely different crash in Desktop::SetFocusWindow. Have you looked if it was already reported?

Are you sure? There actually seem to be a number of tickets reporting app_server crashes that all relate to beta 2. I initially reported the crashes I'm getting in #15178, but Waddlesplash redirected me here. (I've had what appears to be the same crash caused by different actions -- changing workspaces, accessing YouTube, and maybe other things. I'm assuming one underlying cause.) ttcoder came here via #16246. #6484 sounds similar, but it was 10 years ago!

comment:15 by Pete, 4 years ago

According to Waddlesplash this is where YouTube crashes should be reported (not #15178, which seemed appropriate to me), so I'll attach a report for another one. This time I was on the forum, reading https://discuss.haiku-os.org/t/outside-of-haiku-what-are-you-doing/9341/76 and followed the link to YouTube in that post. Others reported it unplayable, but I thought I'd try. It churned on loading for a minute or so, then crashed.

by Pete, 4 years ago

crash on trying to play YouTube video

comment:16 by ttcoder, 4 years ago

Now, I'm really puzzled... I tried to make a "franken-rev", inserting an older app_server (53894) into Haiku R1/beta2, by way of an hpkg installed to /system/packages to override the "stock" app_server.

Everything went fine for a couple hours, and I was feeling very smug and happy <g>.. and then app_server crashed, with the usual "DWARF.." dump on a white screen.

I'm baffled here... When running hrev53894 proper, we never get any app_server crashes like that, so why would that same app_server start crashing when inserted into R1/beta2... Yet Another Heisenbug... <s>


comment:17 by waddlesplash, 4 years ago

Did you double-check that the app_server you created was actually being used? I.e. catattr SYS:PACKAGE /path/to/appserver

comment:18 by ttcoder, 4 years ago

[yeah I know it's a delicate matter, with the need for the hpkg to contain the "system" flag, get the PackageInfo right etc, and thus I checked] both before rebooting and after rebooting yes, checked the mtime, the file size (the app_server extracted from the old nightly is a few thousand bytes smaller), the pkg attribute (now easily available in the third tab of the "Get Info" window, nice!). If that wasn't enough, I also tried running AutoCast, and see that my "fake transparency" hack is broken, for the first time ever : there is now a blank white background instead of "see through", so that's additional proof (likely due to running the newer libbe.so with an older app_server, which breaks X512's transparency clean-ups). I'll check a third time next time I boot that partition to make double extra sure :-) but it's a foregone conclusion. The heap corruption bug might have been there several months before, and it's only running it in Beta2 etc that it gets triggered (could be a timing problem, a "race" behaving differently : the beta2 feels faster, more snappy than nightlies, due to the Release build profile probably).

[Edit: anyway, after two hours of random testing, I went to my "staple" youtube test : "No Man's Sky Gameplay Trailer", in full-window mode (no YT interface), tried clicking around as it was playing happily, and finally app_server crashed a few seconds before the end]

Edit2: the crash got captured in syslog:

KERN: debug_server: Thread 19325 entered the debugger: General protection fault
KERN: stack trace, current PC 0x33815c0b60  _ZN8BPrivate11processHeap4freeEPv + 0x30:
KERN:   (0x7faafe9e2ce0)  0x33815c1d02  free + 0x42
KERN:   (0x7faafe9e2d00)  0x9cb1db77c  _ZN10shape_dataD0Ev + 0x3c
KERN:   (0x7faafe9e2d20)  0x49bf6383ae  _ZN14BReferenceable16ReleaseReferenceEv + 0x1e
KERN:   (0x7faafe9e2d40)  0x9cb234246  _ZN14ShapeAlphaMaskD1Ev + 0x26
..
KERN:   (0x7faafe9e31a0)  0x9cb1fd5ee  _ZN12ServerWindow14_MessageLooperEv + 0x23e
KERN:   (0x7faafe9e3210)  0x9cb1df9ea  _ZN13MessageLooper15_message_threadEPv + 0xa
KERN:   (0x7faafe9e3220)  0x3381534f69  thread_entry + 0x19
KERN: <BEEP>
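The syslog trace above prints mangled C++ symbol names. As a reading aid, here is a minimal sketch for turning them back into readable names. It assumes a GCC or Clang toolchain that provides `abi::__cxa_demangle` in `<cxxabi.h>` (Itanium C++ ABI); the helper name `Demangle` is hypothetical. Note that the GCC2-style names in the original 32-bit report (such as `_._19PainterAggInterface`) use an older mangling scheme that this call does not handle.

```cpp
#include <cstdlib>
#include <cstring>
#include <cxxabi.h>
#include <string>

// Hypothetical helper: returns the demangled form of an Itanium-ABI symbol,
// or the input unchanged if it is not a mangled name.
std::string Demangle(const char* symbol)
{
    // Plain C symbols such as "free" or "thread_entry" are left untouched.
    if (std::strncmp(symbol, "_Z", 2) != 0)
        return symbol;

    int status = 0;
    // With a null output buffer, the result is allocated with malloc()
    // and must be released with free().
    char* demangled = abi::__cxa_demangle(symbol, nullptr, nullptr, &status);
    if (status != 0 || demangled == nullptr)
        return symbol;

    std::string result(demangled);
    std::free(demangled);
    return result;
}
```

For example, `Demangle("_ZN8BPrivate11processHeap4freeEPv")` yields `BPrivate::processHeap::free(void*)`, and `Demangle("_ZN14BReferenceable16ReleaseReferenceEv")` yields `BReferenceable::ReleaseReference()`, which matches the destructor path through `ShapeAlphaMask` and `shape_data` seen in the trace.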

comment:19 by ttcoder, 4 years ago

Confirming, whoever looks into this might want to review the change-logs before 53894. It's at least the second time I see a station running hrev53894 with huge memory leaks ; this one had 2.3 GB (two giga bytes) in app_server after 14 days of use, and had to reboot after a hard "freeze" :

STC:Thu Jul 30 18:48:18 2020  Memory usage has gone up from 2685423616 to 2954039296 bytes (86.1%).
STC:Thu Jul 30 18:48:18 2020  1) 2353147904 bytes used by /boot/system/servers/app_server (team 440)

Questions that come to mind:

  • is that memory leak a separate issue, or did a single change-set cause both the non-released mem and the PainterAggInterface heap issue ?
  • is that leaked heap, or leaked areas ? (note to self: SC reports don't mention the distinction, but my log does a "listarea heap" so try to collect that)

EDIT: there's been AlphaMask changes added shortly after beta1 (52295), e.g. a "new AlphaMask" in hrev52327, though that one seems to be properly matched with a "release reference" and not leaked.. And another "new AlphaMask" in hrev52326 which does not seem to be matched with a release reference

EDIT2: also, this ticket is listed as "normal" priority on "Unscheduled" milestone, which seems... A little under-handed.


comment:20 by ttcoder, 4 years ago

This time I tried a frankenstein-rev with a beta1 app_server inserted into R1/beta2, but it locks up solid as the desktop appears. I could just drop into KDL (only from the laptop's built-in keyboard). Invoking syslog+tail I see no message related to an app_server crash or launch_daemon, the last few lines are intel_extreme tracing. And the syslog is not preserved after a reboot, despite waiting 30 seconds.

So there must be dependencies in libbe.so or elsewhere that are not satisfied, if going back that far (to beta1). Shame, I bet it would have solved the memory leak and crashing, sigh.

in reply to:  19 comment:21 by bitigchi, 4 years ago

Replying to ttcoder:

EDIT2: also, this ticket is listed as "normal" priority on "Unscheduled" milestone, which seems... A little under-handed.

IMO all crashes and KDL's should be tagged as high priority and set to next available milestone. It's not good QA practice at all to leave these tickets unscheduled.

Maybe a script could be written to automate this, e.g. words like "crash" and "KDL" can automatically set ticket attributes (if not set already).
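A rough sketch of what such an automation rule could look like is shown below. Everything in it is an assumption for illustration: the function names, the keyword list, and the idea that "normal" means "not set yet" are hypothetical, and it does not use any real Trac API.

```cpp
#include <algorithm>
#include <cctype>
#include <initializer_list>
#include <string>

// Lower-case a copy of the text so that keyword matching ignores case.
std::string ToLower(std::string text)
{
    std::transform(text.begin(), text.end(), text.begin(),
        [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return text;
}

// True if the text mentions one of the (assumed) crash-related keywords.
bool MentionsCrash(const std::string& text)
{
    const std::string lowered = ToLower(text);
    for (const char* keyword : {"crash", "kdl"}) {
        if (lowered.find(keyword) != std::string::npos)
            return true;
    }
    return false;
}

// Suggests a priority for a ticket. A priority that someone already changed
// from the assumed default ("normal") is never overridden.
std::string SuggestPriority(const std::string& summary,
    const std::string& currentPriority)
{
    if (currentPriority == "normal" && MentionsCrash(summary))
        return "high";
    return currentPriority;
}
```

Applied to this ticket's summary, `SuggestPriority("Crash on free in PainterAggInterface", "normal")` would suggest "high", while a ticket already marked "blocker" would be left alone. Plain substring matching is deliberately simple and would need review by a human triager to avoid false positives.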

comment:22 by pulkomandy, 4 years ago

Milestone: Unscheduled → R1/beta3
Resolution: fixed
Status: new → closed

comment:23 by pulkomandy, 4 years ago

Resolution: fixed (removed)
Status: closed → reopened

comment:24 by pulkomandy, 4 years ago

So there must be dependencies in libbe.so or elsewhere that are not satisfied, if going back that far (to beta1). Shame, I bet it would have solved the memory leak and crashing, sigh.

Well, older versions not having the alpha mask code surely wouldn't hit a bug in the alpha mask code. But they wouldn't work, either, because the interface kit makes use of the feature. Makes sense, I guess...

Also, it doesn't help much to have mix-and-matched tests here, because they are hard to reproduce and quite likely to add more problems than they solve (I understand it could be useful for you to find a workable setup, still).

So, what we have so far:

  • Crashes always in the memory allocation code, which hints at heap corruption
  • Crashes often in the AlphaMask code, which is mostly exercised by WebPositive
  • In all reports attached, there are a lot of "some BLocker" in app_server, so the whole thing is under some stress (lots of open windows or open tabs in Web+). It doesn't seem to happen in a generally idle system.

It would be nice if we could reproduce this in a more predictable way, maybe with a test app stressing the use of alpha masks?

comment:25 by pulkomandy, 4 years ago

comment:26 by ttcoder, 4 years ago

so the whole thing is under some stress (lots of open windows or open tabs in Web+). (...) It would be nice if we could reproduce this in a more predictable way, maybe with a test app stressing the use of alpha masks?

Do the youtube "thumbnails" involve alpha masks by any chance? I'm asking because... Today I had time to boot into beta2 (twice), so I gave it a go... Reproduced the crash twice in two attempts, hovering the mouse above (among other things) youtube thumbnails, during playback and especially after the video is done:

  • booted into beta2 unmodified ("stock" app_server)
  • immediately launched Web+
  • Command-T to create a _second_ tab
  • typed the URL for the full-window "no man's sky" trailer : http://www.youtube.com/embed/nLtmEjqzg7M
  • let it play to the end (didn't find a way to reproduce the crash otherwise, maybe that's significant)
  • let W+ recover after the end (always takes a while)
  • click the circular arrow icon at bottom left, to restart playing.
  • there's some sort of visual bug, where W+ displays a rotation "please wait" symbol (shaped like a round arrow) on a pitch black full-window background, but the symbol moves around the window quickly, as if it had a wild BView.Transform() call. That state remains for a good while, allowing time to do this:
  • hover the mouse above the red/gray progress bar, left to right, then back left, to show a maximum of thumbnails from the video, hovering even more intensely than I did during playback.
  • crash

Observation: at the debugger prompt that comes up, I typed "threads" to see what threads are running in app_server, and both times I saw a thread named "Reason: xxx":

  • "w:908:offscreen" ("Reason: _numBlocks > 0")
  • "w:xxx:offscreen" ("Reason: Segment violation")

EDIT: yup it got captured in the syslog (retrieved over from my beta1 partition) each time, e.g.:

KERN: 1000: DEBUGGER: _numBlocks > 0
KERN: debug_server: Thread 1000 entered the debugger: Debugger call: `_numBlocks > 0'
KERN: stack trace, current PC 0x1d7240d01c1  _kern_debugger + 0x9:
KERN:   (0x7f30c3894cc0)  0x1d72415bd02  free + 0x42
KERN:   (0x7f30c3894ce0)  0x118e7531921  _ZN7PainterD0Ev + 0x11 (..)
KERN:   (0x7f30c3894d00)  0x118e7522480  _ZN13DrawingEngineD2Ev + 0x30 (..)
KERN:   (0x7f30c3894f10)  0x118e751b81b  _ZN9AlphaMask9_GenerateEv + 0x4b (..)

So if I'm not completely misunderstanding this, it's possible that the sequence of events is 1) Web+ crashes, 2) debugger attempts to open its normal BAlert, 3) that crashes the app_server, 4) debugger is invoked a second time, but this time in "console" mode. Though I suppose that sequence does not make a huge difference compared to the previously assumed one.

Well maybe it does -- what if I configure "/boot/home/config/settings/system/debug_server/settings" with a "WebPositive : log" line, or some such ? Will try that next. I'll have less time to dedicate to this from now on though, so might leave it to others. We don't want to make the ticket unbearably long anyway. EDIT: I'm not set up for compiling Haiku ATM.

EDIT:

Also see #16489 for a possible reproducible case (not for everyone ?)


comment:27 by pulkomandy, 4 years ago

Well...

there's some sort of visual bug, where W+ displays a rotation "please wait" symbol (shaped like a round arrow) on a pitch black full-window background, but the symbol moves around the window quickly, as if it had a wild BView.Transform() call.

Yes, there is a known issue with transforms in our WebKit. Clearly it is not my domain of expertise: I wrote the code trying to use what I remember of math courses I took 10+ years ago, and I can't get things right this way. However, it's harmless; it just draws things at the wrong position.

1) Web+ crashes

I don't think it does. app_server crashes, and the two threads you listed are named w:908:offscreen and w:908:offscreen. These are the two threads that crashed: one because of a segment violation and the other because of heap corruption, apparently.

These "offscreen" threads are created when using a view that draws to an offscreen bitmap (which Web+ does a lot).

Alpha masks were added to app_server specifically for use in WebKit, and I think no other apps (besides a few test ones) are using them. I don't know specifically about YouTube, but they are used in many places in WebKit for drawing, so it's quite likely that they are used there.

EDIT: I'm not set up for compiling Haiku ATM.

I have not hit this crash a single time so far, so I don't really know how I can help. But I guess I could try going more to youtube.

comment:28 by pulkomandy, 4 years ago

Priority: normal → blocker

comment:29 by pulkomandy, 4 years ago

With the current version of youtube it is not possible anymore to play videos. Are there other websites that reproduce this problem?

comment:30 by humdinger, 4 years ago

Are there other websites that reproduce this problem?

Not that I know of...

comment:31 by pulkomandy, 4 years ago

In that case, shall we move this issue out of the beta3 release or close as not reproducible for now?

comment:32 by waddlesplash, 4 years ago

Considering this is another AlphaMask-related crash, it seems probable that it is or was related to #16489, which has a known cause thanks to the guarded heap, so perhaps that one should get more attention then.

comment:33 by pulkomandy, 4 years ago

#16489 is using current webkit builds and current nightlies, which use a completely different code path (new drawing modes implemented by KapiX). And we know that the same website used in #16489 does not crash when using current releases.

So, #16489 is a regression from beta2, while this ticket is about code that was already there in beta2. It is safe to conclude that they are unrelated (and we probably have, or had, multiple such problems in app_server).

Also, #16489 does not crash app_server anymore in current nightlies. It now runs out of ports.

comment:34 by pulkomandy, 4 years ago

Resolution: not reproducible
Status: reopened → closed

Closing as not reproducible because the original issue here can no longer be reproduced: YouTube does not allow video playback in WebPositive at all anymore.

Sorry to everyone else adding different and unrelated crash reports and trying weird mixes of different haiku revisions, but you make it impossible to follow what's going on. And to Haiku developers insisting that all tickets that involve webpositive or app_server are probably related: that's unlikely. Both are big pieces of code and can very well have multiple bugs.

If you still have a crash in some case, unless you are really sure it is exactly the same problem as the original report, please open a separate ticket and explain which website you were navigating and what you did.

In general, it's always easier to close a ticket as duplicate than untangling a long stream of comments in a single ticket discussing different and possibly unrelated issues.
