Opened 5 years ago
Closed 4 years ago
#15728 closed bug (not reproducible)
Crash on free in PainterAggInterface
Reported by: | humdinger | Owned by: | axeld |
---|---|---|---|
Priority: | blocker | Milestone: | R1/beta3 |
Component: | Servers/app_server | Version: | R1/Development |
Keywords: | Cc: | ttcoder | |
Blocked By: | Blocking: | #16246, #16367 | |
Platform: | All |
Description
This is hrev53888, 32bit (VESA)
Got this crash (Web+ may have something to do with it, it was loading a page):
thread 50918: w:50824:offscreen
state: Exception (Segment violation)
Frame IP Function Name
-----------------------------------------------
0x70799498 0x18b5024 BPrivate::processHeap::free(void*) + 0x64
[...]
Frame memory:
[0x70799470] .I...... ....k.. 9c 49 8e 01 04 00 00 00 20 16 ea 19 09 6b 8b 01
[0x70799480] .I....yp..yp.... 9c 49 8e 01 94 94 79 70 90 94 79 70 03 00 00 00
[0x70799490] ....;... 08 08 08 02 3b 00 00 00
0x707994c8 0x18b6ba5 free + 0xa9
0x70799500 0x1826187 operator delete [](void) + 0x1f
0x70799530 0x1bcc2aa _._19PainterAggInterface + 0x14e
0x70799570 0x1ba400b _._7Painter + 0x63
0x707995a0 0x1b96a75 _._13DrawingEngine + 0x49
0x70799720 0x1b941ef _RenderSource() + 0x3ff
0x707997b0 0x1b9286c AlphaMask::_Generate() + 0x80
0x70799820 0x1b925c6 AlphaMask::SetCanvasGeometry(IntPoint, IntRect) + 0x1c2
0x707998b0 0x1b588e2 ServerWindow::_UpdateDrawState(View*) + 0x102
0x70799bf0 0x1b4f745 ServerWindow::_DispatchViewMessage(int32, BPrivate::LinkReceiver&) + 0x2ebd
0x70799d20 0x1b4c7c9 ServerWindow::_DispatchMessage(int32, BPrivate::LinkReceiver&) + 0x1251
0x70799da0 0x1b5830e ServerWindow::_MessageLooper() + 0x256
0x70799dd0 0x1b2b0a6 MessageLooper::_message_thread(void*) + 0x26
0x70799df8 0x182fccb thread_entry + 0x27
00000000 0x600aa258 commpage_thread_exit + 0
Not sure the ticket's summary makes sense, please correct.
Attachments (5)
Change History (39)
by , 5 years ago
Attachment: | app_server-685-debug-19-02-2020-16-44-07.report added |
---|
comment:1 by , 5 years ago
Highly likely it's yet another heap corruption problem in app_server. We really could stand to investigate those...
comment:2 by , 5 years ago
Blocking: | 16246 added |
---|
follow-up: 4 comment:3 by , 4 years ago
Landing here by way of #16246. Just occurred to me in beta2+111, with Web+ starting to play from youtube. Edit: could not use the Ctrl-Alt-Del combo, even if held for several seconds; Alt-Sysreq-D took me to KDL though, allowing me to reboot.
comment:4 by , 4 years ago
Replying to ttcoder:
Landing here by way of #16246. Just occurred to me in beta2+111
Have you applied https://git.haiku-os.org/haiku/commit/?h=r1beta2&id=02b948fda7ac7463d57b2bbeda7913ef4f9c72cc?
comment:5 by , 4 years ago
Cc: | added |
---|
Great insight X512. I ran pkgman update to apply your patch (beta2/115) and now the video plays to the end, no crash at all.
Will report if I have a change of heart, but my working assumption now is that the bug is fixed in the latest commit of beta2 branch.
/me kinda hopes that X512 will get interested in the media_server some day and kick ass there too *g*
comment:6 by , 4 years ago
Unfortunately that patch is not sufficient to fix the crash entirely; it does seem to correct some kind of problem, but I got a crash with a very similar stacktrace yesterday. Once I manage to get internet back on this machine, I'll upload it...
by , 4 years ago
Attachment: | app_server-844-debug-05-07-2020-00-24-43.report added |
---|
comment:7 by , 4 years ago
Here's a crash I got yesterday, on (as you can see) hrev54390.
I read through the code again and I am pretty baffled as to how this is occurring. Perhaps the "shape" pointer is garbage as this is a UaF somehow? But then the ReleaseReference should have crashed. I also looked through all other users of AlphaMask and all of them appear to be doing ref-counting correctly, or using BReference...
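For readers unfamiliar with the pattern being audited here, the following is a toy model of acquire/release reference counting, written in Python rather than Haiku's C++. It is not the BReferenceable implementation; the class and function names are made up for illustration. It only shows why a single holder that forgets to acquire is enough to get an object freed while someone still uses it.

```python
class RefCounted:
    """Toy stand-in for a reference-counted object (hypothetical, not Haiku API)."""

    def __init__(self, name):
        self.name = name
        self.count = 1        # the creator holds the first reference
        self.freed = False

    def acquire(self):
        if self.freed:
            raise RuntimeError(f"use after free: {self.name}")
        self.count += 1

    def release(self):
        if self.freed:
            raise RuntimeError(f"double free: {self.name}")
        self.count -= 1
        if self.count == 0:
            self.freed = True  # in C++ this is where `delete` would run


def balanced():
    shape = RefCounted("shape")
    shape.acquire()            # a second holder (say, a mask) takes a reference
    shape.release()            # the creator drops its reference
    still_alive = not shape.freed
    shape.release()            # the second holder drops its reference: freed now
    return still_alive, shape.freed


def unbalanced():
    shape = RefCounted("shape")
    # Bug: the second holder keeps the pointer without calling acquire().
    shape.release()            # the creator drops its reference: freed too early
    try:
        shape.release()        # the second holder's release hits a freed object
    except RuntimeError as error:
        return str(error)
    return "no error"
```

In this model the bad release is caught immediately. In real C++ it would touch freed memory, and whether that crashes on the spot or silently corrupts the heap until a later free() depends on what the allocator has done with the block in the meantime.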
comment:8 by , 4 years ago
Blocking: | 16367 added |
---|
comment:9 by , 4 years ago
I'll report my experiences, in case it adds anything (I originally appended to #15178, which looked similar, but apparently is not.)
Since installing beta 2 (hrev54154-110) I've had -- I think -- three app-server crashes. Two were as above -- trying to play a Youtube video. The other was after I had just unmounted a BFS USB stick; it crashed when I pulled the stick out. No debug reports -- sorry.
As it was white-screen, but not KDL, I tried to resume, but all that happened was that when I did finally hit the off button and rebooted, all my open folders were in the first Workspace! The third time I went straight to the off button, and folders remained in place.
I haven't managed to update, so I can't report on 115, but Jason Dodd on the mailing list says it still happens to him after updating.
by , 4 years ago
Attachment: | app_server-564-debug-13-07-2020-03-40-57.report added |
---|
comment:10 by , 4 years ago
Had another crash, and this time got to save a report. Can't make sense of it myself, but I hope someone can. For a start the time doesn't make sense. I suppose it's UTC, but the crash was at ~7:15pm not at 40 minutes past the hour. And I'm pretty sure there were other windows open that don't appear. The WebPositive access was yesterday --on a previous boot!
Anyway, what happened this time was that I was playing an ogg file in another workspace. The music finished, so I wanted to go back to that workspace to close the player. I clicked in the Workspaces window and the crash was immediate.
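On the timestamp question, here is a small sketch of how the time embedded in a report name can be extracted and compared with the wall-clock time. It assumes the name ends in day-month-year-hour-minute-second, which fits the attachments on this ticket but is not confirmed anywhere in it; the function name is made up.

```python
import re
from datetime import datetime


def report_time(filename):
    """Extract the timestamp from a name like
    app_server-564-debug-13-07-2020-03-40-57.report, or return None."""
    match = re.search(r"(\d{2}-\d{2}-\d{4}-\d{2}-\d{2}-\d{2})\.report$", filename)
    if match is None:
        return None
    # Assumed field order: day-month-year-hour-minute-second.
    return datetime.strptime(match.group(1), "%d-%m-%Y-%H-%M-%S")


stamp = report_time("app_server-564-debug-13-07-2020-03-40-57.report")
# "~7:15pm" from the comment, assumed to be on the same calendar day.
observed = stamp.replace(hour=19, minute=15, second=0)
offset_minutes = (observed - stamp).total_seconds() / 60
```

Under these assumptions the gap is roughly 15 and a half hours, which is larger than any UTC offset in use, so a plain UTC-versus-local-time difference would not account for it.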
by , 4 years ago
Attachment: | app_server-561-debug-15-07-2020-05-21-32.report added |
---|
report from app-server crash when switching workspaces
comment:11 by , 4 years ago
I've attached one more report. Again, it was when trying to switch Workspaces, but this report seems to match the events better. The time is correct, and the stack shows that Workspaces was involved.
follow-ups: 13 14 comment:12 by , 4 years ago
This is a completely different crash in Desktop::SetFocusWindow. Have you looked if it was already reported?
comment:13 by , 4 years ago
comment:14 by , 4 years ago
Replying to diver:
This is a completely different crash in Desktop::SetFocusWindow. Have you looked if it was already reported?
Are you sure? There actually seem to be a number of tickets reporting app_server crashes that all relate to beta 2. I initially reported the crashes I'm getting in #15178, but Waddlesplash redirected me here. (I've had what appears to be the same crash caused by different actions -- changing workspaces, accessing Youtube, and maybe other things. I'm assuming one underlying cause.) ttcoder came here via #16246. #6484 sounds similar, but it was 10 years ago!
comment:15 by , 4 years ago
According to Waddlesplash this is where YouTube crashes should be reported (not #15178, which seemed appropriate to me), so I'll attach a report for another one. This time I was on the forum, reading https://discuss.haiku-os.org/t/outside-of-haiku-what-are-you-doing/9341/76 and followed the link to YouTube in that post. Others reported it unplayable, but I thought I'd try. It churned on loading for a minute or so, then crashed.
by , 4 years ago
Attachment: | app_server-561-debug-22-07-2020-23-12-58.report added |
---|
crash on trying to play YouTube video
comment:16 by , 4 years ago
Now, I'm really puzzled... I tried to make a "franken-rev", inserting an older app_server (53894) into Haiku R1/beta2, by way of an hpkg installed to /system/packages to override the "stock" app_server.
Everything went fine for a couple hours, and I was feeling very smug and happy <g>.. and then app_server crashed, with the usual "DWARF.." dump on a white screen.
I'm baffled here... When running hrev53894 proper, we never get any app_server crashes like that, so why would that same app_server start crashing when inserted into R1/beta2... Yet Another Heisenbug... <s>
comment:17 by , 4 years ago
Did you double-check that the app_server you created was actually being used? I.e. catattr SYS:PACKAGE /path/to/appserver
comment:18 by , 4 years ago
[Yeah, I know it's a delicate matter, with the need for the hpkg to contain the "system" flag, get the PackageInfo right etc., and thus I checked.] Yes, both before and after rebooting I checked the mtime, the file size (the app_server extracted from the old nightly is a few thousand bytes smaller), and the package attribute (now easily available in the third tab of the "Get Info" window, nice!). If that wasn't enough, I also tried running AutoCast and saw that my "fake transparency" hack is broken for the first time ever: there is now a blank white background instead of "see through", so that's additional proof (likely due to running the newer libbe.so with an older app_server, which breaks X512's transparency clean-ups). I'll check a third time next time I boot that partition to make double extra sure :-) but it's a foregone conclusion. The heap corruption bug might have been there several months before, and it only gets triggered when running in beta2 (could be a timing problem, a "race" behaving differently: beta2 feels faster and snappier than the nightlies, probably due to the Release build profile).
[Edit: anyway, after two hours of random testing, I went to my "staple" youtube test : "No Man's Sky Gameplay Trailer", in full-window mode (no YT interface), tried clicking around as it was playing happily, and finally app_server crashed a few seconds before the end]
Edit2: the crash got captured in syslog:
KERN: debug_server: Thread 19325 entered the debugger: General protection fault
KERN: stack trace, current PC 0x33815c0b60 _ZN8BPrivate11processHeap4freeEPv + 0x30:
KERN: (0x7faafe9e2ce0) 0x33815c1d02 free + 0x42
KERN: (0x7faafe9e2d00) 0x9cb1db77c _ZN10shape_dataD0Ev + 0x3c
KERN: (0x7faafe9e2d20) 0x49bf6383ae _ZN14BReferenceable16ReleaseReferenceEv + 0x1e
KERN: (0x7faafe9e2d40) 0x9cb234246 _ZN14ShapeAlphaMaskD1Ev + 0x26
..
KERN: (0x7faafe9e31a0) 0x9cb1fd5ee _ZN12ServerWindow14_MessageLooperEv + 0x23e
KERN: (0x7faafe9e3210) 0x9cb1df9ea _ZN13MessageLooper15_message_threadEPv + 0xa
KERN: (0x7faafe9e3220) 0x3381534f69 thread_entry + 0x19
KERN: <BEEP>
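Since the syslog prints mangled symbols, here is a rough Python sketch for pulling the frames out of such an excerpt and making the simplest names readable, which helps when comparing it against the traces in the attached reports. The regular expression is modelled only on the lines quoted above, and the decoder handles just the plain nested-name form (length-prefixed components plus destructors); it is not a real demangler, and all function names are made up.

```python
import re

# One frame per match: "(frame address) return address symbol + offset".
FRAME = re.compile(r"\((0x[0-9a-f]+)\)\s+(0x[0-9a-f]+)\s+(\S+) \+ (0x[0-9a-f]+)")


def split_nested_name(mangled):
    """Decode only the simple _ZN<len><name>...E form; None for anything else."""
    if not mangled.startswith("_ZN"):
        return None
    pos, parts = 3, []
    while pos < len(mangled) and mangled[pos] != "E":
        length = re.match(r"\d+", mangled[pos:])
        if length:
            start = pos + len(length.group())
            pos = start + int(length.group())
            parts.append(mangled[start:pos])
        elif parts and mangled[pos:pos + 2] in ("D0", "D1", "D2"):
            parts.append("~" + parts[-1])   # destructor variants
            pos += 2
        else:
            return None                     # templates, operators etc. not handled
    return "::".join(parts)


def frame_symbols(syslog_text):
    """Return (raw symbol, readable name or None) for each stack frame found."""
    return [(m.group(3), split_nested_name(m.group(3)))
            for m in FRAME.finditer(syslog_text)]


SAMPLE = """\
KERN: (0x7faafe9e2ce0) 0x33815c1d02 free + 0x42
KERN: (0x7faafe9e2d00) 0x9cb1db77c _ZN10shape_dataD0Ev + 0x3c
KERN: (0x7faafe9e2d20) 0x49bf6383ae _ZN14BReferenceable16ReleaseReferenceEv + 0x1e
KERN: (0x7faafe9e2d40) 0x9cb234246 _ZN14ShapeAlphaMaskD1Ev + 0x26
"""

readable = [name or raw for raw, name in frame_symbols(SAMPLE)]
```

Read this way, the excerpt shows the ShapeAlphaMask destructor releasing its reference to a shape_data object, whose own destructor then faults inside free().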
follow-up: 21 comment:19 by , 4 years ago
Confirming. Whoever looks into this might want to review the change-logs before 53894. It's at least the second time I see a station running hrev53894 with huge memory leaks; this one had 2.3 GB (over two gigabytes) in app_server after 14 days of use, and had to be rebooted after a hard "freeze":
STC:Thu Jul 30 18:48:18 2020 Memory usage has gone up from 2685423616 to 2954039296 bytes (86.1%).
STC:Thu Jul 30 18:48:18 2020 1) 2353147904 bytes used by /boot/system/servers/app_server (team 440)
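To put those numbers in proportion, a minimal Python sketch that parses the two log lines and works out app_server's share. The line format is taken only from the excerpt above (it comes from the reporter's own monitoring log, not from a Haiku tool documented here), and the function name is made up.

```python
import re

# Matches entries such as "1) 2353147904 bytes used by /path (team 440)".
USAGE = re.compile(r"(\d+)\) (\d+) bytes used by (\S+) \(team (\d+)\)")


def top_consumers(log_text):
    """Return (rank, bytes, path, team) for each 'bytes used by' entry."""
    return [(int(rank), int(size), path, int(team))
            for rank, size, path, team in USAGE.findall(log_text)]


LOG = (
    "STC:Thu Jul 30 18:48:18 2020 Memory usage has gone up from 2685423616 "
    "to 2954039296 bytes (86.1%).\n"
    "STC:Thu Jul 30 18:48:18 2020 1) 2353147904 bytes used by "
    "/boot/system/servers/app_server (team 440)\n"
)

rank, used, path, team = top_consumers(LOG)[0]
share = used / 2954039296   # fraction of the memory currently in use
gib = used / 2**30          # binary gigabytes
```

With these figures app_server accounts for roughly four fifths of the memory in use, about 2.19 GiB.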
Questions that come to mind:
- is that memory leak a separate issue, or did a single change-set cause both the non-released mem and the PainterAggInterface heap issue ?
- is that leaked heap, or leaked areas ? (note to self: SC reports don't mention the distinction, but my log does a "listarea heap" so try to collect that)
EDIT: there's been AlphaMask changes added shortly after beta1 (52295), e.g. a "new AlphaMask" in hrev52327, though that one seems to be properly matched with a "release reference" and not leaked.. And another "new AlphaMask" in hrev52326 which does not seem to be matched with a release reference
EDIT2: also, this ticket is listed as "normal" priority on "Unscheduled" milestone, which seems... A little under-handed.
comment:20 by , 4 years ago
This time I tried a frankenstein-rev with a beta1 app_server inserted into R1/beta2, but it locks up solid as the desktop appears. I could just drop into KDL (only from the laptop's built-in keyboard). Invoking syslog+tail I see no message related to an app_server crash or launch_daemon, the last few lines are intel_extreme tracing. And the syslog is not preserved after a reboot, despite waiting 30 seconds.
So there must be dependencies in libbe.so or elsewhere that are not satisfied, if going back that far (to beta1). Shame, I bet it would have solved the memory leak and crashing, sigh.
comment:21 by , 4 years ago
Replying to ttcoder:
EDIT2: also, this ticket is listed as "normal" priority on "Unscheduled" milestone, which seems... A little under-handed.
IMO all crashes and KDLs should be tagged as high priority and set to the next available milestone. It's not good QA practice at all to leave these tickets unscheduled.
Maybe a script could be written to automate this, e.g. words like "crash" and "KDL" can automatically set ticket attributes (if not set already).
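A rough Python sketch of what such a script could look like. This is only an illustration of the keyword idea: the ticket is modelled as a plain dictionary, the field names and the "high" priority value are assumptions, and nothing here talks to Trac.

```python
import re

# Words that suggest a crash report; "KDL's" and "crashes" also match.
KEYWORDS = re.compile(r"\b(crash(es|ed|ing)?|KDL)\b", re.IGNORECASE)


def suggest_attributes(ticket, next_milestone):
    """Return a copy of `ticket` with priority and milestone raised when the
    text mentions a crash and those fields still hold their default values."""
    text = f"{ticket.get('summary', '')} {ticket.get('description', '')}"
    updated = dict(ticket)
    if not KEYWORDS.search(text):
        return updated
    if updated.get("priority", "normal") == "normal":
        updated["priority"] = "high"
    if updated.get("milestone", "Unscheduled") == "Unscheduled":
        updated["milestone"] = next_milestone
    return updated


example = suggest_attributes(
    {"summary": "Crash on free in PainterAggInterface",
     "priority": "normal", "milestone": "Unscheduled"},
    next_milestone="R1/beta3",
)
```

Values that someone has already set by hand (for example a "blocker" priority) are left alone, which matches the "if not set already" condition above.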
comment:22 by , 4 years ago
Milestone: | Unscheduled → R1/beta3 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
comment:23 by , 4 years ago
Resolution: | fixed |
---|---|
Status: | closed → reopened |
comment:24 by , 4 years ago
So there must be dependencies in libbe.so or elsewhere that are not satisfied, if going back that far (to beta1). Shame, I bet it would have solved the memory leak and crashing, sigh.
Well, older versions not having the alpha mask code surely wouldn't hit a bug in the alpha mask code. But they wouldn't work, either, because the interface kit makes use of the feature. Makes sense, I guess...
Also, it doesn't help much to have mix-and-matched tests here, because they are hard to reproduce and quite likely to add more problems than they solve (I understand it could be useful for you to find a workable setup, still).
So, what we have so far:
- Crashes always in the memory allocation code, which hints to a heap corruption
- Crashes often in the AlphaMask code, which is mostly exercised by WebPositive
- In all reports attached, there are a lot of "some BLocker" in app_server, so the whole thing is under some stress (lots of open windows or open tabs in Web+). It doesn't seem to happen in a generally idle system.
It would be nice if we could reproduce this in a more predictable way, maybe with a test app stressing the use of alpha masks?
comment:26 by , 4 years ago
so the whole thing is under some stress (lots of open windows or open tabs in Web+). (...) It would be nice if we could reproduce this in a more predictable way, maybe with a test app stressing the use of alpha masks?
Do the youtube "thumbnails" involve alpha masks by any chance? I'm asking because... Today I had time to boot into beta2 (twice), so I gave it a go... Reproduced the crash twice in two attempts, hovering the mouse above (among other things) youtube thumbnails, during playback and especially after the video is done:
- booted into beta2 unmodified ("stock" app_server)
- immediately launched Web+
- Command-T to create a _second_ tab
- typed the URL for the full-window "no man's sky" trailer : http://www.youtube.com/embed/nLtmEjqzg7M
- let it play to the end (didn't find a way to reproduce the crash otherwise, maybe that's significant)
- let W+ recover after the end (always takes a while)
- click the circular arrow icon at bottom left, to restart playing.
- there's some sort of visual bug, where W+ displays a rotation "please wait" symbol (shaped like a round arrow) on a pitch black full-window background, but the symbol moves around the window quickly, as if it had a wild BView.Transform() call. That state remains for a good while, allowing time to do this:
- hover the mouse above the red/gray progress bar, left to right, then back left, to show a maximum of thumbnails from the video, hovering even more intensely than I did during playback.
- crash
Observation: at the debugger prompt that comes up, I typed "threads" to see what threads are running in app_server, and both times I saw a thread named "Reason: xxx":
- "w:908:offscreen ("Reason : _numblock >0")
- "w:xxx:offscreen ("Reason : Segment Violation")
EDIT: yup it got captured in the syslog (retrieved over from my beta1 partition) each time, e.g.:
KERN: 1000: DEBUGGER: _numBlocks > 0
KERN: debug_server: Thread 1000 entered the debugger: Debugger call: `_numBlocks > 0'
KERN: stack trace, current PC 0x1d7240d01c1 _kern_debugger + 0x9:
KERN: (0x7f30c3894cc0) 0x1d72415bd02 free + 0x42
KERN: (0x7f30c3894ce0) 0x118e7531921 _ZN7PainterD0Ev + 0x11
(..)
KERN: (0x7f30c3894d00) 0x118e7522480 _ZN13DrawingEngineD2Ev + 0x30
(..)
KERN: (0x7f30c3894f10) 0x118e751b81b _ZN9AlphaMask9_GenerateEv + 0x4b
(..)
So if I'm not completely misunderstanding this, it's possible that the sequence of events is 1) Web+ crashes, 2) the debugger attempts to open its normal BAlert, 3) that crashes the app_server, 4) the debugger is invoked a second time, but this time in "console" mode. Though I suppose that sequence does not make a huge difference compared to the previously assumed one.
Well maybe it does -- what if I configure "/boot/home/config/settings/system/debug_server/settings" with a "WebPositive : log" line, or some such ? Will try that next. I'll have less time to dedicate to this from now on though, so might leave it to others. We don't want to make the ticket unbearably long anyway. EDIT: I'm not set up for compiling Haiku ATM.
EDIT:
Also see #16489 for a possible reproducible case (not for everyone ?)
comment:27 by , 4 years ago
Well...
there's some sort of visual bug, where W+ displays a rotation "please wait" symbol (shaped like a round arrow) on a pitch black full-window background, but the symbol moves around the window quickly, as if it had a wild BView.Transform() call.
Yes, there is a known issue with transforms in our WebKit. Clearly it is not my domain of expertise: I wrote the code trying to use what I remember of math courses I took 10+ years ago, and can't get things right this way. However it's harmless, it just draws things at the wrong position.
1) Web+ crashes
I don't think it does. app_server crashes, and the two threads you listed are named w:908:offscreen and w:908:offscreen. These are the two threads that crashed: one because of a segment violation and the other because of heap corruption, apparently.
These "offscreen" threads are created when using a view that draws to an offscreen bitmap (which Web+ does a lot).
Alpha masks were added to app_server specifically for use in WebKit, and I think no other apps (besides a few test ones) are using them. I don't know specifically about youtube, but they are used in many places in WebKit for drawing, so it's quite likely that they are used there.
EDIT: I'm not set up for compiling Haiku ATM.
I have not hit this crash a single time so far, so I don't really know how I can help. But I guess I could try going more to youtube.
comment:28 by , 4 years ago
Priority: | normal → blocker |
---|
comment:29 by , 4 years ago
With the current version of youtube it is not possible anymore to play videos. Are there other websites that reproduce this problem?
comment:30 by , 4 years ago
Are there other websites that reproduce this problem?
Not that I know of...
comment:31 by , 4 years ago
In that case, shall we move this issue out of the beta3 release or close as not reproducible for now?
comment:32 by , 4 years ago
Considering this is another AlphaMask-related crash, it seems probable that it is or was related to #16489, which has a known cause thanks to the guarded heap, so perhaps that one should get more attention then.
comment:33 by , 4 years ago
#16489 is using current webkit builds and current nightlies, which use a completely different code path (new drawing modes implemented by KapiX). And we know that the same website used in #16489 does not crash when using current releases.
So, #16489 is a regression from beta2, while this ticket is about code that was already there in beta2. It is safe to conclude that they are unrelated (and we probably have, or had, multiple such problems in app_server).
Also, #16489 does not crash app_server anymore in current nightlies. It now runs out of ports.
comment:34 by , 4 years ago
Resolution: | → not reproducible |
---|---|
Status: | reopened → closed |
Closing as not reproducible because the original issue here is not reproducible due to Youtube not allowing video playback in WebPositive at all anymore.
Sorry to everyone else adding different and unrelated crash reports and trying weird mixes of different haiku revisions, but you make it impossible to follow what's going on. And to Haiku developers insisting that all tickets that involve webpositive or app_server are probably related: that's unlikely. Both are big pieces of code and can very well have multiple bugs.
If you still have a crash in some case, unless you are really sure it is exactly the same problem as the original report, please open a separate ticket and explain which website you were navigating and what you did.
In general, it's always easier to close a ticket as duplicate than untangling a long stream of comments in a single ticket discussing different and possibly unrelated issues.
complete debug report