Opened 8 months ago

Last modified 3 months ago

#16489 assigned bug

App_server crash when browsing URL with WebPositive

Reported by: vidrep Owned by: PulkoMandy
Priority: normal Milestone: Unscheduled
Component: Servers/app_server Version: R1/beta2
Keywords: Cc: ttcoder
Blocked By: Blocking:
Platform: All

Description

hrev54507 x86_64 WebKit rebased HaikuWebKit 1.7.0 WebKit 610.1.26

Navigating on this "newly redesigned" website using WebKit rebased will crash app_server.

https://calgarysun.com/

Attachments (4)

app_server-674-debug-17-08-2020-22-21-21.report (42.0 KB ) - added by vidrep 8 months ago.
IMG_0283.JPG (1.2 MB ) - added by vidrep 8 months ago.
_tts_appsrv_launchscript-20.3-1-x86_64.hpkg (2.0 KB ) - added by ttcoder 7 months ago.
Run app_server with the guarded heap (with launch_daemon script tweak, no need to rebuild the system)
app_server-504-debug-08-09-2020-11-51-00.report (64.2 KB ) - added by ttcoder 7 months ago.
Yay, still reproducible with the guarded heap, and the report gives a couple more hints than usual


Change History (23)

by vidrep, 8 months ago

Attachment: IMG_0283.JPG added

comment:1 by vidrep, 8 months ago

Owner: changed from axeld to PulkoMandy
Status: new → assigned

comment:2 by ttcoder, 8 months ago

Is that reproducible? If yes, that might be just what the doctor ordered in #15728 :-). The backtrace is slightly different, but still heavily AlphaMasks related. Though maybe it's better to ask pulkomandy's permission before assigning the ticket to him...

comment:3 by waddlesplash, 8 months ago

It seems highly likely this is the same issue, yes; it appears to be due to heap corruption. Someone should probably run app_server or test_app_server under the guarded heap.

comment:4 by pulkomandy, 8 months ago

You can assign tickets to me, but I'm currently on vacation and don't have all my hardware set up to investigate things. So don't expect progress from me in that area. Also, I didn't write the Alpha Mask code so I'm not even particularly well qualified to debug these problems.

comment:5 by ttcoder, 8 months ago

@waddlesplash Is test_app_server runnable from Terminal, maybe even with Web+ as a client? If so, the user-space guarded heap sounds like an easier proposition than using kernel-debugger tools. Something like LD_PRELOAD=libroot_debug.so MALLOC_DEBUG=ges50 test_app_server might turn up the heap corruptions or use-after-frees with less fuss ("work smarter, not harder" :-)).

comment:6 by vidrep, 8 months ago

I had my desktop littered with debug reports generated by attempting to navigate that URL I posted. If anyone has a suggestion as to how I might get better data to debug the problem, let me know. PulkoMandy, I saw it was assigned to axeld, and since the trigger for the KDL was Web+, I assumed it might be in your purview. Enjoy your vacation.

comment:7 by mmlr, 8 months ago

waddlesplash was referring to the userland guarded heap, as the kernel one wouldn't be of any help. I tried that yesterday for a long time but was entirely unable to reproduce the issue on the mentioned site. I also cannot reproduce this on YouTube, but video playback is broken there for me as it claims the browser doesn't support the video format.

Making the app_server run under the guarded heap is not too complicated btw. You can add the environment variables to the launch definition in /system/data/launch/system by adding a block like this to the app_server entry:

app_server... {
   env {
       LD_PRELOAD libroot_debug.so
       MALLOC_DEBUG grs25
   }
   ...
}

Building an updated image or updating the haiku.hpkg with that will make app_server (and input_server, as that is started directly by app_server) run under the guarded heap. Note that the r flag above also disables memory reuse for maximum effect, but will also burn through RAM quickly. So you may need to change this to just gs25 if the crash cannot be reproduced before memory runs out. Once memory is used up in either case, the app_server will likely just hang and a hard reboot will be needed. This has a chance of filesystem corruption, so make sure everything important is backed up first.

comment:8 by ttcoder, 8 months ago

Rather than re-package, one may also use the non-packaged hierarchy. I did this:

cd /system/non-packaged/data/
mkdir launch
# The here-document must be closed with a matching EOF line.
cat > launch/system << EOF
service x-vnd.Haiku-app_server {
  launch /system/servers/app_server
  env {
    LD_PRELOAD libroot_debug.so
    MALLOC_DEBUG grs25
  }
}
EOF

This won't boot though. Even if I remove the "r" to allow memory re-use, Haiku remains stuck on the last ("rocket") icon. KDL can be invoked, and shows the four CPUs are running ide_thread, syslog_daemon (executing vm_something and pending_ici) etc. Invoking "teams" shows there are only 6 teams running, none of which is app_server. Any way to strip down "gs25" some more and still keep it useful ? I have 4 GB of RAM. (x64 of course).

comment:9 by ttcoder, 8 months ago

Cc: ttcoder added

in reply to:  8 comment:10 by mmlr, 8 months ago

Replying to ttcoder:

Rather than re-package, one may also use the non-packaged hierarchy. I did this:

This only works if you remove the original app_server entry from the packaged launch/system file. Having app_server in both locations results in a cyclic dependency in launch_daemon for the init target, presumably the app_server depending on itself due to it being present twice. Whether or not this is a bug or intended behaviour I have not investigated.

Launching app_server with the guarded heap doesn't add much overhead initially, so it's definitely not a problem of running out of memory. After a while, the consumption will add up of course, especially with memory reuse disabled.

by ttcoder, 7 months ago

Run app_server with the guarded heap (with launch_daemon script tweak, no need to rebuild the system)

comment:11 by ttcoder, 7 months ago

So I now override the data/launch/system file with an hpkg, that works.

With "r" to disable memory re-use, app_server quickly reached 1.97 GB memory usage and the machine locks up solid. Tried a second time, it locked up at almost the same ceiling (1.98 GB).

In both cases I had a hell of a time rebooting the machine (trying to invoke KDL, holding Ctrl-Alt-Del for several seconds, etc.).

Now I'm trying with memory re-use allowed (see attached hpkg) and I'm at a comfortable 220 MB memory usage. I'm going to try my luck with YouTube.

I find it interesting that Haiku applications would crash at the 2^31 bytes mark by the way. I thought such limits were gone when using the 64 bit variant of Haiku, and that the OS could make use of the whole range of physical memory (4 GB, 8 GB, whatever), not just as a collective, but also each app individually.

comment:12 by pulkomandy, 7 months ago

The whole story is a bit more complex. But basically, the thing is, the limit is gone, but the memory allocator isn't aware of it yet. So, with malloc() you can only get about 2GB. But you can use mmap or create_area and then do your own things there, and for that, the limit is removed.

We have at some point moved to rpmalloc which removes that limitation, but it turns out, it needs quite a lot of memory space (being designed for 64bit systems where this isn't a problem) and on 32bit it would run out of memory even earlier. So we have reverted that for now. We will be testing other allocators at some point (possibly the musl one) for the default setting.

But for debugging, we use yet another (intentionally simpler) allocator. We can maybe tweak its initial size reservation on 64bit?

by ttcoder, 7 months ago

Yay, still reproducible with the guarded heap, and the report gives a couple more hints than usual

comment:13 by ttcoder, 7 months ago

Thanks for the explanation @pulkomandy, so it's a limitation of the "hoard" allocator. No big deal. Switching to "memory re-use" (to slow down the creep towards the 2 GB mark) was successful and I could reproduce the crash that way anyway.

The guarded heap gives some additional hints, maybe someone can get lucky with them:

	thread 1220: w:936:offscreen 
		state: Call (thread 1220 tried accessing address 0x2f566000 which is a guard page (base: 0x2f565fc0, size: 54, alignment: 16, allocated by thread: 1220, freed by thread: -1))
...
		0x7f23d58410f0	0x75ba8483ae	void agg::render_scanlines<agg::rasterizer_scanline_aa_subpix<agg::rasterizer_sl_clip<agg::ras_conv_int> >, agg::scanline_p8_subpix, agg::renderer_scanline_subpix_solid<agg::renderer_region<PixelFormat> > >(agg::rasterizer_scanline_aa_subpix<agg::rasterizer_sl_clip<agg::ras_conv_int> >&, agg::scanline_p8_subpix&, agg::renderer_scanline_subpix_solid<agg::renderer_region<PixelFormat> >&) + 0x30e 

The syslog does not have much more:

KERN: user access on kernel area 0x35c4 at 0x000000002f566000
KERN: vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0x2f566000, ip 0x75ba8483b2, write 1, user 1, thread 0x4c4
KERN: 1220: DEBUGGER: thread 1220 tried accessing address 0x2f566000 which is a guard page (base: 0x2f565fc0, size: 54, alignment: 16, allocated by thread: 1220, freed by thread: -1)
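The numbers in the guard-page message can be checked by hand. Below is a minimal sketch in Python, assuming a 4096-byte page size and assuming the guarded heap places each allocation so that its alignment-rounded end touches the guard page. Neither assumption is stated in the report, and `describe_guard_hit` is a hypothetical helper written only for this check.

```python
PAGE_SIZE = 4096  # assumption: page size is not stated in the report


def describe_guard_hit(base, size, alignment, fault, page_size=PAGE_SIZE):
    """Relate a faulting address to the allocation named in the message."""
    # Round the requested size up to the next multiple of the alignment.
    aligned_size = (size + alignment - 1) // alignment * alignment
    return {
        "offset_from_base": fault - base,
        "bytes_past_end": fault - (base + size),
        "aligned_size": aligned_size,
        "fault_is_page_start": fault % page_size == 0,
        "allocation_abuts_guard": base + aligned_size == fault,
    }


# Values copied from the debug report and syslog above.
info = describe_guard_hit(base=0x2f565fc0, size=54, alignment=16,
                          fault=0x2f566000)
for key, value in info.items():
    print(f"{key}: {value}")

# The syslog prints the thread id in hex, the report in decimal.
print(0x4c4)  # 1220
```

With these values the faulting address is 64 bytes past the base, which is 10 bytes past the end of the 54-byte allocation and exactly the first byte of the following page. Under the stated assumptions this is consistent with a write running off the end of a small buffer rather than a stray pointer.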

comment:14 by pulkomandy, 7 months ago

Yes, so we can now see that the alpha mask code is apparently drawing outside its mask bitmap:

    thread 1220: w:936:offscreen
        state: Call (thread 1220 tried accessing address 0x2f566000 which is a guard page (base: 0x2f565fc0, size: 54, alignment: 16, allocated by thread: 1220, freed by thread: -1))

        Frame       IP          Function Name
        -----------------------------------------------
        00000000    0x5afbf44fcf    _kern_debugger + 0x7
            Disassembly:
                _kern_debugger:
                0x0000005afbf44fc8:   48c7c0e1000000  mov $0xe1, %rax
                0x0000005afbf44fcf:             0f05  syscall  <--

        0x7f23d5840c60  0x5afbfd176d    panic(char const*, ...) + 0xad
        0x7f23d5840cc0  0x5afbfd1de4    guarded_heap_segfault_handler(int, __siginfo_t*, void*) + 0x174
        0x7f23d5840cc0  0x7fdfa9d0c23b  commpage_signal_handler + 0x2b
        0x7f23d58410f0  0x75ba8483ae    void agg::render_scanlines<agg::rasterizer_scanline_aa_subpix<agg::rasterizer_sl_clip<agg::ras_conv_int> >, agg::scanline_p8_subpix, agg::renderer_scanline_subpix_solid<agg::renderer_region<PixelFormat> > >(agg::rasterizer_scanline_aa_subpix<agg::rasterizer_sl_clip<agg::ras_conv_int> >&, agg::scanline_p8_subpix&, agg::renderer_scanline_subpix_solid<agg::renderer_region<PixelFormat> >&) + 0x30e
        0x7f23d5841170  0x75ba85384c    BRect Painter::_FillPath<agg::conv_curve<agg::path_base<agg::vertex_block_storage<double, (unsigned int)8, (unsigned int)256> >, agg::curve3, agg::curve4> >(agg::conv_curve<agg::path_base<agg::vertex_block_storage<double, (unsigned int)8, (unsigned int)256> >, agg::curve3, agg::curve4>&) const + 0x32c
        0x7f23d58411a0  0x75ba83d0a3    Painter::DrawShape(int const&, unsigned int const*, int const&, BPoint const*, bool, BPoint const&, float) const + 0x73
        0x7f23d5841220  0x75ba829d64    DrawingEngine::DrawShape(BRect const&, int, unsigned int const*, int, BPoint const*, bool, BPoint const&, float) + 0x64
        0x7f23d5841270  0x75ba828538    ShapeAlphaMask::DrawVectors(Canvas*) + 0x98
        0x7f23d5841440  0x75ba8288f3    VectorAlphaMask<ShapeAlphaMask>::_RenderSource(IntRect const&) + 0x263
        0x7f23d58414d0  0x75ba827818    AlphaMask::_Generate() + 0x48
        0x7f23d5841550  0x75ba827cbf    _ZN9AlphaMask17SetCanvasGeometryE8IntPoint7IntRect.localalias.45 + 0x10f
        0x7f23d58415b0  0x75ba7eb342    ServerWindow::_UpdateDrawState(View*) + 0xc2
        0x7f23d5841710  0x75ba7f50cf    ServerWindow::_DispatchViewMessage(int, BPrivate::LinkReceiver&) + 0x272f
        0x7f23d58417d0  0x75ba7f585f    ServerWindow::_DispatchMessage(int, BPrivate::LinkReceiver&) + 0x34f
        0x7f23d5841840  0x75ba7f061b    ServerWindow::_MessageLooper() + 0x23b
        0x7f23d5841850  0x75ba7d2507    MessageLooper::_message_thread(void*) + 0x7
        0x7f23d5841870  0x5afbf43d77    thread_entry + 0x17
        00000000    0x7fdfa9d0c260  commpage_thread_exit + 0

If I remember correctly, the alpha mask code first computes the bounds of actual touched pixels, and then allocates a bitmap just large enough for that. Maybe the computation is incorrect in some case?

With the normal allocator this would corrupt memory (and fail a bit later), but now it is detected sooner, which is a lot more useful.

comment:15 by waddlesplash, 4 months ago

It appears that the picture bounding box player, used by the VectorAlphaMask to create the bitmap, does not support draw_picture or set_clipping_rects: https://github.com/haiku/haiku/blob/master/src/servers/app/PictureBoundingBoxPlayer.cpp#L441
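To illustrate the failure mode discussed in comments 14 and 15: if the pass that computes the bounds skips an operation that the render pass later draws, the bitmap ends up too small for what is drawn into it. The following is a toy Python model, not the app_server code. The op tuples, `bounds_pass`, `render_pass`, and `Mask` are all made up for illustration, and whether the unsupported ops are really the cause of this crash is not confirmed in this ticket.

```python
# Toy model: each op is a (kind, x, y) tuple that touches a single pixel.
IGNORED_BY_BOUNDS_PASS = {"draw_picture"}  # hypothetical unsupported op


def bounds_pass(ops):
    """Return inclusive (left, top, right, bottom) of the ops this pass handles."""
    pts = [(x, y) for kind, x, y in ops if kind not in IGNORED_BY_BOUNDS_PASS]
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return min(xs), min(ys), max(xs), max(ys)


class Mask:
    """Bitmap allocated just large enough for the computed bounds."""

    def __init__(self, bounds):
        self.left, self.top, right, bottom = bounds
        self.width = right - self.left + 1
        self.height = bottom - self.top + 1
        self.pixels = bytearray(self.width * self.height)

    def contains(self, x, y):
        return (0 <= x - self.left < self.width
                and 0 <= y - self.top < self.height)

    def set_pixel(self, x, y, alpha=255):
        self.pixels[(y - self.top) * self.width + (x - self.left)] = alpha


def render_pass(ops, mask):
    """Draw every op and return the ops that fell outside the mask."""
    out_of_bounds = []
    for op in ops:
        _, x, y = op
        if mask.contains(x, y):
            mask.set_pixel(x, y)
        else:
            # Without this check, native code would write past the buffer.
            out_of_bounds.append(op)
    return out_of_bounds


ops = [("fill_rect", 2, 2), ("fill_rect", 5, 4), ("draw_picture", 9, 9)]
mask = Mask(bounds_pass(ops))
missed = render_pass(ops, mask)
print(mask.width, mask.height, missed)  # 4 3 [('draw_picture', 9, 9)]
```

In this model the mask is sized 4 by 3 from the two ops the bounds pass understands, and the third op lands outside it. A renderer without the `contains` check would write beyond the allocation, which is the kind of access the guarded heap reports.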

comment:16 by pulkomandy, 3 months ago

I've been navigating this website for about 15 minutes and got no crash. Does this still happen for anyone?

comment:17 by vidrep, 3 months ago

I tried just now. First try froze my system. No mouse or keyboard. I had to do a hard reboot. Second try resulted in a Web+ crash. This was in the syslog:

KERN: 939: DEBUGGER: Could not create BWindow's receive port, used for interacting with the app_server!
KERN: _user_debugger(): Failed to install debugger. Message is: `Could not create BWindow's receive port, used for interacting with the app_server!'
KERN: thread_hit_debug_event(): Failed to create debug port: No more ports available

comment:18 by pulkomandy, 3 months ago

That's a different problem, creating too many offscreen bitmaps and running out of ports because each (view-accepting) bitmap needs a port. It's been there for a few years already. I hope the next WebKit release using the new app_server compositing code will improve the situation by reducing the number of temporary offscreens we need to create.

I was running in QEMU with a single CPU core and not that much memory (512 MB, which I then increased to 768 MB; in both cases it was eventually all used). I will try with more RAM to see if I can get it to run out of ports before it runs out of RAM...

comment:19 by humdinger, 3 months ago

FWIW, here (64bit, 16 GB RAM, Web+rebased-Dec-6-2020), app_server doesn't crash, but Web+ closes unceremoniously with this in the syslog:

KERN: 948: DEBUGGER: Could not create BWindow's receive port, used for interacting with the app_server!
KERN: _user_debugger(): Failed to install debugger. Message is: `Could not create BWindow's receive port, used for interacting with the app_server!'
KERN: thread_hit_debug_event(): Failed to create debug port: No more ports available