Opened 4 years ago

Closed 4 years ago

Last modified 3 years ago

#16489 closed bug (fixed)

App_server crash when browsing URL with WebPositive

Reported by: vidrep
Owned by: PulkoMandy
Priority: normal
Milestone: R1/beta3
Component: Servers/app_server
Version: R1/beta2
Keywords:
Cc: ttcoder
Blocked By:
Blocking: #16714
Platform: All

Description

hrev54507 x86_64 WebKit rebased HaikuWebKit 1.7.0 WebKit 610.1.26

Navigating on this "newly redesigned" website using WebKit rebased will crash app_server.

https://calgarysun.com/

Attachments (4)

app_server-674-debug-17-08-2020-22-21-21.report (42.0 KB ) - added by vidrep 4 years ago.
IMG_0283.JPG (1.2 MB ) - added by vidrep 4 years ago.
_tts_appsrv_launchscript-20.3-1-x86_64.hpkg (2.0 KB ) - added by ttcoder 4 years ago.
Run app_server with the guarded heap (with launch_daemon script tweak, no need to rebuild the system)
app_server-504-debug-08-09-2020-11-51-00.report (64.2 KB ) - added by ttcoder 4 years ago.
Yay, still reproducible with the guarded heap, and the report gives a couple more hints than usual


Change History (29)

by vidrep, 4 years ago

Attachment: IMG_0283.JPG added

comment:1 by vidrep, 4 years ago

Owner: changed from axeld to PulkoMandy
Status: new → assigned

comment:2 by ttcoder, 4 years ago

Is that reproducible? If yes, that might be just what the doctor ordered in #15728 :-). The backtrace is slightly different, but still heavily AlphaMasks related. Though maybe it's better to ask pulkomandy's permission before assigning the ticket to him...

comment:3 by waddlesplash, 4 years ago

It seems highly likely this is the same issue, yes; it appears to be due to heap corruption. Probably someone should run app_server or test_app_server under the guarded heap.

comment:4 by pulkomandy, 4 years ago

You can assign tickets to me, but I'm currently on vacation and don't have all my hardware set up to investigate things. So don't expect progress from me in that area. Also, I didn't write the Alpha Mask code, so I'm not even particularly well qualified to debug these problems.

comment:5 by ttcoder, 4 years ago

@waddlesplash Is test_app_server runnable from Terminal, maybe even with Web+ as a client? If so, the user-space guarded heap sounds like an easier proposition than using kernel-debugger tools. Something like LD_PRELOAD=libroot_debug.so MALLOC_DEBUG=ges50 test_app_server might turn up the heap corruptions or use-after-frees with less fuss ("work smarter, not harder" :-)

comment:6 by vidrep, 4 years ago

I had my desktop littered with debug reports generated by attempting to navigate that URL I posted. If anyone has a suggestion as to how I might get better data to debug the problem, let me know. PulkoMandy, I saw it was assigned to axeld, and since the trigger for the KDL was Web+, I assumed it might be in your purview. Enjoy your vacation.

comment:7 by mmlr, 4 years ago

waddlesplash was referring to the userland guarded heap, as the kernel one wouldn't be of any help. I tried that yesterday for a long time but was entirely unable to reproduce the issue on the mentioned site. I also cannot reproduce this on YouTube, but video playback is broken there for me, as it claims the browser doesn't support the video format.

Making the app_server run under the guarded heap is not too complicated btw. You can add the environment variables to the launch definition in /system/data/launch/system by adding a block like this to the app_server entry:

app_server... {
   env {
       LD_PRELOAD libroot_debug.so
       MALLOC_DEBUG grs25
   }
   ...
}

Building an updated image or updating the haiku.hpkg with that will make app_server (and input_server, as that is started directly by app_server) run under the guarded heap. Note that the r flag above also disables memory reuse for maximum effect, but will also burn through RAM quickly. So you may need to change this to just gs25 if it isn't quick enough to reproduce. Once memory is used up in either case, the app_server will likely just hang and a hard reboot will be needed. This has a chance of filesystem corruption, so make sure that nothing important is left without a backup.

comment:8 by ttcoder, 4 years ago

Rather than re-package, one may also use the non-packaged hierarchy. I did this:

cd /system/non-packaged/data/
mkdir launch
cat > launch/system << EOF
service x-vnd.Haiku-app_server {
  launch /system/servers/app_server
  env {
    LD_PRELOAD libroot_debug.so
    MALLOC_DEBUG grs25
  }
}
EOF

This won't boot though. Even if I remove the "r" to allow memory re-use, Haiku remains stuck on the last ("rocket") icon. KDL can be invoked, and shows the four CPUs are running ide_thread, syslog_daemon (executing vm_something and pending_ici) etc. Invoking "teams" shows there are only 6 teams running, none of which is app_server. Is there any way to strip down "gs25" some more and still keep it useful? I have 4 GB of RAM (x64 of course).

comment:9 by ttcoder, 4 years ago

Cc: ttcoder added

in reply to:  8 comment:10 by mmlr, 4 years ago

Replying to ttcoder:

Rather than re-package, one may also use the non-packaged hierarchy. I did this:

This only works if you remove the original app_server entry from the packaged launch/system file. Having app_server in both locations results in a cyclic dependency in launch_daemon for the init target, presumably the app_server depending on itself due to it being present twice. Whether or not this is a bug or intended behaviour I have not investigated.

Launching app_server with the guarded heap doesn't add much overhead initially, so it's definitely not a problem of running out of memory. After a while, the consumption will add up of course, especially with memory reuse disabled.

by ttcoder, 4 years ago

Attachment: _tts_appsrv_launchscript-20.3-1-x86_64.hpkg added

Run app_server with the guarded heap (with launch_daemon script tweak, no need to rebuild the system)

comment:11 by ttcoder, 4 years ago

So I now override the data/launch/system file with an hpkg, that works.

With "r" to disable memory re-use, app_server quickly reached 1.97 GB memory usage and the machine locked up solid. I tried a second time and it locked up at almost the same ceiling (1.98 GB).

In both cases I had a hell of a time rebooting the machine (trying to invoke KDL, holding Ctrl-Alt-Del for several seconds, etc.).

Now I'm trying with memory re-use allowed (see attached hpkg) and I'm at a comfortable 220 MB memory usage. I'm going to try my luck with YouTube.

I find it interesting that Haiku applications would crash at the 2^31 bytes mark, by the way. I thought such limits were gone when using the 64 bit variant of Haiku, and that the OS could make use of the whole range of physical memory (4 GB, 8 GB, whatever), not just collectively but also for each app individually.

comment:12 by pulkomandy, 4 years ago

The whole story is a bit more complex. But basically, the thing is, the limit is gone, but the memory allocator isn't aware of it yet. So, with malloc() you can only get about 2GB. But you can use mmap or create_area and then do your own things there, and for that, the limit is removed.
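The mmap route mentioned here can be sketched roughly as follows. This is only an illustration of "get a region from mmap and do your own thing in it": the `Arena` type and the `arena_*` functions are hypothetical names made up for this example, not anything from libroot or app_server, and a simple bump allocator stands in for whatever a real program would do with the region.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>

// Hypothetical bump allocator over a region obtained directly from mmap(),
// bypassing malloc() entirely.
struct Arena {
    std::uint8_t* base = nullptr;
    std::size_t capacity = 0;
    std::size_t used = 0;
};

// Reserve `capacity` bytes of anonymous memory; returns false on failure.
inline bool arena_init(Arena& arena, std::size_t capacity)
{
    void* memory = mmap(nullptr, capacity, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (memory == MAP_FAILED)
        return false;
    arena.base = static_cast<std::uint8_t*>(memory);
    arena.capacity = capacity;
    arena.used = 0;
    return true;
}

// Hand out `size` bytes aligned to `alignment` (a power of two),
// or nullptr once the region is exhausted.
inline void* arena_alloc(Arena& arena, std::size_t size,
    std::size_t alignment = 16)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    std::size_t start = (arena.used + alignment - 1) & ~(alignment - 1);
    if (start > arena.capacity || size > arena.capacity - start)
        return nullptr;
    arena.used = start + size;
    return arena.base + start;
}

// Return the whole region to the system and reset the bookkeeping.
inline void arena_destroy(Arena& arena)
{
    if (arena.base != nullptr)
        munmap(arena.base, arena.capacity);
    arena = Arena{};
}
```

A caller would run `arena_init(arena, size)` once, take memory with `arena_alloc(arena, n)`, and release everything at once with `arena_destroy(arena)`. Individual frees are deliberately left out to keep the sketch short.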

We have at some point moved to rpmalloc which removes that limitation, but it turns out, it needs quite a lot of memory space (being designed for 64bit systems where this isn't a problem) and on 32bit it would run out of memory even earlier. So we have reverted that for now. We will be testing other allocators at some point (possibly the musl one) for the default setting.

But for debugging, we use yet another (intentionally simpler) allocator. We can maybe tweak its initial size reservation on 64bit?

by ttcoder, 4 years ago

Attachment: app_server-504-debug-08-09-2020-11-51-00.report added

Yay, still reproducible with the guarded heap, and the report gives a couple more hints than usual

comment:13 by ttcoder, 4 years ago

Thanks for the explanation @pulkomandy, so it's a limitation of the "hoard" allocator. No big deal. Switching to "memory re-use" (to slow down the creep towards the 2 GB mark) was successful and I could reproduce the crash that way anyway.

The guarded heap gives some additional hints, maybe someone can get lucky with them:

	thread 1220: w:936:offscreen 
		state: Call (thread 1220 tried accessing address 0x2f566000 which is a guard page (base: 0x2f565fc0, size: 54, alignment: 16, allocated by thread: 1220, freed by thread: -1))
...
		0x7f23d58410f0	0x75ba8483ae	void agg::render_scanlines<agg::rasterizer_scanline_aa_subpix<agg::rasterizer_sl_clip<agg::ras_conv_int> >, agg::scanline_p8_subpix, agg::renderer_scanline_subpix_solid<agg::renderer_region<PixelFormat> > >(agg::rasterizer_scanline_aa_subpix<agg::rasterizer_sl_clip<agg::ras_conv_int> >&, agg::scanline_p8_subpix&, agg::renderer_scanline_subpix_solid<agg::renderer_region<PixelFormat> >&) + 0x30e 

The syslog does not have much more:

KERN: user access on kernel area 0x35c4 at 0x000000002f566000
KERN: vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0x2f566000, ip 0x75ba8483b2, write 1, user 1, thread 0x4c4
KERN: 1220: DEBUGGER: thread 1220 tried accessing address 0x2f566000 which is a guard page (base: 0x2f565fc0, size: 54, alignment: 16, allocated by thread: 1220, freed by thread: -1)

comment:14 by pulkomandy, 4 years ago

Yes, so we can now see that the alpha mask code is apparently drawing outside its mask bitmap:

    thread 1220: w:936:offscreen
        state: Call (thread 1220 tried accessing address 0x2f566000 which is a guard page (base: 0x2f565fc0, size: 54, alignment: 16, allocated by thread: 1220, freed by thread: -1))

        Frame       IP          Function Name
        -----------------------------------------------
        00000000    0x5afbf44fcf    _kern_debugger + 0x7
            Disassembly:
                _kern_debugger:
                0x0000005afbf44fc8:   48c7c0e1000000  mov $0xe1, %rax
                0x0000005afbf44fcf:             0f05  syscall  <--

        0x7f23d5840c60  0x5afbfd176d    panic(char const*, ...) + 0xad
        0x7f23d5840cc0  0x5afbfd1de4    guarded_heap_segfault_handler(int, __siginfo_t*, void*) + 0x174
        0x7f23d5840cc0  0x7fdfa9d0c23b  commpage_signal_handler + 0x2b
        0x7f23d58410f0  0x75ba8483ae    void agg::render_scanlines<agg::rasterizer_scanline_aa_subpix<agg::rasterizer_sl_clip<agg::ras_conv_int> >, agg::scanline_p8_subpix, agg::renderer_scanline_subpix_solid<agg::renderer_region<PixelFormat> > >(agg::rasterizer_scanline_aa_subpix<agg::rasterizer_sl_clip<agg::ras_conv_int> >&, agg::scanline_p8_subpix&, agg::renderer_scanline_subpix_solid<agg::renderer_region<PixelFormat> >&) + 0x30e
        0x7f23d5841170  0x75ba85384c    BRect Painter::_FillPath<agg::conv_curve<agg::path_base<agg::vertex_block_storage<double, (unsigned int)8, (unsigned int)256> >, agg::curve3, agg::curve4> >(agg::conv_curve<agg::path_base<agg::vertex_block_storage<double, (unsigned int)8, (unsigned int)256> >, agg::curve3, agg::curve4>&) const + 0x32c
        0x7f23d58411a0  0x75ba83d0a3    Painter::DrawShape(int const&, unsigned int const*, int const&, BPoint const*, bool, BPoint const&, float) const + 0x73
        0x7f23d5841220  0x75ba829d64    DrawingEngine::DrawShape(BRect const&, int, unsigned int const*, int, BPoint const*, bool, BPoint const&, float) + 0x64
        0x7f23d5841270  0x75ba828538    ShapeAlphaMask::DrawVectors(Canvas*) + 0x98
        0x7f23d5841440  0x75ba8288f3    VectorAlphaMask<ShapeAlphaMask>::_RenderSource(IntRect const&) + 0x263
        0x7f23d58414d0  0x75ba827818    AlphaMask::_Generate() + 0x48
        0x7f23d5841550  0x75ba827cbf    _ZN9AlphaMask17SetCanvasGeometryE8IntPoint7IntRect.localalias.45 + 0x10f
        0x7f23d58415b0  0x75ba7eb342    ServerWindow::_UpdateDrawState(View*) + 0xc2
        0x7f23d5841710  0x75ba7f50cf    ServerWindow::_DispatchViewMessage(int, BPrivate::LinkReceiver&) + 0x272f
        0x7f23d58417d0  0x75ba7f585f    ServerWindow::_DispatchMessage(int, BPrivate::LinkReceiver&) + 0x34f
        0x7f23d5841840  0x75ba7f061b    ServerWindow::_MessageLooper() + 0x23b
        0x7f23d5841850  0x75ba7d2507    MessageLooper::_message_thread(void*) + 0x7
        0x7f23d5841870  0x5afbf43d77    thread_entry + 0x17
        00000000    0x7fdfa9d0c260  commpage_thread_exit + 0

If I remember correctly, the alpha mask code first computes the bounds of actual touched pixels, and then allocates a bitmap just large enough for that. Maybe the computation is incorrect in some case?

With the normal allocator this would corrupt memory (and fail a bit later), but now it is detected sooner, which is a lot more useful.
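The numbers in the report are consistent with the allocation being pushed as close to the guard page as its alignment allows. The sketch below only redoes that arithmetic. It assumes this placement rule rather than quoting the libroot_debug implementation, and the helper names are made up for the example.

```cpp
#include <cstddef>
#include <cstdint>

// Highest address at or below (guardPage - size) that satisfies `alignment`
// (a power of two). Illustrative helper, not a libroot_debug function.
constexpr std::uintptr_t guarded_base(std::uintptr_t guardPage,
    std::size_t size, std::size_t alignment)
{
    return (guardPage - size)
        & ~(static_cast<std::uintptr_t>(alignment) - 1);
}

// Bytes between the end of the allocation and the guard page. Writes that
// stay inside this slack would not fault under the assumed placement.
constexpr std::size_t slack_before_guard(std::uintptr_t guardPage,
    std::size_t size, std::size_t alignment)
{
    return guardPage - (guarded_base(guardPage, size, alignment) + size);
}

// Values from the report: guard page 0x2f566000, size 54, alignment 16.
static_assert(guarded_base(0x2f566000, 54, 16) == 0x2f565fc0,
    "matches the base address in the report");
static_assert(slack_before_guard(0x2f566000, 54, 16) == 10,
    "ten bytes lie between the allocation end and the guard page");
```

Under that assumption, the faulting write at 0x2f566000 landed at least 10 bytes past the end of the 54-byte allocation, which fits the theory of a mask buffer that was sized too small rather than an off-by-one.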

comment:15 by waddlesplash, 4 years ago

It appears that the picture bounding box player, used by the VectorAlphaMask to create the bitmap, does not support draw_picture or set_clipping_rects: https://github.com/haiku/haiku/blob/master/src/servers/app/PictureBoundingBoxPlayer.cpp#L441
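To illustrate why an unsupported op in a bounds pass could matter, here is a toy model. It is not the PictureBoundingBoxPlayer code: `Rect`, `Op`, `OpKind` and `compute_bounds` are hypothetical names, and the assumption is simply that an op the bounds pass skips contributes nothing to the computed rectangle while the real drawing pass still executes it.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Rect { int left, top, right, bottom; };  // inclusive pixel bounds

inline Rect unite(const Rect& a, const Rect& b)
{
    assert(a.left <= a.right && a.top <= a.bottom);
    assert(b.left <= b.right && b.top <= b.bottom);
    return { std::min(a.left, b.left), std::min(a.top, b.top),
        std::max(a.right, b.right), std::max(a.bottom, b.bottom) };
}

inline bool contains(const Rect& outer, const Rect& inner)
{
    return outer.left <= inner.left && outer.top <= inner.top
        && outer.right >= inner.right && outer.bottom >= inner.bottom;
}

enum class OpKind { FillRect, DrawPicture };

struct Op {
    OpKind kind;
    Rect touched;  // pixels the op writes when it is actually drawn
};

// Bounds pass over a recorded picture. When `supportsDrawPicture` is false,
// DrawPicture ops are silently skipped, as an unimplemented callback would be.
inline Rect compute_bounds(const std::vector<Op>& ops, bool supportsDrawPicture)
{
    Rect bounds{0, 0, -1, -1};  // stays "empty" if nothing contributes
    bool any = false;
    for (const Op& op : ops) {
        if (op.kind == OpKind::DrawPicture && !supportsDrawPicture)
            continue;
        bounds = any ? unite(bounds, op.touched) : op.touched;
        any = true;
    }
    return bounds;
}
```

With a picture holding a `FillRect` over (0,0)-(9,9) and a `DrawPicture` touching (5,5)-(40,40), the pass that skips `DrawPicture` reports (0,0)-(9,9). A bitmap sized from that rectangle would not cover what the drawing pass later writes.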

comment:16 by pulkomandy, 4 years ago

I've been navigating this website for about 15 minutes and got no crash. Does this still happen for anyone?

comment:17 by vidrep, 4 years ago

I tried just now. First try froze my system. No mouse or keyboard. I had to do a hard reboot. Second try resulted in a Web+ crash. This was in the syslog:

KERN: 939: DEBUGGER: Could not create BWindow's receive port, used for interacting with the app_server!
KERN: _user_debugger(): Failed to install debugger. Message is: `Could not create BWindow's receive port, used for interacting with the app_server!'
KERN: thread_hit_debug_event(): Failed to create debug port: No more ports available

comment:18 by pulkomandy, 4 years ago

That's a different problem, creating too many offscreen bitmaps and running out of ports because each (view-accepting) bitmap needs a port. It's been there for a few years already. I hope the next WebKit release using the new app_server compositing code will improve the situation by reducing the number of temporary offscreens we need to create.

I was running in QEMU with a single CPU core and not that much memory (512 MB, which I then increased to 768 MB; in both cases it was eventually all used). I will try with more RAM to see if I can get it to run out of ports before it runs out of RAM...

comment:19 by humdinger, 4 years ago

FWIW, here (64bit, 16 GB RAM, Web+rebased-Dec-6-2020), app_server doesn't crash, but Web+ closes unceremoniously with this in the syslog:

KERN: 948: DEBUGGER: Could not create BWindow's receive port, used for interacting with the app_server!
KERN: _user_debugger(): Failed to install debugger. Message is: `Could not create BWindow's receive port, used for interacting with the app_server!'
KERN: thread_hit_debug_event(): Failed to create debug port: No more ports available

comment:20 by diver, 4 years ago

Now that YouTube is working again in Web+, I can reproduce it by playing some video for a minute. This is with app_server running under libroot_debug.so (without the MALLOC_DEBUG option though):

state: Call (someone wrote beyond small allocation at 0x1aa8513d680; 
             size: 104 bytes; allocated by 9827; value: 0x51a1b1c1c1d1e)

		Frame		IP			Function Name
		-----------------------------------------------
		00000000	0x1e4acfd370f	_kern_debugger + 0x7 
			Disassembly:
				_kern_debugger:
				0x000001e4acfd3708:   48c7c0e4000000  mov $0xe4, %rax
				0x000001e4acfd370f:             0f05  syscall  <--

		0x7fa5b39f6cd0	0x1e4ad05da8d	panic(char const*, ...) + 0xad 
		0x7fa5b39f6d30	0x1e4ad05f5e8	heap_free(heap_allocator_s*, void*) + 0x158 
		0x7fa5b39f6dc0	0x1e4ad05ff36	debug_heap_free(void*) + 0x26 
		0x7fa5b39f6de0	0x1f167bb6601	Painter::~Painter() + 0x101 
		0x7fa5b39f6e00	0x1f167bb668c	Painter::~Painter() + 0xc 
		0x7fa5b39f6e20	0x1f167ba101d	DrawingEngine::~DrawingEngine() + 0x2d 
		0x7fa5b39f6e40	0x1f167ba103c	DrawingEngine::~DrawingEngine() + 0xc 
		0x7fa5b39f7010	0x1f167b9faf6	VectorAlphaMask<ShapeAlphaMask>::_RenderSource(IntRect const&) + 0x296 
		0x7fa5b39f70a0	0x1f167b9e9e8	AlphaMask::_Generate() + 0x48 
		0x7fa5b39f7120	0x1f167b9ee8f	AlphaMask::SetCanvasGeometry(IntPoint, IntRect) [clone .localalias.50] + 0x10f 
		0x7fa5b39f7180	0x1f167b600f2	ServerWindow::_UpdateDrawState(View*) + 0xc2 
		0x7fa5b39f72e0	0x1f167b6a9ec	ServerWindow::_DispatchViewMessage(int, BPrivate::LinkReceiver&) + 0x273c 
		0x7fa5b39f73a0	0x1f167b6b1af	ServerWindow::_DispatchMessage(int, BPrivate::LinkReceiver&) + 0x34f 
		0x7fa5b39f7410	0x1f167b6541b	ServerWindow::_MessageLooper() + 0x23b 
		0x7fa5b39f7420	0x1f167b43397	MessageLooper::_message_thread(void*) + 0x7 
		0x7fa5b39f7440	0x1e4acfd2487	thread_entry + 0x17 
		00000000	0x7f904dbcc260	commpage_thread_exit + 0 

comment:22 by diver, 4 years ago

I've been testing this fix for the last 25 minutes and app_server doesn't crash! Many thanks! It would be a shame to release beta3 without this fix.

comment:23 by bitigchi, 4 years ago

Best fix of the year! :)

comment:24 by pulkomandy, 4 years ago

Milestone: Unscheduled → R1/beta3
Resolution: fixed
Status: assigned → closed

comment:25 by waddlesplash, 3 years ago

Blocking: 16714 added