Opened 9 years ago

Closed 5 years ago

#6738 closed bug (fixed)

app_server hangs?

Reported by: kirilla
Owned by: stippi
Priority: normal
Milestone: R1
Component: Servers/app_server
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking: #6929, #10258
Has a Patch: no
Platform: All

Description

From what I can tell, app_server occasionally hangs in Haiku hrev37150.

Mouse clicks stop doing anything meaningful. I can't raise or lower windows, can't shift focus to some other window or view. I -can- move the mouse around, on-screen, but after a while the pointer hangs too.

Voluntarily entering KDL to have a look-see reveals that on my quad core, three idle threads are running, and there's a w:offscreen something thread running, which belongs to app_server. A backtrace hints at Painter, agg and a chain of 10-20 calls on a recursive bezier function.

A locking issue in Painter, maybe? I think WebPositive is the app which triggers this. Will try to look out for a website that triggers it.

Sadly I have no means to provide serial output or snapshots right now.

Attachments (1)

syslog (14.1 KB) - added by jonas.kirilla 9 years ago.

Change History (17)

comment:1 by jonas.kirilla, 9 years ago

It happened again in hrev39121. WebPositive triggered it.

KDL showed a thread running somewhere in this code: http://haiku.it.su.se:8180/source/xref/src/libs/agg/src/agg_curves.cpp#388

recursing, with the level argument increasing in steps of 1, from 0 to 32.

There's a curve_recursion_limit = 32.

Perhaps the

if (level > curve_recursion_limit)

was meant to be

if (level >= curve_recursion_limit)

comment:2 by BMeow, 9 years ago

I can reliably reproduce this issue on real hardware and on Qemu by trying to view http://qt.gitorious.org/qt or any other Gitorious project site in Web+. Tested on hrev39121 gcc4 anyboot nightly.

comment:3 by axeld, 9 years ago

Owner: changed from axeld to stippi
Status: changed from new to assigned

comment:4 by jonas.kirilla, 9 years ago

This could perhaps shed some light on the matter: http://www.antigrain.com/research/adaptive_bezier/index.html

My stack traces show the eight arguments (of the four points) as all 0xff...ffe for maybe ten of the last runs of recursion.

comment:5 by jonas.kirilla, 9 years ago

Sorry, the hex value should be ff ff ff ff e0 00 00 00.

comment:6 by bonefish, 9 years ago (in reply to: description)

Replying to kirilla:

Sadly I have no means to provide serial output or snapshots right now.

When you leave KDL the session is written to the syslog after a few seconds (at least if the kernel is still working and the syslog daemon is still running). So the info should be available after reboot (or even in the same session via ssh, if that is still working).

No clue what double value that hex value represents (is it not shown?). I suppose a dump of the curve_div4 object would help, too, if anyone wants to try and understand what is happening exactly.

by jonas.kirilla, 9 years ago

Attachment: syslog added

comment:7 by jonas.kirilla, 9 years ago

Scratch that. As can be seen in the syslog excerpt, the hex value is 0xffffffe000000000. These are coordinates, in double format.

The webpage http://babbage.cs.qc.edu/IEEE-754/64bit.html suggests that this hexadecimal representation of a floating-point number is -NaN, not a number.

http://en.wikipedia.org/wiki/NaN
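
The bit pattern can also be checked locally instead of with an online converter. The following is a minimal sketch, assuming `double` is IEEE-754 binary64 on the platform; `double_from_bits` and `bits_are_nan` are hypothetical helper names, not part of Haiku or AGG.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 64-bit pattern as a double.
// Assumption: double is IEEE-754 binary64 (true on x86).
double double_from_bits(std::uint64_t bits)
{
	double value;
	static_assert(sizeof(value) == sizeof(bits), "double must be 64 bits");
	std::memcpy(&value, &bits, sizeof(value));
	return value;
}

// IEEE-754 defines NaN as: exponent field all ones (0x7ff)
// and a non-zero mantissa. The sign bit is ignored.
bool bits_are_nan(std::uint64_t bits)
{
	const std::uint64_t exponent = (bits >> 52) & 0x7ff;
	const std::uint64_t mantissa = bits & 0x000fffffffffffffULL;
	return exponent == 0x7ff && mantissa != 0;
}
```

For 0xffffffe000000000 the exponent field is 0x7ff and the mantissa is non-zero, so both `bits_are_nan(...)` and `std::isnan(double_from_bits(...))` report NaN. The sign bit is set, which is why the converter shows it as -NaN.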

comment:8 by bonefish, 9 years ago

What apparently happens:

  1. The client sends a drawing command (drawing a shape) with invalid parameters (e.g. NaN coordinates).
  2. The app server doesn't check for invalid values (or misses this case) and calls curve4_div() with invalid parameters.
  3. curve4_div::recursive_bezier() always recurses to the last level, causing 2^34 - 1 calls which should keep the CPU quite busy. I haven't checked, but possibly it also tries to add a few billion points to the object's point array, which would cause serious memory issues. But even if it doesn't, the CPU hogging alone (probably while holding some lock) could already make the app server appear to hang.
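
The call count in step 3 can be checked with a small model. This is only a sketch: `count_calls` is a hypothetical stand-in that assumes every call passing the guard recurses into both halves (as would presumably happen if NaN makes every flatness comparison false). It does not reproduce the actual AGG code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the recursion: each call that passes the guard
// makes two further calls at level + 1. Returns the total number of calls,
// including those that return immediately at the guard.
// inclusive_guard == false models "level >  limit",
// inclusive_guard == true  models "level >= limit".
std::uint64_t count_calls(unsigned level, unsigned limit, bool inclusive_guard)
{
	const bool stop = inclusive_guard ? level >= limit : level > limit;
	if (stop)
		return 1;
	return 1 + count_calls(level + 1, limit, inclusive_guard)
		+ count_calls(level + 1, limit, inclusive_guard);
}

// Closed form of the same count for a start at level 0: a full binary
// tree with `levels` levels has 2^levels - 1 nodes.
std::uint64_t closed_form(unsigned limit, bool inclusive_guard)
{
	const unsigned levels = inclusive_guard ? limit + 1 : limit + 2;
	assert(levels < 64);	// keep the shift below well defined
	return (std::uint64_t(1) << levels) - 1;
}
```

With a limit of 32 and the `>` guard, the closed form gives 2^34 - 1 calls. The `>=` guard suggested in comment:1 gives 2^33 - 1, which halves the work but would not by itself make the call return quickly.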

So the measures to be taken are:

  • Fix parameter checking in the app server.
  • Possibly add sanity limits to curve4_div::recursive_bezier().
  • Fix the client side (assuming that it is indeed the source of the bad values).
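
The first two measures amount to rejecting non-finite coordinates before they reach the curve subdivision. A minimal sketch of such a check, assuming a hypothetical `Point` struct and helper names (`is_finite_point`, `is_valid_cubic`) that are not taken from the app_server or AGG sources:

```cpp
#include <cmath>

// Hypothetical point type, for illustration only.
struct Point {
	double x;
	double y;
};

// True if neither coordinate is NaN or +/- infinity.
bool is_finite_point(const Point& p)
{
	return std::isfinite(p.x) && std::isfinite(p.y);
}

// True only if all four control points of a cubic Bezier are finite.
// A caller would skip (or reject) the drawing command when this is false.
bool is_valid_cubic(const Point& p1, const Point& p2,
	const Point& p3, const Point& p4)
{
	return is_finite_point(p1) && is_finite_point(p2)
		&& is_finite_point(p3) && is_finite_point(p4);
}
```

Checking at the point where client data enters the server keeps the invalid values out of every later stage, while an additional check next to the recursion would act as a second line of defence.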

comment:9 by jonas.kirilla, 9 years ago

What if it's a math issue?

I had a quick look at our math_test, but couldn't get it to build.

Running http://www.netlib.org/fp/ucbtest.tgz (port attempt: http://www.kirilla.com/tmp/ucb_haiku.diff) seems to suggest there are issues with Haiku (or my hardware?).

Maybe this could be of interest: http://netbsd-soc.sourceforge.net/projects/mathlib/

comment:10 by anevilyak, 9 years ago

Blocking: 6929 added

(In #6929) Duplicate of #6738.

comment:11 by anevilyak, 9 years ago

If helpful, #6929 has some sites that reliably reproduce this issue.

comment:12 by stippi, 9 years ago

I have tried to reproduce the issue with some debugging added to the BShape iteration function in Painter, but neither with the Firefox add-on page from #6929, nor with the Gitorious page for Qt can I reproduce the problem in QEMU. I will check it out on real hardware next. The revision I am running is hrev40374, built as a GCC4-based hybrid.

comment:13 by diver, 9 years ago

Can't reproduce #6929 either.

comment:14 by pulkomandy, 5 years ago

Currently reproducible by going to goodsearch.com using NetSurf 3.2.

comment:15 by pulkomandy, 5 years ago

Blocking: 10258 added

(In #10258) The test case in WebKit does not trigger the issue anymore, but there are other ways to reproduce this in #6738.

comment:16 by pulkomandy, 5 years ago

Resolution: fixed
Status: changed from assigned to closed

Fixed in hrev48056.