Opened 18 years ago

Closed 18 years ago

Last modified 18 years ago

#428 closed bug (fixed)

Firefox crashing on load

Reported by: johndrinkwater Owned by: axeld
Priority: normal Milestone: R1
Component: - General Version:
Keywords: Cc: diver, thesuckiestemail@…, simontaylor1@…
Blocked By: Blocking:
Platform: All

Description

Starting Firefox from the script heralds:

  [Switching to team ./firefox-bin (132) thread firefox-bin (132)]
  0x017e1f9f in nsFtpState::ConvertFilespecToVMS ()
     from /boot/apps/firefox/./components/libnecko.so
  (gdb) sc
  #0  0x017e1f9f in nsFtpState::ConvertFilespecToVMS ()
     from /boot/apps/firefox/./components/libnecko.so
  (gdb)

Change History (34)

comment:1 by diver, 18 years ago

Cc: diver added

comment:2 by thesuckiestemail@…, 18 years ago

Cc: thesuckiestemail@… added

comment:3 by thesuckiestemail@…, 18 years ago

To run without the scripts:

  1. Create a lib folder in firefox-dir.
  2. Move the Firefox libs listed as 'NEEDED' by objdump -p firefox-bin to lib/ (the stubs should probably be moved too, but I think it might be unnecessary for Haiku).
  3. Do a mimeset -F firefox-bin (might not be needed).

I suspect that the packaging script for Firefox does stripping, but I haven't confirmed that, so there might be a problem.

comment:4 by thesuckiestemail@…, 18 years ago

All other libs are loaded with full path through 'load_add_on(const char *pathname)'

comment:5 by johndrinkwater, 18 years ago

Now trying with the binary. It enters the debugger again:

  [Switching to team /boot/apps/firefox/./firefox-bi (196) thread firefox-bin (196)]
  0x00217eb3 in nsToolkitProfileService::CreateProfile ()
     from /boot/apps/firefox/./lib/libxul.so
  (gdb)

comment:6 by thesuckiestemail@…, 18 years ago

Ok, http://lxr.mozilla.org/ and specifically http://lxr.mozilla.org/seamonkey/ is great for browsing the code..

comment:9 by thesuckiestemail@…, 18 years ago

For the first one we can activate logging, provided it's a debug build or logging was built in. Run this in the terminal we start Firefox from: export NSPR_LOG_MODULES=nsFTP:5 (as defined here: http://lxr.mozilla.org/seamonkey/source/netwerk/protocol/ftp/src/nsFtpProtocolHandler.cpp#103 )
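The value nsFTP:5 follows a module:level pattern, and several such entries can be given separated by commas. As a rough illustration of that format only, here is a minimal sketch that splits such a string into names and levels. It is not NSPR's own parser; parseLogModules is a hypothetical helper, the second module name in the usage note is made up, and giving entries without a level the value 1 is an assumption of this sketch.

```cpp
#include <cstdlib>
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: split "nsFTP:5,other:3" into {module -> level}.
// Entries without a ":level" part get level 1 (an assumption of this sketch).
std::map<std::string, int> parseLogModules(const std::string& spec)
{
    std::map<std::string, int> levels;
    std::istringstream stream(spec);
    std::string entry;
    while (std::getline(stream, entry, ',')) {
        if (entry.empty())
            continue;
        std::string::size_type colon = entry.find(':');
        if (colon == std::string::npos) {
            levels[entry] = 1;
        } else {
            // atoi yields 0 for a missing or non-numeric level.
            levels[entry.substr(0, colon)] = std::atoi(entry.c_str() + colon + 1);
        }
    }
    return levels;
}
```

Calling parseLogModules("nsFTP:5,other:3") gives a map with nsFTP at level 5 and the made-up module other at level 3.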

comment:10 by thesuckiestemail@…, 18 years ago

What is the message: seg-fault?

comment:11 by johndrinkwater, 18 years ago

For the most recent report: a seg-violation. No more output from bt, sadly.

I can no longer get Firefox to reproduce the nsFtpState::ConvertFilespecToVMS debug call. Thanks a lot for the pointers; if anything else crops up, I'll know where to look.

Could it be that Haiku is missing some dirs that Mozilla presumes exist?

comment:12 by thesuckiestemail@…, 18 years ago

I can't really see a reason why it crashes in those functions instead of objects those functions use, so not sure. Unless those objects themselves are invalid.

comment:13 by johndrinkwater, 18 years ago

Hmm. It sounds like it would be good for someone else to try Firefox under Haiku. I've encountered errors before due to low memory; I just hope this isn't related.

comment:14 by simontaylor1@…, 18 years ago

I've done a few tests with an old ff tree I have lying around.

Copying a working profile from your R5 disk should get around the CreateProfile crash.

Mine now crashes out in static_initialization_and_destruction_0 in libnecko. This is a function gcc creates to initialise the static members of the shared library when it is linked. My guess is it crashes due to some binary incompatibility between the R5 libs and the ones on Haiku (different size of a datatype perhaps?)
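To make the mechanism concrete, here is a small sketch of the kind of code that ends up in that generated function. All names (initLog, Tracer, gFirst, gSecond) are hypothetical and have nothing to do with libnecko; the point is only that objects with constructors at namespace scope are initialised by compiler-generated code that runs when the binary or shared library is loaded, before main().

```cpp
#include <string>
#include <vector>

// Records the order in which constructors run. A function-local static is
// used so the log itself does not depend on global initialisation order.
std::vector<std::string>& initLog()
{
    static std::vector<std::string> log;
    return log;
}

// Hypothetical class standing in for any type with a non-trivial constructor.
struct Tracer {
    explicit Tracer(const char* name) { initLog().push_back(name); }
};

// Dynamic initialisers at namespace scope: the compiler emits the calls to
// these constructors into a generated routine that runs at load time.
static Tracer gFirst("first");
static Tracer gSecond("second");
```

By the time main() starts, initLog() already holds "first" and "second" in that order, so a crash in any such constructor shows up inside the generated initialisation routine rather than in user-visible code.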

This is the simple HelloWorld Be sample app, which I also linked against libnecko: http://www.srcf.ucam.org/~sjt59/hwdbbg.zip - it works fine on R5, but crashes while loading on Haiku. (libnecko needs the other libs, but Haiku can load the app fine when it is linked only to the others, so the problem is definitely in libnecko itself.)

My hunch is that it's to do with the net libs, although copying libnet.so from R5 to the hwdbg/lib/ directory gave errors about missing symbols and didn't even attempt to launch the program.

I'm a bit stuck now; some help from anyone who knows more about shared libraries and stuff would be great.

comment:15 by simontaylor1@…, 18 years ago

Cc: simontaylor1@… added

comment:16 by thesuckiestemail@…, 18 years ago

That's above my head, unfortunately.

comment:17 by simontaylor1@…, 18 years ago

It's above mine too. I've tried some more investigation:

The tools I was looking for were the binutils things - nm and readelf. I got readelf to list all the "local objects" in libnecko (they seem to be the static things) - there are lots (1000s I guess). I went through a few of them that sounded interesting (the vast majority were just the IIDs used to refer to objects or factories) using LXR but they were nothing particularly special - static char[]s mainly or some other static structs containing basic types - nothing that I thought should cause a problem.

There was one that initialised a log variable by calling a function deep in nspr and perhaps caused pr initialisation, but I added a printf to the PR_NewLogModule function and it seems that actually works OK, and the crash is somewhere else.

Axel, could this be a memory thing caused by having so many static objects (I don't know where in memory static objects are created) - or would that give a KDL or something different to a normal segfault?

Is there any way of finding out exactly which part of static_initialization_and_destruction_0 causes the segfault from gdb?

The reason using the R5 libs didn't work is that R5 libnet.so needs libbe.so from R5, which is missing the memset_internal symbol that I think the R5 kernel magically patches at run-time or something. Whatever, it didn't work...ooh maybe I could try using the Haiku libs in R5 and seeing if that crashes...I'll report back on that later.

comment:18 by simontaylor1@…, 18 years ago

Latest "progress":

The app does seem to get through the static initialiser phase running on R5 with Haiku libraries - it crashes in the BApplication constructor, but that's not surprising with Haiku having a different protocol for app_server communication.

That points to some problem with the runtime_loader initialising static objects perhaps. There are some repeated objects in the same context in libnecko - perhaps they are declared static in a header file or something. I'm going to investigate that a bit now. Mac OS X seemed to have a problem with that from this thread I found: http://gcc.gnu.org/ml/gcc/2005-01/msg00505.html

comment:19 by simontaylor1@…, 18 years ago

I found one static variable initialised in a header: http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsIOService.h#72

I copied that into a header, made a .so with 2 different files that included it, and a simple app that linked to the so. That worked fine.

Now I really am pretty stumped, and need to work out how to get more info from gdb about where it crashes exactly. I'm not about to check all 4019 static objects in libnecko! Saying that, I may try checking the pointers as they are the most likely to be causing the crash in their constructors I reckon, assuming it's not just a memory/some other limit thing from there being so many symbols. Hopefully there won't be that many, I'll grep the list and see...

comment:20 by simontaylor1@…, 18 years ago

Yay, I've tracked down this crash. I expect there'll be more, so I've opened a new ticket for that specific issue and made this one depend on it. It's #490.

For some reason, calling gettimeofday (a libroot function) from a shared library causes a segfault. It was happening during static initialisation as there is a static member variable to hold the current number of seconds here (PR_NOW is a macro that expands to a function calling gettimeofday on BeOS): http://lxr.mozilla.org/seamonkey/source/netwerk/protocol/ftp/src/nsFtpConnectionThread.cpp#1519

My mozilla source is a little old so current ff would probably crash in a different place. The bug still occurs whenever the gettimeofday function is called from a shared library though - there's a simple test app that demonstrates that attached to #490.

That same bug is probably responsible for the crash in CreateProfile too - that uses the PR_NOW function to generate the random name for the new profile directory.
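The failing pattern can be sketched roughly as follows. This is neither the Mozilla source nor the test app attached to #490; nowUsecs and SessionClock are hypothetical names. It only shows how initialising a static member from a gettimeofday-based helper makes the call happen during static initialisation, which is why the crash appeared there.

```cpp
#include <sys/time.h>
#include <cstdint>

// Current wall-clock time in microseconds since the epoch, via gettimeofday.
int64_t nowUsecs()
{
    struct timeval tv;
    gettimeofday(&tv, nullptr);
    return static_cast<int64_t>(tv.tv_sec) * 1000000 + tv.tv_usec;
}

// Hypothetical class with a static member holding a time in seconds.
struct SessionClock {
    static int64_t sStartSeconds;
};

// This initialiser runs at load time, so a fault inside gettimeofday would
// surface in the compiler-generated static initialisation routine.
int64_t SessionClock::sStartSeconds = nowUsecs() / 1000000;
```

When code like this lives in a shared library, the gettimeofday call is made while the library is being loaded, before any of the application's own code has run.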

comment:21 by simontaylor1@…, 18 years ago

dependson: 490

comment:22 by thesuckiestemail@…, 18 years ago

I think I had a modified NSPR before that uses this inline:

  PR_IMPLEMENT(PRTime) PR_Now(void)
  {
      return (PRTime) real_time_clock_usecs();
  }
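The same idea can be sketched outside NSPR. The wrapper name currentTimeUsecs is hypothetical, and a C++11 compiler is assumed. On BeOS and Haiku it calls real_time_clock_usecs() from OS.h; on other systems it falls back to std::chrono purely so the sketch can be compiled elsewhere, which is an assumption of this example and not something NSPR does.

```cpp
#include <cstdint>

#if defined(__HAIKU__) || defined(__BEOS__)
#include <OS.h>      // declares real_time_clock_usecs()
#else
#include <chrono>    // fallback for other systems (assumption of this sketch)
#endif

// Hypothetical wrapper: wall-clock time in microseconds since the epoch,
// avoiding gettimeofday on BeOS and Haiku.
int64_t currentTimeUsecs()
{
#if defined(__HAIKU__) || defined(__BEOS__)
    return static_cast<int64_t>(real_time_clock_usecs());
#else
    using namespace std::chrono;
    return duration_cast<microseconds>(
        system_clock::now().time_since_epoch()).count();
#endif
}
```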

comment:23 by simontaylor1@…, 18 years ago

Good one Fredrik, that's a good workaround until #490 is fixed. It gets a bit further now, crashing out in Date() in libmozjs.so - that actually calls another macro, PRMJ_NOW, which also calls our good old friend gettimeofday. I'll change that to use real_time_clock_usecs too later, but I'm off out now for a bit.

http://lxr.mozilla.org/seamonkey/source/js/src/prmjtime.c#149 - line 203 is the gettimeofday call.

comment:24 by axeld, 18 years ago

Status: new → closed

comment:25 by axeld, 18 years ago

Resolution: fixed

comment:26 by axeld, 18 years ago

Fixed in hrev17142 (see #490). Firefox is now working, though appears to have some update problems.

comment:27 by stippi, 18 years ago

You guys rock! What a capable QA team Haiku has!

comment:28 by thesuckiestemail@…, 18 years ago

Nice work Simon. gettimeofday is slower in any case, so it should be changed in NSPR anyway.

comment:29 by simontaylor1@…, 18 years ago

Yeah, real_time_clock_usecs is definitely the way to go.

I quite like the whole QA side of things - as Haiku gets closer to completion it's an area where I intend to spend a bit of time.

My other intended contribution is a port of WebCore (preferred to KHTML simply for real-world web compatibility, as web developers are more likely to test in Safari than Konq), but I'll see how that goes when I get exams out of the way.

My first guess about the focus issues in ff is that it's something in the Firefox code. As I seem to remember, the message passing in moz is pretty horrible in trying to make the BeOS model fit the moz one, and it may well rely on some "undocumented assumptions" about BMessages.

Most of the moz window is one BView (iframes are separate ones too) apart from the little favicon. I've found clicking on or around there often corrects focus issues and means you can click links and stuff. Keyboard navigation seems to work fine all the time though.

Seems to me all the problems could be caused by messages not being forwarded to the correct places - it seems to happen with both mouse messages and invalidate ones, and probably others.

I'll look into it when I get back to uni with a decent net connection to update my ff source.

comment:30 by thesuckiestemail@…, 18 years ago

We don't do very much with BMessages. The internal event handling is currently bad, but should be equally bad under BeOS and Haiku (port-based). Focus and mouse handling, as well as the attempts to minimize event flow in the widget code, are much more likely to behave 'differently' under Haiku, I think. Also, the widget code tries to do a lot of optimization for just about everything to get a BeOS feel.

I plan to change the internal event code. My plan is actually to use the app's and windows' loopers instead of doing things on bare-bones ports, although this is not 'easy' to do. The first step, separating nsWindow into nsChildView, nsPopupWindow, nsWindow and nsWidget (base class), is well on its way. After that we can set views and popups to process events on the nsWindow's looper.

comment:31 by axeld, 18 years ago

When you're reading mouse messages and the like directly from the window's port, this is likely to fail under Haiku, as the app_server knows less about the window than it does under R5; the window is doing part of the work to get the message right.

comment:32 by thesuckiestemail@…, 18 years ago

Hmm, are you looking at the nsWindow code, or did you misunderstand my cryptic comment? :)

We receive the messages from BeOS the ordinary way. Most of the code is from the historic first attempt at a port, before the first Mozilla, so it's not very modern though.

The two classes based on BWindow and BView translate between BeOS's and Mozilla's handling here: http://lxr.mozilla.org/seamonkey/source/widget/src/beos/nsWindow.cpp#2734 Notice the ugly receiving and sending of it through a port to the event handler's thread, which is what I'm trying to get rid of. It all boils down to the rule that only the app and top-level windows (windows, dialogs) are allowed to have their own processing thread, although we have only one for everything. Views and popups are not allowed to have their own threads.

On the BeOS side: if I understand BeOS correctly, views use the window thread, so only popups would need special handling in a BLooper-based internal handling.

comment:33 by simontaylor1@…, 18 years ago

I like your idea about reworking the internal event handling, Fredrik; that will probably help things under Haiku. It's true that the difference in behaviour is obviously caused by something in Haiku; I meant it might be that moz does something in a very non-BeOS way and so may well be the only app affected. Sorry I haven't kept up on events on bezilla, I lost interest in all things BeOS for a while.

For anyone who is looking at the moz source for the first time, it took me a while to realise that nsWindow in moz is sort of equivalent to a BView (I guess taken from MS Windows, where an hwnd is a "window handle" to pretty much any widget) - it's not necessarily a top-level BWindow.

http://lxr.mozilla.org/seamonkey/source/widget/src/beos/nsToolkit.cpp#256 looks like messages are just dropped silently if the port is full...that'll be the first place I'll look. I guess the BeOS "workaround" is also not necessary for Haiku and might not help the situation. I should be able to do some investigation using my old source tree for a bit (tho I really should be revising!)
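To illustrate why a full port can make events vanish, here is a minimal sketch of a bounded queue that refuses new messages when it is full. It is not the Mozilla code linked above and does not use the BeOS port API; BoundedQueue and its methods are hypothetical. The difference from a silent drop is that the sender gets a return value and a drop counter to inspect.

```cpp
#include <cstddef>
#include <deque>

// Hypothetical stand-in for a fixed-capacity message port.
template <typename Message>
class BoundedQueue {
public:
    explicit BoundedQueue(std::size_t capacity)
        : fCapacity(capacity), fDropped(0) {}

    // Returns false (and counts the drop) instead of blocking when full.
    bool TryPost(const Message& message)
    {
        if (fMessages.size() >= fCapacity) {
            ++fDropped;
            return false;
        }
        fMessages.push_back(message);
        return true;
    }

    // Removes the oldest message; returns false if the queue is empty.
    bool TryRead(Message& out)
    {
        if (fMessages.empty())
            return false;
        out = fMessages.front();
        fMessages.pop_front();
        return true;
    }

    std::size_t Dropped() const { return fDropped; }

private:
    std::size_t fCapacity;
    std::size_t fDropped;
    std::deque<Message> fMessages;
};
```

If the caller ignores the return value of TryPost, a mouse or invalidate message posted to a full queue is simply lost, which would match the kind of focus and redraw symptoms described in this ticket.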

comment:34 by thesuckiestemail@…, 18 years ago

Well, nsWindow is, in current CVS, a combined base class and the class of top-level windows. There is a patch to split that too in the works. nsPopupWindow is for popups, and child views (the real BView equivalents) are in nsChildView.

This is a work in progress, so a lot of code is still loitering in nsWindow.

Here is the latest code: https://bugzilla.mozilla.org/show_bug.cgi?id=333843
