Opened 10 years ago

Closed 9 years ago

#4188 closed bug (invalid)

Kernel very unstable after heap changes in r32074

Reported by: stippi
Owned by: mmlr
Priority: critical
Milestone: R1
Component: System/Kernel
Version: R1/pre-alpha1
Keywords:
Cc:
Blocked By:
Blocking:
Has a Patch: no
Platform: All

Description (last modified by stippi)

With BFS and block cache tracing enabled, the kernel only survives a few seconds of file operations. In the first session after I updated to hrev32128 from a revision before the heap changes, Haiku crashed shortly after I invoked "svn up" in my tree within Haiku. I rebooted and made sure I had a clean build on my partition. In the session after that, I tried svn cleanup, which showed me corrupted parts of my tree. The second rm -rf some/part/of/the/source dumped me back into the kernel debugger, and the kernel crashed again in the BFS tracing when allocating a tracing entry (like the first time). The panic mentioned two NULL pointers, if that is helpful, but the crash seems to be very easy to reproduce anyway (I hope for you, too). Reverting just the hrev32074 changeset seems to have made the system stable again.

Running a hrev32128 GCC 4 based hybrid build.

Change History (17)

comment:1 Changed 10 years ago by stippi

Description: modified (diff)

comment:2 Changed 10 years ago by axeld

FWIW, I cannot reproduce this, and I copied and compiled quite a few things today, running the latest revision. And besides the problem I fixed in file_map.cpp, it did not crash yet :-)

comment:3 Changed 10 years ago by mmlr

I have had (and still have) BFS and block-cache tracing on during the whole heap rework. I haven't experienced anything in that direction so far. If anything, that would be GCC4 related, as I am running GCC2 only. Can you provide a stack trace in any case? Generally, though, allocation of tracing entries does not take place on the heap, but in the pre-allocated tracing buffer, so I don't really see how this could be failing. Of course, if the tracing entry itself allocates stuff on the heap, that would be something different.
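For illustration only (this is not from the ticket and not Haiku's tracing code): a minimal sketch of what "allocating from a pre-allocated buffer instead of the heap" can look like. All names here (`sketch_trace_buffer`, `sketch_alloc_entry`, the 4096-byte size) are hypothetical, and the wrap-around policy is a simplifying assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and sizes; a sketch, not Haiku's implementation. */
#define SKETCH_TRACE_BUFFER_SIZE 4096

typedef struct {
	uint8_t buffer[SKETCH_TRACE_BUFFER_SIZE];	/* reserved up front */
	size_t next;					/* offset of the next free byte */
} sketch_trace_buffer;

/* Hand out `size` bytes from the fixed buffer. When the end is reached,
   wrap to the start and reuse the space of the oldest entries.
   Returns NULL only if the request can never fit. No heap call is made. */
void*
sketch_alloc_entry(sketch_trace_buffer* trace, size_t size)
{
	if (size == 0 || size > SKETCH_TRACE_BUFFER_SIZE)
		return NULL;

	/* Round up to a multiple of 4 bytes to keep entries aligned. */
	size = (size + 3) & ~(size_t)3;

	if (trace->next + size > SKETCH_TRACE_BUFFER_SIZE)
		trace->next = 0;

	void* entry = trace->buffer + trace->next;
	trace->next += size;
	return entry;
}
```

In a scheme like this, a heap bug could only affect tracing if an entry's own constructor went on to allocate heap memory, which is the case the comment above singles out.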

comment:4 Changed 10 years ago by stippi

I am a bit reluctant to just turn it on again, since I just recovered from all the damage this caused. I had to make backups of all the changed folders and completely remove the entire Haiku svn and check it out again. Luckily, I committed all my changes before updating my system. Since I reverted the heap change, the system runs totally stable again. I can try to reproduce the problem in a VM, or maybe I can make a backup of my whole disk again, but since the crashes happened in BFS tracing code for me, the effects seem to be especially devastating. :-) Really sorry about not taking screenshots the first time around. The other day, svn crashed for me reproducibly after updating my install, in one of the syscalls. I could fix that by making a clean installation. So when this bug happened the first time, I thought it was a problem of a non-clean build, and didn't look closer at the bug. When it happened the second time, I just guessed at the revision that could have caused it and reverted just that. Only then did I find out about all the file system damage. So that's why I am a bit hesitant, even though I think I could reproduce this. :-)

comment:5 Changed 10 years ago by mmlr

I've just built a GCC4 Haiku with BFS and block-cache tracing on and tested it in a VM. It's already checked out quite a bit through SVN and I didn't see any problem so far. The only thing I could imagine would be that you've changed your tracing settings, but didn't rebuild BFS? Going to try on real hardware next.

comment:6 Changed 10 years ago by axeld

Just for the record: a tracing entry that allocates memory on the heap would be a pretty stupid bug (and a nice memory leak, too).

comment:7 Changed 10 years ago by mmlr

Ok, I've tested an installation of a current GCC4 build (with the mentioned tracing enabled) and it worked fine as far as I've tested it. I couldn't do too much due to it not being connected to the net though.

comment:8 Changed 10 years ago by stippi

I've checked whether my user_config_headers contain anything bogus, but they look fine to me. This is the diff of the only header I have there:

--- /ReiserFS Volume (18.6 GB)/home/stippi/haiku/haiku/build/config_headers/tracing_config.h	2008-10-26 21:48:18.000000000 +0100
+++ /ReiserFS Volume (18.6 GB)/home/stippi/haiku/haiku/build/user_config_headers/tracing_config.h	2009-06-23 15:50:08.000000000 +0200
@@ -5,12 +5,12 @@
 // enable tracing (0/1)
-#	define ENABLE_TRACING 0
+#	define ENABLE_TRACING 1
 // tracing buffer size (in bytes)
-#	define MAX_TRACE_SIZE (20 * 1024 * 1024)
+#	define MAX_TRACE_SIZE (300 * 1024 * 1024)
@@ -19,11 +19,11 @@
 // macros specifying the tracing level for individual components (0 is disabled)
 #define AHCI_PORT_TRACING						0
-#define ATA_TRACING								0
-#define ATAPI_TRACING							0
-#define BFS_TRACING								0
+#define ATA_TRACING								1
+#define ATAPI_TRACING							1
+#define BFS_TRACING								1
 #define BMESSAGE_TRACING						0
 #define KERNEL_HEAP_TRACING						0
 #define KTRACE_PRINTF_STACK_TRACE				0	/* stack trace depth */

I have not yet enabled the heap change and tried again.

comment:9 Changed 10 years ago by mmlr

I see you have the extreme block cache tracing on, the one dumping the complete blocks into the tracing buffer. I'll build an image with that on as well and see if this changes anything.

comment:10 Changed 10 years ago by stippi

I have turned tracing off entirely (leaving the tracing header where it is, since the build system currently does not detect removed build config headers), and so far I could svn up just fine. I am now running hrev32184, and will keep running it for a while to be sure. Maybe the problem is indeed only with the extreme BFS tracing? In any case, since the alpha will probably not have tracing on by default (at least it won't unless someone changes the default tracing config), this ticket could certainly be moved out of the alpha milestone. I'll leave that up to you guys, and after some time, I will retry with my previous BFS tracing options and see how that goes.

comment:11 Changed 10 years ago by mmlr

Ok, I've now built a clean image with the same block cache and bfs tracing levels. Tested it by doing a few minutes of svn checking out the Haiku tree and it worked just fine. This is all GCC2 though, as I can't test GCC4 builds right now. But from looking through the block cache tracing code I don't really see how it should be affected by heap changes. I'd suggest you try updating again and see if you can still reproduce the issue.

comment:12 Changed 10 years ago by mmlr

I think this was really caused by the tracing reattach. I've seen a similar crash once, and it was when allocating tracing entries after changing something and rebooting (so after reattaching). Maybe we should disable reattaching by default and add a safemode setting in the bootloader, since the feature is rarely used outside of tracking down specific bugs.

comment:13 Changed 10 years ago by axeld

Yes, that definitely makes sense. Maybe only show this option if the kernel (and boot loader) were compiled with tracing turned on.

comment:14 in reply to: 13 Changed 10 years ago by bonefish

The reattaching should be quite robust. Even if a randomly overwritten buffer is used, the checks introduced in hrev31999 should make sure that at least the entry list is completely valid. I.e. allocating entries should never be a problem -- if it crashes, there's obviously a bug which should be fixed. Printing old entries can fail (crash), since the entries themselves are not validated, but printing happens in the kernel debugger only on explicit request and should usually be caught by the fault handler anyway.
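For illustration only (not from the ticket, and not the actual checks from hrev31999): a minimal sketch of how the entry list of a recovered buffer could be validated before new entries are appended. The header layout and the names `sketch_entry_header` and `sketch_entry_list_is_valid` are assumptions made for this example.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical entry header; the real layout is not given in the ticket. */
typedef struct {
	uint16_t size;		/* entry size in bytes, header included */
	uint16_t previous_size;	/* size of the preceding entry, 0 for the first */
} sketch_entry_header;

/* Walk the first `used` bytes of a recovered buffer and check that every
   entry stays in bounds and that the size links between neighbours agree.
   Only the list structure is checked; entry payloads are not inspected. */
bool
sketch_entry_list_is_valid(const uint8_t* buffer, size_t used)
{
	size_t offset = 0;
	uint16_t previous = 0;

	while (offset < used) {
		sketch_entry_header header;
		if (used - offset < sizeof(header))
			return false;

		/* Copy the header out to avoid unaligned or aliased reads. */
		memcpy(&header, buffer + offset, sizeof(header));

		if (header.size < sizeof(header) || header.size > used - offset)
			return false;
		if (header.previous_size != previous)
			return false;

		previous = header.size;
		offset += header.size;
	}
	return true;
}
```

A check of this kind matches the distinction drawn above: a buffer that passes can safely take new entries, while printing old entries may still fail because their contents were never validated.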

So I don't think there's a good reason to disable it by default. I wouldn't mind a kernel tracing setting (compile time), but an opt-in boot menu setting seems quite unhandy, since when you would need the tracing buffer recovery, there's a good chance to forget to enter the boot menu (or miss the time window) and then your old tracing buffer is gone for good. For triple faults it's even worse.

comment:15 Changed 10 years ago by stippi

Milestone: R1/alpha1 → R1

Whatever the problem was, it's certainly not an alpha-blocker! I am currently running versions with not as much tracing, in case that was the problem, but I cannot reproduce it with those settings at least.

comment:16 Changed 9 years ago by mmlr

Should this one be closed?

comment:17 Changed 9 years ago by stippi

Resolution: invalid
Status: new → closed

Yes, definitely. :-)
