#10279 closed bug (fixed)

random binary crashes, generally while compiling or doing other port work

Reported by:	kallisti5	Owned by:	nobody
Priority:	critical	Milestone:	R1/beta2
Component:	System/Kernel	Version:	R1/Development
Keywords:	vm	Cc:	ttcoder
Blocked By:		Blocking:	#12796
Platform:	All

Description (last modified by pulkomandy)

I've been seeing random application crashes from programs that generally don't crash (sh, python, etc.).

I don't think these are isolated issues: in each case, if I run the task again, it generally works. I've been collecting the debug reports and am attaching them to this ticket.

All of the crashes seem to be memory related in some way, and I've seen them across multiple installations of Haiku.

System:

x86_gcc2
hrev46452, hrev46472 and earlier revs.
AMD Bulldozer, 8 core, 16GB ram.

Seen on:

AMD FX-8320 - kallisti5
AMD FX-8350 - SeanCollins

Attachments (49)

bash-71019-debug-01-12-2013-02-53-55.report (3.9 KB ) - added by kallisti5 11 years ago.
cpp0-136338-debug-03-12-2013-18-14-34.report (6.8 KB ) - added by kallisti5 11 years ago.
python2.6-9561-debug-04-12-2013-17-01-39.report (72.1 KB ) - added by kallisti5 11 years ago.
make-162787-debug-01-12-2013-23-57-13.report (49.1 KB ) - added by bonefish 11 years ago.: debug report for make crash (gcc 4)
python2.6-26838-debug-09-12-2013-02-05-51.report (12.4 KB ) - added by kallisti5 11 years ago.: python, gcc2h hrev46510
python2.6-69264-debug-09-12-2013-04-50-35.report (14.1 KB ) - added by kallisti5 11 years ago.: python, gcc2h hrev46511
cpp0-98369-debug-09-12-2013-05-08-56.report (61.0 KB ) - added by kallisti5 11 years ago.: cpp0, gcc2h hrev46511
python2.6-57810-debug-10-12-2013-03-36-41.report (235.4 KB ) - added by kallisti5 11 years ago.: python test case 1 complete - report
python_kdl_bt.jpg (961.5 KB ) - added by kallisti5 11 years ago.: python test case 1 complete - kdl bt
python_kdl_incontext.jpg (1.2 MB ) - added by kallisti5 11 years ago.: python test case 1 complete - kdl in_context
python_kdl_mapping.jpg (1.2 MB ) - added by kallisti5 11 years ago.: python test case 1 complete - kdl mapping
python2.6-25934-debug-11-12-2013-02-51-26.report (187.7 KB ) - added by kallisti5 11 years ago.: python test case 2 complete - report
python2_kdl_bt.jpg (1.1 MB ) - added by kallisti5 11 years ago.: python test case 2 complete - kdl bt
python2_kdl_incontext.jpg (1.5 MB ) - added by kallisti5 11 years ago.: python test case 2 complete - kdl incontext
python2_kdl_mapping.jpg (1.2 MB ) - added by kallisti5 11 years ago.: python test case 2 complete - kdl mapping
cc1-8547-debug-11-12-2013-04-21-36.report (25.0 KB ) - added by kallisti5 11 years ago.: cc1 test case 1 complete - report
cc_kdl_output.txt (5.9 KB ) - added by kallisti5 11 years ago.: cc1 test case 1 complete - kdl
cc_kdl_output.2.txt (5.6 KB ) - added by kallisti5 11 years ago.: cc1 test case 1 complete - kdl
bash-1323-debug-12-12-2013-02-19-06.report (19.1 KB ) - added by kallisti5 11 years ago.: gcc test case 1 complete - report
bash_kdl.txt (3.8 KB ) - added by kallisti5 11 years ago.: gcc test case 1 complete - kdl
Terminal-741-debug-12-12-2013-05-59-49.report (8.7 KB ) - added by kallisti5 11 years ago.: terminal - report
terminal - kdl.txt (3.7 KB ) - added by kallisti5 11 years ago.: terminal - kdl
cpp0-5832-debug-16-12-2013-02-26-00.report (7.0 KB ) - added by kallisti5 11 years ago.
cpp0-kdl.txt (4.9 KB ) - added by kallisti5 11 years ago.
bash-52196-debug-16-12-2013-04-16-50.report (6.7 KB ) - added by kallisti5 11 years ago.
bash-kdl.txt (8.2 KB ) - added by kallisti5 11 years ago.
bash-7856-debug-16-12-2013-04-51-50.report (34.1 KB ) - added by kallisti5 11 years ago.
bash-7856-kdl.txt (11.7 KB ) - added by kallisti5 11 years ago.
bash-7459-debug-16-12-2013-21-15-21.report (14.8 KB ) - added by kallisti5 11 years ago.: bash - report - mapping of vm_page_fault
bash-7459-kdl.txt (20.6 KB ) - added by kallisti5 11 years ago.: bash - kdl - mapping of vm_page_fault
vm-page-fault-diff (854 bytes ) - added by kallisti5 11 years ago.
bash-14580-debug-17-12-2013-22-11-29.report (6.2 KB ) - added by kallisti5 11 years ago.
bash-7459-kdl.2.txt (20.6 KB ) - added by kallisti5 11 years ago.
bash-14580-kdl-special.txt (19.4 KB ) - added by kallisti5 11 years ago.
bash-24569-kdl.txt (17.9 KB ) - added by kallisti5 11 years ago.
bash-15133-kdl.txt (2.9 KB ) - added by kallisti5 11 years ago.
bash-6976-debug-18-12-2013-01-32-14.report (2.9 KB ) - added by kallisti5 11 years ago.
bash-6976-kdl.txt (3.6 KB ) - added by kallisti5 11 years ago.
sh-13815-kdl.txt (14.6 KB ) - added by kallisti5 11 years ago.
always-set-page-dir-on-context-switch.diff (1.1 KB ) - added by bonefish 11 years ago.
bash-6630-debug-25-12-2013-03-13-53.report (21.7 KB ) - added by kallisti5 11 years ago.
bash-6630-kdl.txt (13.5 KB ) - added by kallisti5 11 years ago.
IMG_20140118_103741.jpg (1.4 MB ) - added by kallisti5 11 years ago.: Not sure if this is related (if it isn't I can open a new bug) as of hrev46699 I get this kdl while compiling
bash-63838-debug-05-11-2014-06-10-19.report (8.5 KB ) - added by jessicah 10 years ago.: bash crashing on hrev48222
Terminal-238260-debug-15-11-2014-02-02-58.report (9.9 KB ) - added by jessicah 10 years ago.: Invalid opcode in Terminal, whilst running boost tests with -j4
Vision-369516-debug-06-06-2015-23-34-50.report (12.8 KB ) - added by jessicah 9 years ago.
syslog (125.8 KB ) - added by jessicah 9 years ago.: syslog in relation to Vision debug report
bash-873030-debug-07-06-2015-01-38-12.report (20.5 KB ) - added by jessicah 9 years ago.: And also bash from same session
python2.7-988-debug-06-04-2018-10-21-30.report (117.8 KB ) - added by nzimmermann 7 years ago.

Change History (152)

by kallisti5, 11 years ago

Attachment:	bash-71019-debug-01-12-2013-02-53-55.report added

by kallisti5, 11 years ago

Attachment:	cpp0-136338-debug-03-12-2013-18-14-34.report added

by kallisti5, 11 years ago

Attachment:	python2.6-9561-debug-04-12-2013-17-01-39.report added

follow-up: 2 comment:1 by bonefish, 11 years ago

Keywords:	vm added; random crash removed

The first two crashes (bash, cpp0) look quite similar. Both occur at the first instruction of a function, the one that pushes the old frame pointer on the stack. And in both cases the stack pointer looks like it refers to a valid address on the stack, so there shouldn't really be any cause for the crash. That points toward a VM issue.

Can't say anything wrt. the python debug report as it doesn't contain a disassembly (probably omitted due to the availability of debug info :-/).

FWIW, over the last few days, since I've been using large RAM disks, I've seen two crashes myself. The first time it was actually simultaneous crashes of 8 make processes. The disassembly in the debug report suggests that the executed code has been replaced. Given that the code is 'mmap()'ed from the file, it's probably some mix-up of pages, i.e. most certainly a VM (or file cache) issue. Unfortunately I haven't kept the debug report of the second crash (bash).

Given that I've seen two of those weird crashes within a few days, after not seeing anything like that for a long time, I suppose it has to do with my RAM disk use. I don't think the driver is to blame, though. It's really just a page consumer, so I consider it rather unlikely that it messes up other pages. More likely the cause is the considerable memory use it produces. Both crashes were observed after quite a bit of run time and memory turnover.

in reply to: 1 ; follow-up: 3 comment:2 by anevilyak, 11 years ago

Replying to bonefish:

Can't say anything wrt. the python debug report as it doesn't contain a disassembly (probably omitted due to the availability of debug info :-/).

That can be changed if desired; the assumption was simply that knowing what line of source the crash occurred at + the current state of the frame variables rendered including the disassembly/frame dump redundant.

in reply to: 2 ; follow-up: 4 comment:3 by bonefish, 11 years ago

Replying to anevilyak:

That can be changed if desired; the assumption was simply that knowing what line of source the crash occurred at + the current state of the frame variables rendered including the disassembly/frame dump redundant.

The disassembly/frame data can also be useful when debug info is available -- particularly when the source code might not be at hand or the issue isn't actually related to the program (like in this ticket). The source file and line number are missing in the debug report as well, BTW.

in reply to: 3 comment:4 by anevilyak, 11 years ago

Replying to bonefish:

The disassembly/frame data can also be useful when debug info is available -- particularly when the source code might not be at hand or the issue isn't actually related to the program (like in this ticket). The source file and line number are missing in the debug report as well, BTW.

So they are, thought I'd implemented that but seemingly not. Added to TODO list in any case.

by bonefish, 11 years ago

Attachment:	make-162787-debug-01-12-2013-23-57-13.report added

debug report for make crash (gcc 4)

by kallisti5, 11 years ago

Attachment:	python2.6-26838-debug-09-12-2013-02-05-51.report added

python, gcc2h hrev46510

by kallisti5, 11 years ago

Attachment:	python2.6-69264-debug-09-12-2013-04-50-35.report added

python, gcc2h hrev46511

by kallisti5, 11 years ago

Attachment:	cpp0-98369-debug-09-12-2013-05-08-56.report added

cpp0, gcc2h hrev46511

comment:5 by kallisti5, 11 years ago

doesn't seem PAE related. I set the 4GB limit in the loader, same crashing issues.

comment:6 by bonefish, 11 years ago

With the introduction of NX support PAE is always used when available. So that option no longer disables PAE, I'm afraid.

follow-up: 8 comment:7 by bonefish, 11 years ago

The first of the recent python debug reports shows a segment violation while executing the instruction mov %esp, %ebp, which is obviously not possible. Might be a Debugger/debugging API glitch or there's actually something else going on. A kernel stack trace for the crashing thread would be helpful and a mapping <address> <team> for the fault address.

in reply to: 7 ; follow-up: 10 comment:8 by kallisti5, 11 years ago

Replying to bonefish:

With the introduction of NX support PAE is always used when available. So that option no longer disables PAE, I'm afraid.

Oh, that sucks. Any way it could be fixed? Otherwise, we should remove the loader option or comment it out.

Replying to bonefish:

The first of the recent python debug reports shows a segment violation while executing the instruction mov %esp, %ebp, which is obviously not possible. Might be a Debugger/debugging API glitch or there's actually something else going on. A kernel stack trace for the crashing thread would be helpful and a mapping <address> <team> for the fault address.

Will do, I've even seen the app_server crashing recently. So after the program crashes I should enter KDL land and do a backtrace?

comment:9 by kallisti5, 11 years ago

Milestone:	Unscheduled → R1/beta1
Priority:	normal → high

in reply to: 8 comment:10 by bonefish, 11 years ago

Replying to kallisti5:

Oh, that sucks. Any way it could be fixed? Otherwise, we should remove the loader option or comment it out.

It can certainly be fixed. Since PAE is machine specific and postdates our oldest target CPU (Pentium), there's a check somewhere that causes either PAE paging or 32 bit paging to be used. The boot loader setting would just have to be propagated to that point.

Will do, I've even seen the app_server crashing recently. So after the program crashes I should enter KDL land and do a backtrace?

I would enter Debugger, save the debug log manually (via menu), store the thread ID in short-term memory, enter KDL, do a bt <thread ID>. Also nice: in_context <thread ID> dis -b 10 (gets a disassembly for the thread's current instruction pointer starting 10 instructions before) and, in case of the page fault, a mapping <address> <thread ID>. The page fault address can be found in the syslog or inferred from the thread's current instruction and the register state.

by kallisti5, 11 years ago

Attachment:	python2.6-57810-debug-10-12-2013-03-36-41.report added

python test case 1 complete - report

by kallisti5, 11 years ago

Attachment:	python_kdl_bt.jpg added

python test case 1 complete - kdl bt

by kallisti5, 11 years ago

Attachment:	python_kdl_incontext.jpg added

python test case 1 complete - kdl in_context

by kallisti5, 11 years ago

Attachment:	python_kdl_mapping.jpg added

python test case 1 complete - kdl mapping

comment:11 by bonefish, 11 years ago

Oh my, I messed up the mapping command. Fixed in hrev46518. I also made the 4GB limit safemode option disable PAE.

follow-up: 13 comment:12 by kallisti5, 11 years ago

I disabled PAE, same crashes. mapping also seems still broken.

by kallisti5, 11 years ago

Attachment:	python2.6-25934-debug-11-12-2013-02-51-26.report added

python test case 2 complete - report

by kallisti5, 11 years ago

Attachment:	python2_kdl_bt.jpg added

python test case 2 complete - kdl bt

by kallisti5, 11 years ago

Attachment:	python2_kdl_incontext.jpg added

python test case 2 complete - kdl incontext

by kallisti5, 11 years ago

Attachment:	python2_kdl_mapping.jpg added

python test case 2 complete - kdl mapping

in reply to: 12 ; follow-up: 14 comment:13 by bonefish, 11 years ago

Replying to kallisti5:

mapping also seems still broken.

Works fine here (retested with hrev46523, both gcc 2 and gcc 4 builds). Could something have gone wrong with the kernel or package rebuild?

BTW, mapping isn't fully implemented for 32 bit (i.e. non-PAE) paging. It does the argument checks, but simply prints nothing afterward.

BTW 2, when leaving KDL the session is, after a few seconds delay, written to the syslog. Filtering the syslog through the Terminal (cat + copy'n'paste) to get rid of the control and escape sequences would be appreciated.

in reply to: 13 comment:14 by kallisti5, 11 years ago

Replying to bonefish:

Replying to kallisti5:

mapping also seems still broken.

Works fine here (retested with hrev46523, both gcc 2 and gcc 4 builds). Could something have gone wrong with the kernel or package rebuild?

Not sure, I did confirm in several places the new hrev as I saw the help text still looked like the old text.

Going to reboot, enter KDL, and see whether the help text for mapping still shows the old team ID instead of the thread ID.

BTW, mapping isn't fully implemented for 32 bit (i.e. non-PAE) paging. It does the argument checks, but simply prints nothing afterward.

Will reboot to PAE.

BTW 2, when leaving KDL the session is, after a few seconds delay, written to the syslog. Filtering the syslog through the Terminal (cat + copy'n'paste) to get rid of the control and escape sequences would be appreciated.

haha, you're right. that would be a lot better vs screenshots :D I'm used to fatal KDL's and not triggered ones.

by kallisti5, 11 years ago

Attachment:	cc1-8547-debug-11-12-2013-04-21-36.report added

cc1 test case 1 complete - report

by kallisti5, 11 years ago

Attachment:	cc_kdl_output.txt added

cc1 test case 1 complete - kdl

by kallisti5, 11 years ago

Attachment:	cc_kdl_output.2.txt added

cc1 test case 1 complete - kdl

comment:15 by kallisti5, 11 years ago

edited cc_kdl_output.2.txt to remove control chars and clear up the commands I ran.

comment:16 by bonefish, 11 years ago

This example is unfortunately not so interesting. The crashing instruction is movsxb 0x2(%edx), %eax with edx being 0. This value came from the stack (-0x18(%ebp), i.e. an argument to the function) and indeed the stack is 0 at that address. I suppose something went wrong here, too, but at least it cannot directly be linked to the VM issue the other crashes indicate.

by kallisti5, 11 years ago

Attachment:	bash-1323-debug-12-12-2013-02-19-06.report added

gcc test case 1 complete - report

by kallisti5, 11 years ago

Attachment:	bash_kdl.txt added

gcc test case 1 complete - kdl

by kallisti5, 11 years ago

Attachment:	Terminal-741-debug-12-12-2013-05-59-49.report added

terminal - report

by kallisti5, 11 years ago

Attachment:	terminal - kdl.txt added

terminal - kdl

comment:17 by bonefish, 11 years ago

For the bash crash a mapping for the fault address (i.e. the one that was accessed) rather than (or in addition to) the address of the faulting instruction would have been interesting.

The info from the Terminal is a bit inconclusive. Debugger reads the following code from the executable:

0x01fe6ce4:               55  push %ebp
0x01fe6ce5:             89e5  mov %esp, %ebp <--

The kernel's disassembly shows:

0x01fe6ce6:             e583      in $0x83, %eax
0x01fe6ce8:               ec      in %dx, %al

The current instruction pointer is not aligned with the actual instruction (it points to the second byte of the mov). It is also possible that what is actually mapped is not the data from the executable. The overlap between the two disassemblies is only one (matching) byte, so that doesn't verify that possibility, but doesn't refute it either.

Debugger wasn't able to get a stack trace; the kernel shows BLooper::DispatchMessage() as the caller of the crashing TermView::_UpdateSIGWINCH(). The latter's frame wasn't set up yet, so its actual caller, TermView::MessageReceived(), would be missing from the stack trace. So that doesn't tell us anything either.
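The one-byte overlap between the two disassemblies above can be checked mechanically. The following is only an illustrative sketch: the `listing` type and the `count_overlap` helper are hypothetical names, and the byte values are copied from the two listings in this comment. It counts how many addresses both listings cover and how many of those bytes agree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A run of code bytes starting at a given address (hypothetical helper type). */
struct listing {
    uint32_t start;
    const uint8_t *bytes;
    size_t length;
};

/* Bytes Debugger read from the executable: push %ebp; mov %esp, %ebp. */
static const uint8_t debugger_bytes[] = { 0x55, 0x89, 0xe5 };
/* Bytes the kernel disassembled: in $0x83, %eax; in %dx, %al. */
static const uint8_t kernel_bytes[] = { 0xe5, 0x83, 0xec };

static const struct listing debugger_view = { 0x01fe6ce4u, debugger_bytes, 3 };
static const struct listing kernel_view = { 0x01fe6ce6u, kernel_bytes, 3 };

/* Returns the number of addresses present in both listings and stores in
 * *matching how many of those addresses hold the same byte in both. */
static size_t count_overlap(const struct listing *a, const struct listing *b,
                            size_t *matching)
{
    size_t shared = 0;

    assert(a != NULL && b != NULL && matching != NULL);
    *matching = 0;
    for (size_t i = 0; i < a->length; i++) {
        uint32_t address = a->start + (uint32_t)i;
        if (address < b->start || address - b->start >= b->length)
            continue; /* this address is not covered by listing b */
        shared++;
        if (a->bytes[i] == b->bytes[address - b->start])
            (*matching)++;
    }
    return shared;
}
```

Calling `count_overlap(&debugger_view, &kernel_view, &matching)` covers only address 0x01fe6ce6, which is why the comparison can neither confirm nor rule out a wrong mapping.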

by kallisti5, 11 years ago

Attachment:	cpp0-5832-debug-16-12-2013-02-26-00.report added

by kallisti5, 11 years ago

Attachment:	cpp0-kdl.txt added

by kallisti5, 11 years ago

Attachment:	bash-52196-debug-16-12-2013-04-16-50.report added

by kallisti5, 11 years ago

Attachment:	bash-kdl.txt added

by kallisti5, 11 years ago

Attachment:	bash-7856-debug-16-12-2013-04-51-50.report added

by kallisti5, 11 years ago

Attachment:	bash-7856-kdl.txt added

comment:18 by kallisti5, 11 years ago

I have noticed something interesting.. the following things generally crash while compiling:

gcc
cc0
python
sh / bash
app_server

The app_server crashes are pretty fatal, so I haven't gotten much info from them.

comment:19 by kallisti5, 11 years ago

just booted using the safe video mode to eliminate radeon_hd as a cause, still crashes.

At the moment running top from the command line seems to crash every time. EDIT: Nevermind about top, unrelated

Interesting side note: it seems like only things run from the command line crash. I've never seen WebPositive or any other application that isn't spawned from the command line crash in this way.

Last edited 11 years ago by kallisti5

follow-up: 21 comment:20 by bonefish, 11 years ago

In all three of the latest crashes an access to the stack caused the page fault. As written before, it would be good to get the mapping output for the fault address, not (only) for the IP. If you can't deduce the fault address from the disassembly, please look it up in the syslog before entering KDL -- there should be two "vm_page_fault:..." lines for the crash which both contain the fault address. Possibly rounded down to the page address, which doesn't matter, since mapping does the same.

by kallisti5, 11 years ago

Attachment:	bash-7459-debug-16-12-2013-21-15-21.report added

bash - report - mapping of vm_page_fault

by kallisti5, 11 years ago

Attachment:	bash-7459-kdl.txt added

bash - kdl - mapping of vm_page_fault

in reply to: 20 comment:21 by kallisti5, 11 years ago

Replying to bonefish:

In all three of the latest crashes an access to the stack caused the page fault. As written before, it would be good to get the mapping output for the fault address, not (only) for the IP. If you can't deduce the fault address from the disassembly, please look it up in the syslog before entering KDL -- there should be two "vm_page_fault:..." lines for the crash which both contain the fault address. Possibly rounded down to the page address, which doesn't matter, since mapping does the same.

Ah, ok. I didn't pick that up before. Done and attached.

Thanks for looking at this! It's been driving me nuts :-)

comment:22 by bonefish, 11 years ago

0x15661ec was listed as the fault address? Given that the instruction was a push %ebp and the stack pointer was 0x72ce8ce4, I'd have expected 0x72ce8ce0 as the fault address.
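The arithmetic behind that expectation can be written down as a small sketch. It assumes 32-bit x86 semantics, where `push %ebp` decrements the stack pointer by 4 and then stores 4 bytes at the new top of the stack; the function names are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* On 32-bit x86, push first decrements %esp by 4 and then writes the
 * 4-byte operand at the new %esp, so the first byte it touches is esp - 4. */
static uint32_t expected_push_fault_address(uint32_t esp)
{
    assert(esp >= 4);
    return esp - 4;
}

/* Returns nonzero if fault_address lies within the 4 bytes that a push
 * executed with the given stack pointer would write. */
static int fault_matches_push(uint32_t esp, uint32_t fault_address)
{
    uint32_t first = expected_push_fault_address(esp);
    return fault_address >= first && fault_address - first < 4;
}
```

With the stack pointer 0x72ce8ce4 from the report, `expected_push_fault_address` yields 0x72ce8ce0, and `fault_matches_push` rejects the reported 0x15661ec, which is the discrepancy in question.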

comment:23 by ttcoder, 11 years ago

Cc:	degea@… added

/me wonders if it's normal to have a NULL frame pointer as per the userland report:

Frame       IP          Function Name
-----------------------------------------------
00000000    0x1534b70   init_yy_io + 0

comment:24 by kallisti5, 11 years ago

KERN: write access attempted on write-protected area 0x309c3 at 0x01566000
KERN: vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0x15661ec, ip 0x1534b70, write 1, user 1, thread 0x1d23
KERN: vm_page_fault: thread "sh" (7459) in team "sh" (7459) tried to write address 0x15661ec, ip 0x1534b70 ("bash_seg0ro" +0x26b70)
KERN: debug_server: Thread 7459 entered the debugger: Segment violation
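For reference, the fault address and IP can be pulled out of such a vm_page_fault line and rounded down to the page address programmatically. This is a rough sketch, not anything from the Haiku tree: `parse_fault_line` and `page_base` are hypothetical helpers, the line format is assumed to match the syslog excerpt above, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE_BYTES 0x1000u /* assumption: 4 KiB pages */

/* Rounds an address down to the start of its page. */
static uint32_t page_base(uint32_t address)
{
    return address & ~(PAGE_SIZE_BYTES - 1u);
}

/* Extracts the fault address and instruction pointer from a syslog line
 * containing "address 0x..., ip 0x...". Returns 1 on success, 0 otherwise. */
static int parse_fault_line(const char *line, uint32_t *address, uint32_t *ip)
{
    unsigned int parsed_address = 0;
    unsigned int parsed_ip = 0;
    const char *field;

    assert(line != NULL && address != NULL && ip != NULL);
    field = strstr(line, "address 0x");
    if (field == NULL)
        return 0;
    if (sscanf(field, "address 0x%x, ip 0x%x", &parsed_address, &parsed_ip) != 2)
        return 0;
    *address = (uint32_t)parsed_address;
    *ip = (uint32_t)parsed_ip;
    return 1;
}
```

Applied to the "tried to write address" line above, the parsed address 0x15661ec rounds down to 0x1566000, the same page the "write-protected area" line reports.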

comment:25 by kallisti5, 11 years ago

as an fyi, I just did a memtest86+ round on the machine seeing this issue and no hardware errors were reported.

comment:26 by bonefish, 11 years ago

Mmh, curious. That means that the instruction that has actually been executed is not the one Debugger and later the kernel see. Please copy the dprintf() from http://cgit.haiku-os.org/haiku/tree/src/system/kernel/vm/vm.cpp#n4235 and insert a modified version farther below before the user_debug_exception_occurred() invocation at http://cgit.haiku-os.org/haiku/tree/src/system/kernel/vm/vm.cpp#n4308. In the new line the dprintf would be replaced by panic and it should look like || (panic(...), false) to work in the if condition.

This will cause the kernel to panic immediately when a userland team crashes, so we'll hopefully see what was actually executed. The information to retrieve would be the same: mapping for both fault address and IP (both in the panic message), dis -- for neither command the thread has to be specified anymore (i.e. omit the in_context <thread> before dis) -- and the debug report that can be captured after leaving KDL.

comment:27 by kallisti5, 11 years ago

@bonefish Is the attached patch "good enough"? Tried to change everything you were looking for but got a little confused on the || (panic(...), false) bit as it seemed a little overly complex for what you were looking for.

Going to place image onto a usb disk, and try compiling something on the test machine to trigger.

by kallisti5, 11 years ago

Attachment:	vm-page-fault-diff added

comment:28 by kallisti5, 11 years ago

patch:	0 → 1

by kallisti5, 11 years ago

Attachment:	bash-14580-debug-17-12-2013-22-11-29.report added

by kallisti5, 11 years ago

Attachment:	bash-7459-kdl.2.txt added

by kallisti5, 11 years ago

Attachment:	bash-14580-kdl-special.txt added

comment:29 by kallisti5, 11 years ago

bash-14580-debug-17-12-2013-22-11-29.report and bash-14580-kdl-special.txt attached.

The addresses in the vm_page_fault were 0x0 (obviously an issue there), so I'm not sure the mapping address I picked was correct. I went with "0x81fd3ea8" from kernel_debugger_loop (not sure if that was a pointer to data, or the actual pointer to the fault address).

follow-up: 31 comment:30 by bonefish, 11 years ago

Your patch is OK in principle. It may, however, cause false positives. The first subconditions of the if check whether a SIGSEGV handler is installed for the team. Some programs (particularly programming language interpreters -- AFAIK OpenJDK, not sure about python) use those as a tool for their regular memory management (e.g. enlarging allocations lazily).

Reentering KDL after getting the debug report is not necessary (in case that was what you intended to attach -- "bash-7459-kdl.2.txt" seems to be old). In the first KDL session we get the information we need (and "fresher").

Omitting the in_context <thread> for the dis is somewhat important, though. We are already in the context of the thread anyway and we want to verify whether the CPU currently sees what it should see. The in_context resets the paging structures and may thus hide the problem.

We don't need a bt, as the panic already prints one. And finally, the mapping is needed for the two addresses printed in the panic message (the read or write address and the IP). Should it have been overwritten on screen by the stack trace, you can reprint the message via message. For a NULL pointer, as in this case, mapping is not needed.
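To illustrate why an installed SIGSEGV handler can make a page fault a normal event rather than a crash, here is a minimal POSIX sketch of lazily enabling memory from a signal handler. It is not taken from OpenJDK, python, or Haiku; all names are hypothetical, and calling mprotect() from a signal handler is an assumption that holds on common systems but is not guaranteed by POSIX.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static char *region_base;
static size_t region_size;
static long page_size;
static volatile sig_atomic_t faults_handled;

/* SIGSEGV handler: if the fault lies inside the reserved region, make the
 * touched page accessible and return so the faulting access is retried. */
static void on_segv(int sig, siginfo_t *info, void *context)
{
    uintptr_t address = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region_base;
    uintptr_t page;

    (void)sig;
    (void)context;
    if (address < base || address >= base + region_size)
        _exit(1); /* a genuine crash outside the lazily managed region */
    page = address & ~((uintptr_t)page_size - 1);
    if (mprotect((void *)page, (size_t)page_size, PROT_READ | PROT_WRITE) != 0)
        _exit(2);
    faults_handled++;
}

/* Reserves `pages` inaccessible pages and installs the handler.
 * Returns 0 on success, -1 on failure. */
static int lazy_region_init(size_t pages)
{
    struct sigaction action;

    page_size = sysconf(_SC_PAGESIZE);
    region_size = pages * (size_t)page_size;
    region_base = mmap(NULL, region_size, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region_base == MAP_FAILED)
        return -1;
    memset(&action, 0, sizeof(action));
    action.sa_sigaction = on_segv;
    action.sa_flags = SA_SIGINFO;
    sigemptyset(&action.sa_mask);
    return sigaction(SIGSEGV, &action, NULL);
}

/* Writes one byte into the region; the first write to each page faults. */
static void lazy_write(size_t offset, char value)
{
    assert(offset < region_size);
    ((volatile char *)region_base)[offset] = value;
}
```

A kernel that panics on every userland page fault would stop on each first touch in a program like this, which is the kind of false positive described above.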

in reply to: 30 comment:31 by kallisti5, 11 years ago

Replying to bonefish:

Your patch is OK in principle. It may, however, cause false positives. The first subconditions of the if check whether a SIGSEGV handler is installed for the team. Some programs (particularly programming language interpreters -- AFAIK OpenJDK, not sure about python) use those as a tool for their regular memory management (e.g. enlarging allocations lazily).

Yeah, that was the one edge case I saw.. however definitely not something i'm going to commit or run for more than 10 minutes.

Reentering KDL after getting the debug report is not necessary (in case that was what you intended to attach -- "bash-7459-kdl.2.txt" seems to be old). In the first KDL session we get the information we need (and "fresher").

Sorry, ignore bash-7459-kdl.2.txt, that was an accidental upload.

Omitting the in_context <thread> for the dis is somewhat important, though. We are already in the context of the thread anyway and we want to verify whether the CPU currently sees what it should see. The in_context resets the paging structures and may thus hide the problem.

We don't need a bt, as the panic already prints one. And finally, the mapping is needed for the two addresses printed in the panic message (the read or write address and the IP). Should it have been overwritten on screen by the stack trace, you can reprint the message via message. For a NULL pointer, as in this case, mapping is not needed.

yeah, the bt was mostly to get the tid as it got cut off, next time i'll run message.

Going to do another test here in a few minutes and will hopefully have one more set of logs that will be more useful.

by kallisti5, 11 years ago

Attachment:	bash-24569-kdl.txt added

by kallisti5, 11 years ago

Attachment:	bash-15133-kdl.txt added

by kallisti5, 11 years ago

Attachment:	bash-6976-debug-18-12-2013-01-32-14.report added

by kallisti5, 11 years ago

Attachment:	bash-6976-kdl.txt added

comment:32 by kallisti5, 11 years ago

I personally think bash-24569-kdl.txt is the most interesting as it had a solid memory address behind the page fault. I forgot to collect a report for 24569 and 15133, hopefully they are still useful.

comment:33 by bonefish, 11 years ago

I'm still hoping for a dis.

by kallisti5, 11 years ago

Attachment:	sh-13815-kdl.txt added

comment:34 by bonefish, 11 years ago

Thanks. Unfortunately that only verifies that the thread sees the correct code already at that point. It is obvious that either other code must have run or the register values were different, since the push %ebp cannot cause the page fault at 0x0 with esp being 0x71a35064. That the register values we see are incorrect is highly unlikely -- they are dumped on the kernel stack right when the page fault occurs.

That leaves the code. I don't think this is a problem at the paging level. At least I don't see how the thread runs the correct code, then one or more pages are mapped incorrectly, and immediately thereafter things look correct again. ATM the only explanation I can think of is that there's a race condition that can cause the thread to temporarily run in the wrong address space. Will think about how to verify that theory.

comment:35 by kallisti5, 11 years ago

hmmmmmmmmm. I disabled all but one cpu core (the first one) in Pulse, 30 minutes of compiling and no crash :-\

comment:36 by kallisti5, 11 years ago

I should mention that I don't have to do a parallel build or something to trigger these crashes. Each thread being spun up on a random core seems to be enough to trigger it. Once this build with a single core finishes, i'll bump it up to two cores to see if it still seems stable.

comment:37 by kallisti5, 11 years ago

Enabling as few as two CPUs makes the crashes start again. With one CPU core enabled, I haven't seen a crash yet after several extremely long builds. Definitely something with multiple CPUs accessing memory.

by bonefish, 11 years ago

Attachment:	always-set-page-dir-on-context-switch.diff added

comment:38 by bonefish, 11 years ago

Alex, could you try the attached patch. It removes an optimization, trying (at least partially) to verify my latest theory.

comment:39 by kallisti5, 11 years ago

No change, it seems. Building ncurses crashed just as quickly. Attaching a KDL log + report from the image modified with the changes above, in case it helps.

by kallisti5, 11 years ago

Attachment:	bash-6630-debug-25-12-2013-03-13-53.report added

by kallisti5, 11 years ago

Attachment:	bash-6630-kdl.txt added

comment:40 by waddlesplash, 11 years ago

Any way to test this in VirtualBox? I've been running Haiku in a 2-core 1.5GB VM, no crashes. If I increase memory to 4097MB do the crashes start?

comment:41 by SeanCollins, 11 years ago

I see the same behavior as kallisti5. Is this on AMD or Intel hardware?

follow-up: 44 comment:42 by kallisti5, 11 years ago

AMD bulldozer here. 8 cores, 16 GB ram.

follow-up: 45 comment:43 by bonefish, 11 years ago

Intel Core i7-860 16 GB RAM: So far I've only seen two crashes that might be related, and only after quite some time and memory turnover. So the issue may not be machine specific but it's obviously a lot easier to reproduce on some machines.

It looks like a race condition on multi-CPU hardware. It would be interesting to see whether Pawel's scheduler branch fixes it. He has changed quite a bit of low-level locking.

in reply to: 42 comment:44 by SeanCollins, 11 years ago

Replying to kallisti5:

AMD bulldozer here. 8 cores, 16 GB ram.

Same here, and if it means anything, I find this bug far, far harder to induce on Intel hardware. I wonder if there is a memory controller or cache difference that could explain this behavior? Thinking back, I never saw this issue as much on the Phenom II hardware; as soon as I upgraded to the 8350 I started having trouble.

Last edited 11 years ago by SeanCollins

in reply to: 43 comment:45 by kallisti5, 11 years ago

Replying to bonefish:

It looks like a race condition on multi-CPU hardware. It would be interesting to see whether Pawel's scheduler branch fixes it. He has changed quite a bit of low-level locking.

Yeah.. That test is next on my to do list unless Sean beats me to it :-)

comment:46 by kallisti5, 11 years ago

nope, seems the scheduler branch currently doesn't compile:

Link /home/kallisti5/code/haiku-scheduler/generated.x86_gcc2/objects/haiku/x86_gcc2/release/system/libroot/libroot.so 
/home/kallisti5/code/haiku-scheduler/generated.x86_gcc2/objects/haiku/x86_gcc2/release/system/libroot/libroot_init.o: In function `initialize_before':
libroot_init.c:(.text+0xe3): undefined reference to `get_system_info'
/home/kallisti5/code/haiku-scheduler/generated.x86_gcc2/objects/haiku/x86_gcc2/release/system/libroot/os/os_main.o: In function `_get_system_info':
(.text+0x4851): undefined reference to `get_cpu_topology_info'
/home/kallisti5/code/haiku-scheduler/generated.x86_gcc2/objects/haiku/x86_gcc2/release/system/libroot/os/os_main.o: In function `_get_system_info':
(.text+0x48a8): undefined reference to `get_cpu_topology_info'
/home/kallisti5/code/haiku-scheduler/generated.x86_gcc2/objects/haiku/x86_gcc2/release/system/libroot/posix/sys/posix_sys.o: In function `uname':
(.text+0x15b7): undefined reference to `get_system_info'
/home/kallisti5/code/haiku-scheduler/generated.x86_gcc2/objects/haiku/x86_gcc2/release/system/libroot/posix/sys/posix_sys.o: In function `uname':
(.text+0x165d): undefined reference to `get_cpu_topology_info'
/home/kallisti5/code/haiku-scheduler/generated.x86_gcc2/objects/haiku/x86_gcc2/release/system/libroot/posix/unistd/posix_unistd.o: In function `__sysconf':
(.text+0x4d3): undefined reference to `get_system_info'
/home/kallisti5/code/haiku-scheduler/generated.x86_gcc2/objects/haiku/x86_gcc2/release/system/libroot/posix/unistd/posix_unistd.o: In function `__sysconf':
(.text+0x505): undefined reference to `get_system_info'
/home/kallisti5/code/haiku-scheduler/generated.x86_gcc2/objects/haiku/x86_gcc2/release/system/libroot/posix/unistd/posix_unistd.o: In function `__sysconf':
(.text+0x55b): undefined reference to `get_system_info'
/home/kallisti5/code/haiku-scheduler/generated.x86_gcc2/objects/haiku/x86_gcc2/release/system/libroot/posix/unistd/posix_unistd.o: In function `__sysconf':
(.text+0x58b): undefined reference to `get_system_info'
/home/kallisti5/code/haiku-scheduler/generated.x86_gcc2/objects/haiku/x86_gcc2/release/system/libroot/posix/malloc/posix_malloc.o: In function `BPrivate::hoardHeap::initNumProcs(void)':
(.text+0x370e): undefined reference to `get_system_info'
collect2: ld returned 1 exit status

comment:47 by kallisti5, 11 years ago

just an fyi... tried the scheduler branch after Pawel pointed out a needed cherry-pick. It actually makes things worse: the crashes are easier to trigger :-\

follow-up: 50 comment:48 by bonefish, 11 years ago

That's a good thing, if that's also true on other hardware. The issue isn't really practical to debug, if it is as hard to reproduce as it is now.

comment:49 by kallisti5, 11 years ago

as a side note, i've been testing this on gcc2h so far. Installed gcc4h on the machine this evening... same issue :-)

in reply to: 48 ; follow-up: 51 comment:50 by SeanCollins, 11 years ago

Replying to bonefish:

That's a good thing, if that's also true on other hardware. The issue isn't really practical to debug, if it is as hard to reproduce as it is now.

I think Kallisti5 should hit up his AMD engineering cohort on the AMD driver work, and see if they can send you a Bulldozer CPU, because I can reproduce this easily. The glaring difference I see is Intel vs. AMD.

in reply to: 50 ; follow-up: 53 comment:51 by kallisti5, 11 years ago

Replying to SeanCollins:

Replying to bonefish:

That's a good thing, if that's also true on other hardware. The issue isn't really practical to debug, if it is as hard to reproduce as it is now.

I think Kallisti5 should hit up his AMD engineering cohort on the AMD driver work, and see if they can send you a Bulldozer CPU, because I can reproduce this easily. The glaring difference I see is Intel vs. AMD.

Haha, I don't want to abuse my connections. Honestly if buying bonefish a bulldozer cpu + motherboard is what it takes, I have no issues doing so :-) (or maybe a good use of Haiku funds?)

Sean, is yours an 8-core? Wonder how cheap of a CPU we could get away with that sees this issue.

by kallisti5, 11 years ago

Attachment:	IMG_20140118_103741.jpg added

Not sure if this is related (if it isn't, I can open a new bug), but as of hrev46699 I get this kdl while compiling.

comment:52 by bonefish, 11 years ago

Looks like ordinary NULL pointer dereferencing in the signal code. Since Pawel has touched that code, it's very likely that this is a new and unrelated issue. Please open a new ticket and assign it to him.

in reply to: 51 comment:53 by bonefish, 11 years ago

Replying to kallisti5:

Replying to SeanCollins:

I think Kallisti5 should hit up his AMD engineering cohort on the AMD driver work, and see if they can send you a Bulldozer CPU, because I can reproduce this easily. The glaring difference I see is Intel vs. AMD.

Haha, I don't want to abuse my connections. Honestly if buying bonefish a bulldozer cpu + motherboard is what it takes, I have no issues doing so :-) (or maybe a good use of Haiku funds?)

FWIW, I wouldn't mind a temporary hardware loan for debugging the issue -- I can ship it back afterward as I don't really need it otherwise. Since I don't want to disassemble my main development machine and don't have all the parts to build another one, just a board and a CPU wouldn't help that much, though. If anyone has a laptop to spare for a while, that would be pretty perfect (one with a serial port would be divine, but I have no illusions :-)).

Other than that, it would be interesting whether the bug can also be reproduced in a virtual machine. That might allow for a remote debugging setup.

comment:54 by kallisti5, 11 years ago

hm.. actually i've yet to see *this* crash after the scheduler merge. All i've seen so far is #10433

comment:55 by anevilyak, 11 years ago

It might be fixed, but if the problem was in some way cache- or TLB-related, it's possible that the new scheduler is simply hiding it by virtue of properly supporting affinity: with the previous scheduler the thread would have constantly been hopping between cores, which is no longer the case. It would probably be worth analyzing it on a revision where it still occurs just to be sure. If it is in fact still present, it might simply crop up in more subtle or harder-to-analyze ways now.

comment:56 by kallisti5, 11 years ago

nope. Just occurred on hrev46711. Tried to grab a report, but the entire system locked up (no kdl).

Triggered a kdl manually and looked at syslog:

vm_soft_fault: va 0x0 not covered by area in address space
vm_page_fault: vm_soft_fault returned `Bad address` on fault at 0x80, ip, 0x80, write 0, user 1, thread 0x61a8
vm_page_fault: thread "sh" (25000) in team "sh" (25000) tried to execute address 0x80, ip 0x80 ("???" +0x80)

I think the KDL while generating a report may be due to another scheduler bug.
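For anyone collecting many of these syslog entries, the per-thread fault line quoted above can be split into fields mechanically. The following is only a rough sketch: the regular expression is inferred from the two `vm_page_fault: thread ...` lines pasted in this ticket, not from the kernel source, and `parse_fault` is a hypothetical helper name.

```python
import re

# Pattern inferred from the fault lines pasted in this ticket (an assumption,
# not taken from the kernel source).
FAULT_RE = re.compile(
    r'vm_page_fault: thread "(?P<thread>[^"]+)" \((?P<tid>\d+)\) '
    r'in team "(?P<team>[^"]+)" \((?P<team_id>\d+)\) '
    r'tried to (?P<access>\w+) address (?P<addr>0x[0-9a-fA-F]+), '
    r'ip (?P<ip>0x[0-9a-fA-F]+) '
    r'\("(?P<image>[^"]*)" \+(?P<offset>0x[0-9a-fA-F]+)\)'
)


def parse_fault(line):
    """Return a dict of fields for a fault line, or None if it does not match."""
    match = FAULT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Thread and team IDs are printed in decimal.
    for key in ("tid", "team_id"):
        fields[key] = int(fields[key])
    # Addresses and the image offset are printed in hexadecimal.
    for key in ("addr", "ip", "offset"):
        fields[key] = int(fields[key], 16)
    return fields


if __name__ == "__main__":
    sample = ('vm_page_fault: thread "sh" (25000) in team "sh" (25000) '
              'tried to execute address 0x80, ip 0x80 ("???" +0x80)')
    print(parse_fault(sample))
```

Grouping the parsed entries by `image` and `access` would show at a glance whether faults cluster in one library or are spread randomly, which is the question this ticket keeps coming back to.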

comment:57 by kallisti5, 11 years ago

opened #10446 as it could be another scheduler bug.

comment:58 by SeanCollins, 11 years ago

@kallisti5 amd 8350 cpu

comment:59 by kallisti5, 11 years ago

Description:	modified (diff)

comment:60 by kallisti5, 11 years ago

Milestone:	R1/beta1 → R1/alpha5
Priority:	high → critical

as this affects all users of modern AMD hardware, i'm putting this on alpha5 for now. (if we can't fix it this one may be bumped to beta1)

comment:61 by kallisti5, 10 years ago

Just saw this in a VM running kvm on an AMD system. (hrev47345, x86 gcc4h)

building ncurses with haikuporter

vm_page_fault: vm_soft_fault returned error 'Bad address' on fault at 0x70d35f2c, ip 0x179fa16, write 1, user 1, thread 0x6c71
vm_page_fault: thread "tic" (27761) in team "tic" (27761) tried to write address 0x70d35f2c, ip 0x179fa16 ("libroot.so_seg0ro" +0x71a16)
debug_server: Thread 27761 entered the debugger: Segment violation
stack trace, current PC 0x179fa16  _IO_vfprintf + 0xc:
  (0x70d38918)  0x178b346  vsprintf + 0x5d
  (0x70d389f8)  0x179ec7b  sprintf + 0x21
  (0x70d38a18)  0x275a120  _nc_read_entry + 0x30
  (0x70d38a68)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d38ef8)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d397b8)  0x275a259  _nc_read_entry + 0x169
  (0x70d39808)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d39c98)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d3a558)  0x275a259  _nc_read_entry + 0x169
  (0x70d3a5a8)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d3aa38)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d3b2f8)  0x275a259  _nc_read_entry + 0x169
  (0x70d3b348)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d3b7d8)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d3c098)  0x275a259  _nc_read_entry + 0x169
  (0x70d3c0e8)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d3c578)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d3ce38)  0x275a259  _nc_read_entry + 0x169
  (0x70d3ce88)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d3d318)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d3dbd8)  0x275a259  _nc_read_entry + 0x169
  (0x70d3dc28)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d3e0b8)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d3e978)  0x275a259  _nc_read_entry + 0x169
  (0x70d3e9c8)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d3ee58)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d3f718)  0x275a259  _nc_read_entry + 0x169
  (0x70d3f768)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d3fbf8)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d404b8)  0x275a259  _nc_read_entry + 0x169
  (0x70d40508)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d40998)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d41258)  0x275a259  _nc_read_entry + 0x169
  (0x70d412a8)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d41738)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d41ff8)  0x275a259  _nc_read_entry + 0x169
  (0x70d42048)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d424d8)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d42d98)  0x275a259  _nc_read_entry + 0x169
  (0x70d42de8)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d43278)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d43b38)  0x275a259  _nc_read_entry + 0x169
  (0x70d43b88)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d44018)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d448d8)  0x275a259  _nc_read_entry + 0x169
  (0x70d44928)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d44db8)  0x275a451  _nc_read_termcap_entry + 0x151
  (0x70d45678)  0x275a259  _nc_read_entry + 0x169
  (0x70d456c8)  0x275e956  _nc_resolve_uses2 + 0x326
  (0x70d45b58)  0x275a451  _nc_read_termcap_entry + 0x151
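The trace above repeats the same three ncurses frames over and over. A small helper can make that pattern explicit by counting how often each function appears and how far apart successive frames of the same function are. This is a sketch only: the frame regex is an assumption based on the debug_server lines pasted above, and `parse_frames` and `recursion_summary` are hypothetical names.

```python
import re
from collections import Counter

# Matches lines such as "  (0x70d38a18)  0x275a120  _nc_read_entry + 0x30".
# The layout is inferred from the trace pasted in this comment (an assumption).
FRAME_RE = re.compile(
    r'\((0x[0-9a-fA-F]+)\)\s+(0x[0-9a-fA-F]+)\s+(\S+) \+ (0x[0-9a-fA-F]+)'
)


def parse_frames(trace):
    """Return (frame_address, function_name) pairs in the order they appear."""
    return [(int(m.group(1), 16), m.group(3)) for m in FRAME_RE.finditer(trace)]


def recursion_summary(frames):
    """Count frames per function and collect the address gaps between repeats."""
    counts = Counter(name for _, name in frames)
    spacing = {}
    last_seen = {}
    for address, name in frames:
        if name in last_seen:
            spacing.setdefault(name, set()).add(address - last_seen[name])
        last_seen[name] = address
    return counts, spacing


if __name__ == "__main__":
    sample = """
      (0x70d38a18)  0x275a120  _nc_read_entry + 0x30
      (0x70d38a68)  0x275e956  _nc_resolve_uses2 + 0x326
      (0x70d38ef8)  0x275a451  _nc_read_termcap_entry + 0x151
      (0x70d397b8)  0x275a259  _nc_read_entry + 0x169
    """
    counts, spacing = recursion_summary(parse_frames(sample))
    print(counts)
    print({name: [hex(gap) for gap in gaps] for name, gaps in spacing.items()})
```

A single constant gap per function indicates that each round of the cycle consumes the same amount of stack; whether that amounts to runaway recursion or just a deep but legitimate call chain still has to be judged from the ncurses source.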

Last edited 10 years ago by kallisti5 (previous) (diff)

follow-up: 63 comment:62 by jessicah, 10 years ago

From the original bug report description, I'm still seeing instances of this a LOT. I have a folder full of debug reports :) 29 so far. The majority affect bash, but I've had crashes in as, grep (might be unrelated), ld, media_addon_server, TextSearch, and Tracker.

Most are segment violations, some are invalid opcode exceptions, and in Tracker, a general protection fault.

This is on hrev47582, with an i3-4010U CPU. I can upload the reports if you think they'd be useful. Hardly any of them crash in identical places...

in reply to: 62 comment:63 by jprostko, 10 years ago

Replying to jessicah:

From the original bug report description, I'm still seeing instances of this a LOT. I have a folder full of debug reports :) 29 so far. The majority affect bash, but I've had crashes in as, grep (might be unrelated), ld, media_addon_server, TextSearch, and Tracker.

I can crash grep quite easily, as using "-ir" to do a recursive, case-insensitive search crashes grep for me 100% of the time. Not sure if you are experiencing similar things, but figured I'd point that out. It might deserve its own ticket.

follow-up: 65 comment:64 by kallisti5, 10 years ago

@jprostko I think that's a bug in grep. Using -i results in a crash on every run.

in reply to: 64 comment:65 by jprostko, 10 years ago

Replying to kallisti5:

@jprostko I think that's a bug in grep. Using -i results in a crash on every run.

Yeah, I kind of figured that. I guess that's a bug report for HaikuPorts then.

comment:66 by pulkomandy, 10 years ago

Description:	modified (diff)

There already is a bugreport for the grep -i issue: https://dev.haiku-os.org/ticket/9342

comment:67 by pulkomandy, 10 years ago

Description:	modified (diff)

For the record, I'm getting similar symptoms on my Core i3 machine. After a few hours of uptime and compilation of packages with haikuporter, I get random "invalid opcode exception" on the first push instruction of functions.

comment:68 by waddlesplash, 10 years ago

With the recent kernel fixes: @pulkomandy @kallisti5 have either of you seen this recently?

comment:69 by jessicah, 10 years ago

I've experienced a crash in bash still. But I don't know if it's the same type of crash.

by jessicah, 10 years ago

Attachment:	bash-63838-debug-05-11-2014-06-10-19.report added

bash crashing on hrev48222

comment:70 by pulkomandy, 10 years ago

Does not look similar, this is a "normal" crash (out of bounds access in strcpy - people are still using strcpy?) and probably an actual bug in bash.

I'll be watching for this next time I do some haikuporter work, which is usually a good way to trigger it here.

comment:71 by pulkomandy, 10 years ago

Milestone:	R1/alpha5 → R1/beta1

by jessicah, 10 years ago

Attachment:	Terminal-238260-debug-15-11-2014-02-02-58.report added

Invalid opcode in Terminal, whilst running boost tests with -j4

comment:72 by waddlesplash, 10 years ago

With all the kernel fixes in the past 4mo, has anyone seen this recently?

comment:73 by pulkomandy, 10 years ago

I didn't meet this problem for some time. However, don't close this until someone with an AMD Bulldozer CPU has reported it's ok (I have only Intel machines here).

comment:74 by kallisti5, 10 years ago

I attempted to run a recent x86_64 haiku on my 8-core bulldozer and had hard lockup issues. Might be unrelated... but it has prevented me from getting in the workload to try and reproduce this one.

As soon as I have some free-time i'll hook up a serial cable to my main desktop and see if I can document / solve the hard lockup issue.

comment:75 by jessicah, 9 years ago

I've just had a definitive experience of this again. I've had invalid opcode type debug reports all the time still, but this one is clearly something going awry.

attachment:Vision-369516-debug-06-06-2015-23-34-50.report attachment:bash-873030-debug-07-06-2015-01-38-12.report attachment:syslog

Granted, I was running a boost regression testsuite, a texlive package build, and a libqt4 package build concurrently... :p

On hrev49248, x86_gcc2 hybrid. Intel i5-3570K w/8GB RAM.

Last edited 9 years ago by jessicah (previous) (diff)

by jessicah, 9 years ago

Attachment:	Vision-369516-debug-06-06-2015-23-34-50.report added

by jessicah, 9 years ago

Attachment:	syslog added

syslog in relation to Vision debug report

by jessicah, 9 years ago

Attachment:	bash-873030-debug-07-06-2015-01-38-12.report added

And also bash from same session

comment:76 by kallisti5, 9 years ago

Bah. We really can't have things crashing under heavy workloads (esp if we start automating recipe builds)

Do we want to call this one a blocker to R1 beta 1?

comment:77 by pulkomandy, 9 years ago

Is this still a problem? Things are now very stable for me (I've been building webkit for the whole day, and did quite a lot of stuff during the coding sprint too) without ever hitting a crash; but I'm on Intel hardware.

comment:78 by kallisti5, 9 years ago

I actually haven't seen this for some time on an i5 x86_64 compiling things like LLVM. jessicah mentioned still seeing it, but maybe she's seeing another issue?

comment:79 by kallisti5, 9 years ago

Just realized that this is a bug against AMD and not Intel. I'll confirm this evening if it is still happening.

comment:80 by SeanCollins, 9 years ago

I am attempting to reproduce; so far so good, with a lot of system stress and compiling going on in the background. Will report back in a few days. Looks fixed so far. The good news is that I have not changed ANY of the hardware in my machine since I first commented nearly 3 years ago, so I should have one of the better test cases.

Last edited 9 years ago by SeanCollins (previous) (diff)

comment:81 by SeanCollins, 9 years ago

I haven't been able to reproduce this bug, and I used to be able to hit it ALL the TIME. It is very likely that some of the memory fixes and scheduler work have fixed this for me.

AMD 8350 cpu

comment:82 by kallisti5, 8 years ago

I've changed mainboards (same cpu tho). This issue has gone away but an even worse #12671 has popped up (990FX chipset)

I think it's safe to call this one resolved?

comment:83 by jessicah, 8 years ago

I still get spurious crashes now and then; although my system is Intel. Just haven't uploaded any of my debug reports in a while. Suppose I should do that... maybe the new core dump feature might be useful if it happens for me again :)

comment:84 by kallisti5, 8 years ago

patch:	1 → 0

comment:85 by korli, 8 years ago

Is this bug also on x86_64?

comment:86 by axeld, 8 years ago

Owner:	changed from axeld to nobody
Status:	new → assigned

So any status update here? Otherwise we could close it.

comment:87 by axeld, 8 years ago

Blocking:	12796 added

(In #12796) While it may be helpful to describe the situation when an unexpected crash happens, it doesn't necessarily mean it causes it: "Full moon causes addattr crash", "Sunrise brings down Deskbar", ... :-)

Anyway, this sounds a lot like #10279.

comment:88 by waddlesplash, 7 years ago

Milestone:	R1/beta1 → Unscheduled

Nobody seems to have seen this in some time; removing from b1.

comment:89 by korli, 7 years ago

Still seen every now and then, especially when doing mass updates with haikuporter, for instance four builders on 4 CPUs; the more chroots there are, the easier it is to reproduce.

comment:90 by waddlesplash, 7 years ago

There were a variety of potential races in team creation that should now be resolved in hrev51700, some of which involved pointers being initialized while other kernel threads already had access to the data structure.

I have no idea of the amount of (bad) luck it would take for such a race to collide with some other (legitimate and writable) chunk of accessed memory ... but given that I heard reports from users that the first change to that code possibly cleared up some other KDLs, perhaps it's somehow related.

comment:91 by nzimmermann, 7 years ago

I see sporadic crashes on an AMD FX-6100 six-core processor when compiling larger projects, mostly in bash (but only when using make -j 6, not in single-core builds).

by nzimmermann, 7 years ago

Attachment:	python2.7-988-debug-06-04-2018-10-21-30.report added

follow-up: 93 comment:92 by nzimmermann, 7 years ago

The python crash happened while running haikuporter on a freshly initialized git repo. I then re-ran hp and it resumed just fine.

This happened to me quite often on bare metal using this AMD machine, but never inside VirtualBox.

in reply to: 92 ; follow-up: 95 comment:93 by korli, 7 years ago

Replying to Nikolas Zimmermann:

The python crash happened while running haikuporter on a freshly initialized git repo. I then re-ran hp and it resumed just fine.

You could try with 4gb_memory_limit (available in the bootloader).

comment:94 by nzimmermann, 7 years ago

Thanks @korli, unfortunately that didn't help. Same issue.

But it's very hard for me to track this issue at present, because my machine also reboots instantaneously under heavy load, e.g. make -j 6 on my AMD 6-core system.

I set up debugging via COM port and captured the syslog on another machine, but unfortunately there was no useful information at the time of reboot: the syslog looks fine, with no hints in the right direction.

As my test machine was a spare one "from the attic", I have now replaced the main board, power supply, RAM, etc., but the problem persists; only the CPU stayed the same. I will install Arch Linux tomorrow and see if I can reproduce the reboot under heavy load. Maybe I'm facing bad hardware...

in reply to: 93 comment:95 by nzimmermann, 7 years ago

Replying to korli:

Replying to Nikolas Zimmermann:

The python crash happened while running haikuporter on a freshly initialized git repo. I then re-ran hp and it resumed just fine.

You could try with 4gb_memory_limit (available in the bootloader).

Regarding my random reboot problems, which make debugging so hard: by now I have also replaced the CPU, and played around with all kinds of BIOS settings (voltages, frequencies, etc.) and all combinations of the safe mode boot options.

The 4gb_memory_limit helps me! When using that option, I no longer experience the random reboots. The random binary crashes are still present under heavy load - but at least my machine doesn't reboot every ~10 minutes anymore!

Syslog entries from the crash runs:

2018-04-12 10:46:55 KERN: using PAE paging
2018-04-12 10:54:34 KERN: using PAE paging
2018-04-12 12:12:28 KERN: using PAE paging
2018-04-12 12:21:04 KERN: using PAE paging
...

Stable runs:

2018-04-12 12:26:09 KERN: using 32 bit paging (PAE not available)
2018-04-12 12:42:09 KERN: using 32 bit paging (PAE not available)
2018-04-12 13:01:41 KERN: using 32 bit paging (PAE not available)
..
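To line up reboots with the paging mode over a longer syslog, the boot entries quoted above can be extracted with a few lines of script. This is a sketch under the assumption that every boot logs one of the two messages exactly as pasted here; `paging_modes` and `count_modes` are hypothetical helper names.

```python
import re
from collections import Counter

# Matches the two message forms quoted in this comment:
#   "<date> <time> KERN: using PAE paging"
#   "<date> <time> KERN: using 32 bit paging (PAE not available)"
PAGING_RE = re.compile(
    r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) KERN: using (PAE|32 bit) paging'
)


def paging_modes(syslog_text):
    """Return (timestamp, mode) tuples for every boot found in the text."""
    return PAGING_RE.findall(syslog_text)


def count_modes(syslog_text):
    """Count how many boots used each paging mode."""
    return Counter(mode for _, mode in paging_modes(syslog_text))


if __name__ == "__main__":
    sample = (
        "2018-04-12 12:21:04 KERN: using PAE paging\n"
        "2018-04-12 12:26:09 KERN: using 32 bit paging (PAE not available)\n"
    )
    for timestamp, mode in paging_modes(sample):
        print(timestamp, mode)
```

Because each entry marks a boot, short intervals between consecutive timestamps in one mode and long intervals in the other would back up the observation that the reboots only happen with PAE paging.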

comment:96 by nzimmermann, 7 years ago

Blocked By:	14082 added

comment:97 by nzimmermann, 7 years ago

Filed #14082 to further track the PAE paging problem.

comment:98 by tidux, 6 years ago

I just tried Beta 1 x86_64 on an FX-8350 system with 16GB RAM and got KDLs right from the live USB environment. The same USB stick boots and works (except for the usual lack of sound) on my other system.

comment:99 by waddlesplash, 6 years ago

KDLs are not random binary crashes, so, that is totally unrelated to this ticket.

comment:100 by waddlesplash, 6 years ago

Has anyone seen this recently?

comment:101 by ttcoder, 6 years ago

Cc:	ttcoder added; degea@… removed

comment:102 by waddlesplash, 6 years ago

Blocked By:	14082 removed
Resolution:	→ fixed
Status:	assigned → closed

Given that the remaining issues on Bulldozer now have their own ticket (and are triple faults, not random application crashes), I think this one can be closed, finally.

comment:103 by nielx, 5 years ago

Milestone:	Unscheduled → R1/beta2

Assign tickets with status=closed and resolution=fixed within the R1/beta2 development window to the R1/beta2 Milestone
