Opened 14 years ago
Closed 9 years ago
#7508 closed bug (fixed)
Enabling multiple CPU cores causes system instability during builds etc.
Reported by: | stargatefan | Owned by: | axeld |
---|---|---|---|
Priority: | normal | Milestone: | R1 |
Component: | System/Kernel | Version: | R1/alpha2 |
Keywords: | | Cc: | |
Blocked By: | | Blocking: | |
Platform: | All | | |
Description
I need to know how to properly log this. Lately I have been having a lot of problems building Haiku: lots of crashes, incomplete builds, files not being found, etc. Out of sheer curiosity, while testing things to look for a correlation, I found that by using Pulse to disable 5 of the 6 CPU cores in my system, I can now build Haiku on Haiku.
I want to say I started noticing regressions with this about 4-6 weeks ago. I figured it was that Haiku didn't like my hardware, or some other bizarre problem. Anyway, I have successfully completed my first build without a gdb or KDL in about 4 weeks, so something is afoot. I have checked the backtraces etc., and all the library call problems I have seen aren't really a problem: I can find, verify, and confirm that the libraries work.
This led me to replace what I assumed was bad RAM. If you pull some other tickets I have posted about similar problems, comments were made about possible hardware problems. I have replaced everything in the PC at this point, and I still have random problems.
So I would like to help find this bug. Please tell me what tools, utilities, etc. need to be applied and what tests need to be performed to discover the issue. I think the issue is predominantly tied to AMD CPUs and the memory controller.
Tell me what I need to do to help you developers find this bug and I will do it quickly.
Attachments (9)
Change History (37)
comment:1 by , 14 years ago
Component: | - General → System/Kernel |
---|---|
Owner: | changed from | to
follow-up: 6 comment:2 by , 14 years ago
comment:3 by , 14 years ago
Replying to stargatefan:
Tell me what I need to do to help you developers find this bug and I will do it quickly.
I'd recommend to first verify whether it's a hardware problem by trying an older Haiku revision (to be on the safe side rather go back a few more weeks/months than you think). If you also have issues with that version and didn't back then, a hardware problem is most likely.
Otherwise it's obviously a Haiku regression. Depending on how long it takes you to verify whether an image does or doesn't have the issue, bisecting the revision range to track down the revision that introduced the problem, would be a relatively easy and very helpful way. Covering the last 1000 revisions (pretty exactly three months) takes about 10 steps. If you start with a narrower range -- e.g. ignoring the relatively busy last few days/weeks -- you can maybe save one step. Using the nightlies should save some time (a two-partition setup is recommended in this case -- well, actually in either case).
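As a rough illustration of the step count mentioned above (a sketch, not part of any Haiku tooling): bisecting a range of N revisions needs about log2(N) tests. The `is_bad` callback and the revision numbers below are hypothetical stand-ins for "install that revision and check whether the crashes occur".

```python
import math


def bisect_steps(revision_count):
    """Upper bound on the number of images to test for a range of revisions."""
    return math.ceil(math.log2(revision_count))


def find_first_bad(good, bad, is_bad):
    """Return the first bad revision in the range (good, bad].

    Assumes `good` is known to work, `bad` is known to fail, and the
    problem stays present once introduced. `is_bad` is a hypothetical
    callback that tests a single revision.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid   # problem already present: look earlier
        else:
            good = mid  # still fine: look later
    return bad


print(bisect_steps(1000))
# Made-up demo: pretend the problem was introduced in revision 39000.
print(find_first_bad(38500, 39500, lambda rev: rev >= 39000))
```

With an intermittent problem like this one, a single clean run does not prove a revision is good, so each `is_bad` check would probably need several builds before trusting a "good" verdict.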
Other than that the KDL stack traces would be helpful info.
follow-up: 5 comment:4 by , 14 years ago
Looking at his syslogs, could it be that his bfs partition is corrupted?
cf. https://dev.haiku-os.org/attachment/ticket/5551/syslog.11#L3369
USER: failed to create the required index for attribute BEOS:LOCALE_SIGNATURE (General system error)
follow-up: 10 comment:5 by , 14 years ago
Replying to diver:
Looking at his syslogs, could it be that his bfs partition is corrupted?
cf. https://dev.haiku-os.org/attachment/ticket/5551/syslog.11#L3369
USER: failed to create the required index for attribute BEOS:LOCALE_SIGNATURE (General system error)
Brand new hard drive, and for giggles I tried another new hard drive. Same issue. I suspect one of 2 possible sources: either the SATA driver has trouble, or the context switching is causing data loss in the RAM. I had less noticeable issues on an older Phenom I 9550 CPU, but I still got a lot of locking errors and/or what appeared to be lock/unlock thread errors on context changeover. It usually occurred during builds or during high-stress moments.
Also, what I have seen is that enabling HT on Pentium 4 CPUs invariably causes a no-boot situation. I figured it was a known situation.
I will say that my previous motherboard had a lot less trouble, but around revision 39000 it got really unstable, so I replaced it and the RAM, sent the CPU back and got a new CPU, and got 2 new HDDs. The only parts of this machine that are more than 30 days old are the 2 SATA DVD-ROMs. Those are both less than 6 months old and they work fine.
comment:6 by , 14 years ago
Replying to axeld:
Actually the most likely reason for instability is a weak/broken power supply - that's usually the first thing I replace when I encounter stability problems on someone's system, and it's usually all that needs to be done.
Since, to my knowledge, there weren't any critical changes four weeks ago, I would assume a hardware error on your side the most likely cause of the problem.
It is a 2-week-old XFX 850 power supply. I have run several stability benchmarks on Windows and Linux, with no problem with the stability of the system. For giggles I put a high-end DVOM and an oscilloscope on the power supply and touched the board traces; voltage deviation and ripple are in the microvolt range with 6 cores of HandBrake running under Windows, as well as Cinebench and 3DMark.
Machine is rock solid.
comment:8 by , 14 years ago
Welcome to the Haiku shell.
~> checkfs -c /boot
165773 nodes checked, 0 blocks not allocated, 0 blocks already set, 0 blocks could be freed
files 133760 directories 31746 attributes 144 attr. dirs 114 indices 9
~>
Is this helpful ?
comment:9 by , 14 years ago
...failed Cc /boot/home/haiku/haiku/generated.x86gcc4/objects/haiku/x86/release/add-ons/kernel/bus_managers/acpi/hwsleep.o ...
...skipped libacpi_ca.a for lack of <src!add-ons!kernel!bus_managers!acpi>hwsleep.o...
...skipped acpi for lack of libacpi_ca.a...
...skipped <HaikuImage>haiku.image-copy-files-dummy-system/add-ons/kernel/bus_managers for lack of acpi...
...skipped haiku-alpha.iso for lack of <HaikuImage>haiku.image-copy-files...
...failed updating 1 target(s)...
...skipped 4 target(s)...
...updated 254 target(s)...
~/haiku/haiku/generated.x86gcc4>
That's with all CPU cores enabled.
Dropping to 1 CPU core resolves the problem. A lot of times I get a KDL, and it typically picks on the VESA accelerant. The problem is that the system hard-locks, so I can't even backtrace. To me the problem looks like some type of data contention between the RAM and the disk write cache, etc.
If there is a way to test this, please advise.
Update: the build failed and threw the machine into KDL, and I managed to keep the system up. The syslogs are attached.
by , 14 years ago
Attachment: | syslogfailedbuild added |
---|
follow-up: 11 comment:10 by , 14 years ago
Replying to stargatefan:
Also, what I have seen is that enabling HT on Pentium 4 CPUs invariably causes a no-boot situation. I figured it was a known situation.
Unless there's a ticket for such an issue, please assume that it is not a known problem. It's been quite a while since I last ran Haiku on my old P4, but it worked well enough with HT back then.
Replying to stargatefan:
Welcome to the Haiku shell.
~> checkfs -c /boot
165773 nodes checked, 0 blocks not allocated, 0 blocks already set, 0 blocks could be freed
files 133760 directories 31746 attributes 144 attr. dirs 114 indices 9
~>
Is this helpful ?
It doesn't say anything about errors, so that doesn't give any hints other than that the block cache is probably not affected
Replying to stargatefan:
A lot of times I get a KDL, and it typically picks on the VESA accelerant. The problem is that the system hard-locks, so I can't even backtrace.
When the kernel panics, the KDL should already display a back trace. You could take a picture or get it from the boot loader (cf. https://dev.haiku-os.org/wiki/ReportingBugs#KernelBugs). If you have a USB keyboard, you need to enter KDL manually first to have a chance of it working on a later panic.
If there is a way to test this, please advise.
As written before, the most interesting information would be to determine whether this is a Haiku regression or not. So, please test an older version to verify that. If it is a Haiku regression, tracking down the exact revision (or at least a narrow range) that caused the issue via bisection would be very helpful. While that may take some time, it will probably take less time on both ends than trying to analyze the issue, particularly when KDL remains unaccessible to you.
PS: In Wiki syntax spaces at the beginning of a paragraph indent it.
follow-up: 12 comment:11 by , 14 years ago
Replying to bonefish:
Unless there's a ticket for such an issue, please assume that it is not a known problem. It's been quite a while since I last ran Haiku on my old P4, but it worked well enough with HT back then.
I will pull the CPU model number for you, but on every 2+ GHz P4 I have tried, hyperthreading had to be disabled to make the machine run. Not sure why. Is it related?
It doesn't say anything about errors, so that doesn't give any hints other than that the block cache is probably not affected
Yeah and the system is perfectly fine running on 1 core when doing heavy lifting.
Replying to stargatefan:
When the kernel panics, the KDL should already display a back trace. You could take a picture or get it from the boot loader (cf. https://dev.haiku-os.org/wiki/ReportingBugs#KernelBugs). If you have a USB keyboard, you need to enter KDL manually first to have a chance of it working on a later panic.
I got the syslogs in the ticket now. If those aren't enlightening I will take photos of the KDL tonight, though there seems to be little consistency outside of failures of the VESA driver. I also get frequent build failures due to lost or unfound files. Maybe those 2 problems are related? Is there any way to test disk and RAM caching for bugs besides KDL?
As written before, the most interesting information would be to determine whether this is a Haiku regression or not. So, please test an older version to verify that. If it is a Haiku regression, tracking down the exact revision (or at least a narrow range) that caused the issue via bisection would be very helpful. While that may take some time, it will probably take less time on both ends than trying to analyze the issue, particularly when KDL remains unaccessible to you.
I will try to test it down to a range, but I have always had some stability issues with Haiku when building.
PS: In Wiki syntax spaces at the beginning of a paragraph indent it.
Horrific typist here, please excuse me; I will try to be more cognizant of this in the future. Is this an IE problem, because I see an indent in my editing window but not in the ticket itself?
comment:12 by , 14 years ago
Replying to stargatefan:
Replying to bonefish:
I will pull the CPU model number for you, but on every 2+ GHz P4 I have tried, hyperthreading had to be disabled to make the machine run. Not sure why. Is it related?
No, please open a new ticket.
I got the syslogs in the ticket now. If those aren't enlightening I will take photos of the KDL tonight, though there seems to be little consistency outside of failures of the VESA driver.
The syslogs aren't enlightening (I had already seen them before my previous comment); they just show repeated crashes of cc1. The beginning of the syslogs is missing. It may or may not contain interesting information around the time when things start to go wrong.
I also get frequent build failures due to lost or unfound files. Maybe those 2 problems are related?
Likely.
Is there any way to test disk and RAM caching for bugs besides KDL?
No. You could try to run 6 concurrent `while true; do true; done` loops to see if problems are already caused by full CPU usage. That would hint towards/rule out a heat issue.
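The six busy loops suggested above could also be started and stopped from a small script. This is only a sketch: it assumes a Python 3 interpreter is available on the machine, and the function name `burn_cpus` and its defaults are made up for illustration.

```python
import subprocess
import sys
import time


def burn_cpus(workers=6, seconds=60.0):
    """Run `workers` busy-looping child processes for `seconds`, then stop them.

    Returns the number of workers that were still running when time was up.
    """
    # Each child is a separate interpreter spinning in an empty loop, the
    # Python counterpart of `while true; do true; done`.
    children = [
        subprocess.Popen([sys.executable, "-c", "while True: pass"])
        for _ in range(workers)
    ]
    try:
        time.sleep(seconds)
        still_running = sum(1 for child in children if child.poll() is None)
    finally:
        # Always clean up, even if the wait is interrupted.
        for child in children:
            child.kill()
            child.wait()
    return still_running
```

Calling `burn_cpus(workers=6, seconds=600)` would keep six cores busy for ten minutes. If crashes already show up under this pure CPU load, heat or power becomes a more likely suspect than the build itself.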
As written before, the most interesting information would be to determine whether this is a Haiku regression or not. So, please test an older version to verify that. If it is a Haiku regression, tracking down the exact revision (or at least a narrow range) that caused the issue via bisection would be very helpful. While that may take some time, it will probably take less time on both ends than trying to analyze the issue, particularly when KDL remains unaccessible to you.
I will try to test it down to a range, but I have always had some stability issues with Haiku when building.
Er, at least in the ticket description you say that this is a regression since about 4-6 weeks ago.
PS: In Wiki syntax spaces at the beginning of a paragraph indent it.
Horrific typist here, please excuse me; I will try to be more cognizant of this in the future.
Well, you already failed completely in your reply. In case you misunderstood me: Please don't start your paragraphs with spaces, as this leads to awful formatting.
Is this an IE problem, because I see an indent in my editing window but not in the ticket itself?
At least IE 9 shows the same formatting as Firefox does.
follow-up: 14 comment:13 by , 14 years ago
Here is a crash log, with 2 threads and 6 cores enabled. Mind you, the ICU package is in fact there.
by , 14 years ago
Attachment: | New text file added |
---|
follow-up: 15 comment:14 by , 14 years ago
Replying to stargatefan:
Here is a crash log, with 2 threads and 6 cores enabled. Mind you, the ICU package is in fact there.
The userland crashes aren't all that interesting.
comment:15 by , 14 years ago
Replying to bonefish:
Replying to stargatefan:
Here is a crash log, with 2 threads and 6 cores enabled. Mind you, the ICU package is in fact there.
The userland crashes aren't all that interesting.
The crux of the problem is that they only occur when I have multiple CPU cores enabled, regardless of thread settings.
comment:16 by , 13 years ago
OK, update: I tested a bunch of builds going back a ways, and this problem persists back to at least a2. It's kind of random, too. Anyway, here is the kernel info from the latest crash. Tracker also crashed, and I couldn't open a text editor to save the 2 backtraces from the gcc2 bin tools crash and the Tracker crash. The syslog name is crash.
follow-up: 18 comment:17 by , 13 years ago
Please don't attach any more userland crash infos. As I wrote they are not that interesting, since the userland programs just seem to be the victims. A KDL info might be of more interest.
follow-up: 19 comment:18 by , 13 years ago
Replying to bonefish:
Please don't attach any more userland crash infos. As I wrote they are not that interesting, since the userland programs just seem to be the victims. A KDL info might be of more interest.
When the system goes to gdb, how do I invoke KDL ?
follow-up: 20 comment:19 by , 13 years ago
Replying to stargatefan:
When the system goes to gdb, how do I invoke KDL ?
You can enter KDL by pressing Alt-SysReq-D. Only PS/2 keyboards and USB keyboards connected to an UHCI controller remain usable in KDL.
You might have misunderstood something, though. When the system crashes, it automatically enters KDL, printing a panic message and a stack trace. Only when applications crash, gdb is invoked. There's the special case of the app server, input server, or registrar crashing, which clears the whole screen white and runs gdb full-screen, but other than that it's the same.
I'm only insisting on the KDL info, because you mentioned KDLs in the description. The userland stack traces just indicate that somehow memory got corrupted. That's really all one can say about those. A KDL might tell more about the cause.
How much memory does the machine have? Is swap enabled? At the time the crashes start, is the memory fully used (including caches)?
follow-up: 21 comment:20 by , 13 years ago
Replying to bonefish:
You can enter KDL by pressing Alt-SysReq-D. Only PS/2 keyboards and USB keyboards connected to an UHCI controller remain usable in KDL.
Thank you, I will try that. I do have a PS/2 keyboard, so that should not be an issue.
You might have misunderstood something, though. When the system crashes, it automatically enters KDL, printing a panic message and a stack trace. Only when applications crash, gdb is invoked. There's the special case of the app server, input server, or registrar crashing, which clears the whole screen white and runs gdb full-screen, but other than that it's the same.
I have mostly gotten gdb, but on occasion I will get a KDL, though rather infrequently. I think I may not have been clear enough in my description of the problem to be helpful. I have gotten tossed into the special white-screen gdb case on 2 occasions.
I'm only insisting on the KDL info, because you mentioned KDLs in the description. The userland stack traces just indicate that somehow memory got corrupted. That's really all one can say about those. A KDL might tell more about the cause.
No problem, I will attempt to get the required information for you today. Thank you for your patience.
How much memory does the machine have? Is swap enabled? At the time the crashes start, is the memory fully used (including caches)?
I have 4 GB of "brand new" DDR3, actually underclocked to 1266 MHz instead of 1333. Just a timing thing with the chipset/CPU/motherboard. I have run memtest from Hiren's etc., and the machine checks out after multiple tests. During builds it looks to have around 1-1.2 GB of memory in use. What is swap? Is that virtual memory on the HDD? If so, no, it is not enabled. I have replaced the entire machine, too. Is there any other info aside from some KDL traces that might be useful?
The only other thing I find peculiar is that this only happens when I use more than 1 core. With 3 cores it happens; with 1 core it builds just fine. This led me to send the CPU to AMD for an RMA, and they sent it back with testing certification and the test report; everything checks out. As I noted in my original post, the PSU, MB, RAM, and HDD are all new, and the CPU was sent back to AMD and they gave it a clean bill of health, too. The Hiren's Boot CD PC tests all pass with flying colors.
So I am a bit lost as to where to go from here.
comment:21 by , 13 years ago
Replying to stargatefan:
I have 4 GB of "brand new" DDR3, actually underclocked to 1266 MHz instead of 1333. Just a timing thing with the chipset/CPU/motherboard. I have run memtest from Hiren's etc., and the machine checks out after multiple tests. During builds it looks to have around 1-1.2 GB of memory in use.
OK, that basically rules out VM subsystem issues under memory pressure.
What is swap? Is that virtual memory on the HDD?
Yep.
Is there any other info aside from some KDL traces that might be useful?
ATM, I can't think of anything.
So I am a bit lost as to where to go from here.
Me too. These kinds of random issues are incredibly hard to debug. If it was a regression, tracking down the revision to blame via binary search would be the best approach. But unless I misunderstood you it's not really a regression after all.
by , 13 years ago
Attachment: | screenshot4.png added |
---|
by , 13 years ago
Attachment: | screenshot5.png added |
---|
comment:22 by , 13 years ago
Bonefish, I attached some screenshots of ActivityMonitor while running a build. I don't know if any of that data is useful, but there seem to be lots of page faults, not that I have any sort of comparison to know what a normal number is. I also monitored CPU temps: in the 60-65 °C range with 12 threads running on 6 cores. Oddly, I thought the system was using 1 GB of RAM during builds; obviously not. It is hovering around 230-500 MB with 3 GB of cached RAM.
I tried for 2 hours to get the thing to crash and built 5 times successively without so much as a complaint. If I try again tomorrow, it'll likely crash every time I hit enter. Such is life, I guess. I thought maybe this info could be useful, so I attached it.
On the hardware side of things:
HDD temperature 20 °C, CPU 59-61 °C, checked with a surface K-type thermocouple; RAM was around 40 °C. Voltage under load deviated less than 0.01 from specification to all devices.
Still a bit mystified.
by , 13 years ago
Attachment: | New text file.2 added |
---|
comment:23 by , 13 years ago
You are having some type of memory issue. From your syslog:
KERN: write access attempted on write-protected area 0xe91e0 at 0x004b5000
KERN: vm_page_fault: vm_soft_fault returned error 'Permission denied' on fault at 0x4b5ae3, ip 0x4938c0, write 1, user 1, thread 0x407e
KERN: vm_page_fault: thread "cc1" (16510) in team "cc1" (16510) tried to write address 0x4b5ae3, ip 0x4938c0 ("cc1_seg0ro" +0x2938c0)
That message keeps repeating over and over. The vm in vm_page_fault refers to virtual memory; i.e., a thread (here cc1) is trying to write to a write-protected (read-only) memory region and is denied access.
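To see how often the message repeats and whether it is always the same program and address, the syslog could be summarized with a few lines of script. This is a hedged sketch: the regular expression is inferred only from the excerpt quoted above and may not match every variant the kernel prints, and `summarize_faults` is a made-up helper name.

```python
import re
from collections import Counter

# Pattern inferred from the single "tried to write address" line quoted in
# this comment; other kernel message variants may need a different pattern.
FAULT_RE = re.compile(
    r'vm_page_fault: thread "(?P<thread>[^"]+)" \((?P<tid>\d+)\) in team '
    r'"(?P<team>[^"]+)" \(\d+\) tried to (?P<access>\w+) address '
    r'(?P<address>0x[0-9a-fA-F]+)'
)


def summarize_faults(lines):
    """Count faults per (team, access type, faulting address)."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            counts[(match["team"], match["access"], match["address"])] += 1
    return counts


sample = [
    "KERN: write access attempted on write-protected area 0xe91e0 at 0x004b5000",
    'KERN: vm_page_fault: thread "cc1" (16510) in team "cc1" (16510) tried to '
    'write address 0x4b5ae3, ip 0x4938c0 ("cc1_seg0ro" +0x2938c0)',
]
for key, count in summarize_faults(sample).most_common():
    print(key, count)
```

Passing an open syslog file instead of `sample` would give one line per distinct fault. Many different teams and addresses would point towards general memory corruption, while a single repeating entry would point at one misbehaving program.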
What to do? This could be a Haiku or hardware issue.
You ran memtest86+ 5 times or more? It passed all 5 times?
"I have 4 GB of "brand new" DDR3, actually underclocked to 1266 MHz instead of 1333. Just a timing thing with the chipset/CPU/motherboard."
Underclocking RAM because of a timing issue suggests you may have a hardware issue. RAM should run at its default clock speed and pass memtest86+ 5 or more times.
1st, run checkfs to see if your filesystem is corrupt.
2nd, both your screenshots show the swap file (swap space, virtual memory) enabled. Try with swap space at a) 0 MB (no swap), b) 500 MB, and c) 2 GB, i.e., zero swap, small swap, and large swap, to see if that changes anything.
Right now there seems to be an issue with disabling the swap file.
http://dev.haiku-os.org/ticket/7550
http://dev.haiku-os.org/ticket/7742
So, turn virtual memory off in preferences, delete swap (/boot/common/var/swap) and run checkfs to fix any inode issues.
follow-up: 25 comment:24 by , 13 years ago
Maybe the swap file is corrupt? That's another reason to delete swap and run checkfs. That way you can start with a "fresh" swap file and fix any filesystem issues.
comment:25 by , 13 years ago
Replying to tonestone57:
Maybe the swap file is corrupt? That's another reason to delete swap and run checkfs. That way you can start with a "fresh" swap file and fix any filesystem issues.
Tried that. I have a feeling it's a disk read/virtual memory problem, because when I shrink the virtual memory size to 5 MB it gets really pissed off. I was thinking that a few weeks ago. All hardware passes all the tests I can try. As to the memory being underclocked, that's what it took to make the timing and the CPU/NB all perfectly happy with the clock speed the CPU is running at and the way the board is set up. It's happy running under very heavy loads on both Linux and Windows. It's got to be some stupid problem, too.
comment:26 by , 13 years ago
You did disable virtual memory (swap), right? The swap size should be zero, as checked in ActivityMonitor. If swap is disabled and it is still acting up, that would indicate a really big problem. The issue may relate to virtual memory or even something else.
You have to really push the system to get it to KDL. At least then you can give some debug info. Maybe try jam -j6 plus running a bunch of other programs at the same time, like Teapot, Haiku3d, Chart, etc. Hopefully that gets the OS to crash for you. It may also corrupt files, so be ready to deal with that. You need to run lots and lots of programs to max out your CPUs and push the OS very hard to crash.
Virtual memory was not disabling right, and I do not think that bug has been fixed yet. Disable virtual memory from preferences, reboot, delete /boot/common/var/swap, reboot, run checkfs, reboot, and then try running jam and check ActivityMonitor for a swap size of zero.
Only by giving KDL output do you have any chance of getting this bug fixed. That means getting the OS to crash into KDL.
comment:27 by , 10 years ago
Hi, Do you still have these problems? A lot of things were fixed in Haiku since 3 years ago.
comment:28 by , 9 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
No response in 14 months; assuming fixed.