Opened 13 years ago

Closed 13 years ago

#1018 closed bug (fixed)

Booting on Athlon 64 X2 fails with both cpus enabled (vmware)

Reported by: ekdahl Owned by: marcusoverhagen
Priority: blocker Milestone: R1
Component: System/Kernel Version: R1/pre-alpha1
Keywords: Cc: axeld, geist
Blocked By: Blocking:
Has a Patch: no Platform: x86

Description

Booting stops at the boot screen when enabling both cpus. Using only one cpu works. I'm attaching serial logs for one cpu and both cpus. Tested in vmware workstation with hrev20122.

Attachments (11)

serial_output_1_cpu.txt (31.1 KB ) - added by ekdahl 13 years ago.
serial_output_2_cpus.txt (4.1 KB ) - added by ekdahl 13 years ago.
new_serial_output_2_cpus.txt (5.2 KB ) - added by ekdahl 13 years ago.
newer_serial_output_2_cpus.txt (4.4 KB ) - added by ekdahl 13 years ago.
r20157.txt (10.3 KB ) - added by marcusoverhagen 13 years ago.
r20159.txt (22.6 KB ) - added by marcusoverhagen 13 years ago.
r20162_serial_output.txt (5.0 KB ) - added by ekdahl 13 years ago.
r20231_scheduler_trace.txt (222.9 KB ) - added by marcusoverhagen 13 years ago.
serial_output_2_cpus_r20359_occasional.txt (23.4 KB ) - added by ekdahl 13 years ago.
Gets to the point where I can see the desktop background color and the mouse cursor, only gets this far sometimes.
serial_output_2_cpus_r20359_regular.txt (21.9 KB ) - added by ekdahl 13 years ago.
This is where it most often stops
serial_output_1_cpu_r20359.txt (27.9 KB ) - added by ekdahl 13 years ago.
As can be seen in this log, booting on one cpu works


Change History (36)

by ekdahl, 13 years ago

Attachment: serial_output_1_cpu.txt added

by ekdahl, 13 years ago

Attachment: serial_output_2_cpus.txt added

comment:1 by jfreeman, 13 years ago

This happens on real hardware, as well, for me.

comment:2 by mmu_man, 13 years ago

I remember hearing that apm was dangerous on SMP... Still, I used to load the apm driver in R5 on a dual celeron (BP6) and it worked fine for powering down. But since it's the last stuff showing up in the log... did you try disabling it ?

in reply to:  2 ; comment:3 by ekdahl, 13 years ago

Replying to mmu_man:

I remember hearing that apm was dangerous on SMP... Still, I used to load the apm driver in R5 on a dual celeron (BP6) and it worked fine for powering down. But since it's the last stuff showing up in the log... did you try disabling it ?

It makes no difference, debug output is exactly the same, so I'm wondering if it really gets disabled. I tried disabling it both in kernel settings file and boot menu.

in reply to:  3 comment:4 by mt, 13 years ago

Replying to ekdahl:

It makes no difference, debug output is exactly the same, so I'm wondering if it really gets disabled. I tried disabling it both in kernel settings file and boot menu.

Does your boot menu show "Disable Hyper-Threading"? My Core2Duo machine did (the C2D does not support HT). I changed supports_hyper_threading() in boot/platform/bios_ia32/smp.cpp to always return false, then chose "disable smp" from the boot menu, and now it runs well.

comment:5 by marcusoverhagen, 13 years ago

Cc: axeld added
Owner: changed from axeld to marcusoverhagen
Priority: normal → blocker

I have the same problem here with Core 2 Duo E6600. The regression occurred in hrev20072.

I'm going to debug this.

comment:6 by marcusoverhagen, 13 years ago

Status: new → assigned

comment:7 by geist, 13 years ago

I think I fixed it with change 20154. The new cpuid code was writing to the current cpu structure before it was set up on non-boot cpus. The solution was to change the ordering of initialization a bit on non-boot cpus, which isn't a generally great solution but should work for now. See if it repros on your machine.
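The ordering problem described in this comment can be illustrated with a small, self-contained sketch. It is not the actual change 20154: `cpu_info`, `gCpus`, `init_percpu`, `detect_cpu` and `bring_up_cpu` are hypothetical names made up for illustration, and the cpuid step is simulated by passing in a vendor string.

```c
#include <stdbool.h>
#include <string.h>

#define MAX_CPUS 2

/* Hypothetical per-CPU record, not Haiku's real structure. */
typedef struct {
    bool ready;        /* true once the per-CPU structure has been set up */
    char vendor[16];   /* filled in by the (simulated) cpuid step */
} cpu_info;

cpu_info gCpus[MAX_CPUS];

/* Step 1: set up the per-CPU structure so later code may write to it. */
void init_percpu(int cpu)
{
    memset(&gCpus[cpu], 0, sizeof(gCpus[cpu]));
    gCpus[cpu].ready = true;
}

/* Step 2: store cpuid results. Refuses to write if step 1 has not run,
   which models the faulty ordering described in the comment. */
bool detect_cpu(int cpu, const char *vendor)
{
    if (!gCpus[cpu].ready)
        return false;
    /* The structure was zeroed in step 1, so the last byte stays '\0'. */
    strncpy(gCpus[cpu].vendor, vendor, sizeof(gCpus[cpu].vendor) - 1);
    return true;
}

/* Bring up one non-boot CPU in the safe order: structure first, cpuid second. */
bool bring_up_cpu(int cpu, const char *vendor)
{
    if (cpu < 0 || cpu >= MAX_CPUS)
        return false;
    init_percpu(cpu);
    return detect_cpu(cpu, vendor);
}
```

Calling `detect_cpu()` before `init_percpu()` fails in this sketch; in a real kernel the equivalent mistake would be a write through an uninitialized per-CPU pointer rather than a clean error return.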

comment:8 by ekdahl, 13 years ago

It gets a little bit further now. I've attached the new serial debug output.

by ekdahl, 13 years ago

Attachment: new_serial_output_2_cpus.txt added

comment:9 by geist, 13 years ago

Cc: geist added

by ekdahl, 13 years ago

Attachment: newer_serial_output_2_cpus.txt added

by marcusoverhagen, 13 years ago

Attachment: r20157.txt added

by marcusoverhagen, 13 years ago

Attachment: r20159.txt added

comment:11 by geist, 13 years ago

got another fix for it in 20160. give it a whirl.

by ekdahl, 13 years ago

Attachment: r20162_serial_output.txt added

in reply to:  description comment:12 by tigerdog, 13 years ago

Replying to ekdahl:

Booting stops at the boot screen when enabling both cpus. Using only one cpu works. I'm attaching serial logs for one cpu and both cpus. Tested in vmware workstation with hrev20122.

Happens here on real hardware. ECS mobo, Athlon 64x2 4200+. I have no serial debug capability at the moment. With the 20070218 build, booting stops after displaying features of CPU 1. With multiprocessor support disabled from boot menu, system boots normally.

comment:13 by geist, 13 years ago

I fixed it after the 0218 build, try it with a newer one.

in reply to:  13 comment:14 by mt, 13 years ago

Replying to geist:

I fixed it after the 0218 build, try it with a newer one.

I can't boot my Core2Duo machine with multiprocessor support enabled in hrev20177.

in reply to:  13 comment:15 by tigerdog, 13 years ago

Replying to geist:

I fixed it after the 0218 build, try it with a newer one.

I tried hrev20182 from haikuhost.com on real hardware. Still fails unless I disable SMP during boot. Console output (copied by hand - no serial debug here):

code32 0xf000, 0x80bc, length 0xc9ea
code16 0xf000, length 0x418c
data 0xfdf0, length 0x0
CPU1: type 0 family 15 model 11 stepping 1 string AuthenticAMD
CPU1: vendor 'AMD' model name "AMD Athlon(tm) 64 x2 Dual Core Processor 4200+'
CPU1: features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clfsh mmx fxsr sse sse2 ntt sse3 syscall mx mmxest ffxsr long 3dnowext 3dnow

I hope this helps.

comment:16 by marcusoverhagen, 13 years ago

http://svn.berlios.de/viewcvs/haiku/haiku/trunk/src/system/kernel/smp.c?rev=20200&r1=20160&r2=20200

After that fix, and with scheduler tracing enabled, it booted all the way until it panicked because no boot volume was found (I'm still working on AHCI support).

Without scheduler tracing, it stops pretty early, though later than before the fix; see http://pastebin.ca/368034

comment:17 by tigerdog, 13 years ago

still seeing the same behavior with hrev20208, downloaded on 2007-02-22.

comment:18 by geist, 13 years ago

Sorry, I'm having a terrible time reproducing this. I recently fired up an old dual athlon MP I had lying around to try to reproduce it, but I'm getting something else that's blocking debugging it. I'm pretty sure some of the smp code is kind of rotted a bit, and I've made it worse before it gets better.

If anyone can reproduce it and figure out precisely what the problem is you'll be my hero. I just can't do it here.

by marcusoverhagen, 13 years ago

Attachment: r20231_scheduler_trace.txt added

comment:19 by marcusoverhagen, 13 years ago

I applied a small cleanup to arch_smp.c and also added volatile to apic access, but this doesn't help.
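For readers unfamiliar with why `volatile` matters for memory-mapped registers such as the APIC's, here is a minimal sketch. It is not the code from arch_smp.c: `mmio_read32` and `mmio_write32` are hypothetical helper names, and the register block is passed in as a plain pointer so the example can be run against an ordinary array.

```c
#include <stdint.h>

/* Read a 32-bit register at a byte offset from a mapped base address.
   The volatile qualifier forces the compiler to perform the load every
   time instead of reusing a value it read earlier. */
uint32_t mmio_read32(const volatile void *base, uint32_t offset)
{
    return *(const volatile uint32_t *)((const volatile uint8_t *)base + offset);
}

/* Write a 32-bit register. With volatile, the store cannot be dropped or
   merged with another store by the optimizer. */
void mmio_write32(volatile void *base, uint32_t offset, uint32_t value)
{
    *(volatile uint32_t *)((volatile uint8_t *)base + offset) = value;
}
```

Note that `volatile` only constrains the compiler. It does not provide atomicity or ordering between CPUs, which is consistent with the observation above that adding it did not fix the hang.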

I enabled TRACE in main.c, scheduler.cpp and arch_smp.c. I did not enable TRACE_TIMER in arch_smp.c. Please have a look at r20231_scheduler_trace.txt.

  • when not enabling tracing in scheduler.cpp, the system stops at "INIT : main: done... begin idle loop on cpu 1"
  • reschedule is never executed on cpu 1
  • "inter-cpu interrupt on cpu 1" appears frequently. What does it do?
  • when enabling TRACE_TIMER in arch_smp.c, reschedule is executed on both cpus
  • the apic time function only disables interrupts, is this enough?
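On the last question, one general point (not a diagnosis of this bug): disabling interrupts only keeps the local CPU from being interrupted. State that another CPU can touch needs a lock as well, while purely per-CPU state may not. A minimal spinlock sketch using C11 atomics follows; the `spinlock` type and function names are hypothetical and are not Haiku's kernel API.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spinlock built on a C11 atomic_flag. */
typedef struct {
    atomic_flag held;
} spinlock;

/* Try once to take the lock. atomic_flag_test_and_set returns the previous
   value, so "false" means the flag was clear and we now own the lock. */
bool spinlock_try_acquire(spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->held, memory_order_acquire);
}

/* Spin until the lock is ours. In a kernel the caller would also disable
   local interrupts first; that step is omitted here because it cannot be
   done in user space. */
void spinlock_acquire(spinlock *lock)
{
    while (!spinlock_try_acquire(lock))
        ;
}

/* Release the lock so another CPU can take it. */
void spinlock_release(spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

A lock declared as `spinlock lock = { ATOMIC_FLAG_INIT };` starts out free.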

comment:20 by marcusoverhagen, 13 years ago

This appears to be the same bug: http://axeld.blogspot.com/2005/10/not-yet.html

in reply to:  20 comment:21 by tigerdog, 13 years ago

Replying to marcusoverhagen:

This appears to be the same bug: http://axeld.blogspot.com/2005/10/not-yet.html

Maybe yes, maybe no. The real puzzle is that sometime in the recent past (within the last 60 days or so) Haiku did boot correctly and run on both CPUs of my Athlon 64. Then I stopped downloading nightly builds for a while; now it doesn't work. I'll try to research and find out when things stopped by loading older images if I can find them.

comment:22 by marcusoverhagen, 13 years ago

Seems to work now, after the recent changes made by geist.

However, I can only test the boot process up to the point where the root partition is supposed to be mounted.

Can anyone else confirm?

comment:23 by tigerdog, 13 years ago

Tested here using 1 March image from BuildFactory on real hardware (Athlon64x2 3800+.) Boot no longer stops at the point indicated in my previous comment. Booting dies after appserver starts (thread 47 caused segment violation) but this may be unrelated.

comment:24 by geist, 13 years ago

That's what I'm seeing too. Later on the system dies because the first couple of processes get clobbered somehow. I don't think it's SMP related; it may be something we're not doing right on newer cpus, or it could be just a fast machine problem. I vote to mark this one closed and track the app_server/sh/whatever failures with another bug.

comment:25 by marcusoverhagen, 13 years ago

Resolution: fixed
Status: assigned → closed

(I don't know why the font is set to bold)

Closing this bug.

by ekdahl, 13 years ago

Attachment: serial_output_2_cpus_r20359_occasional.txt added

Gets to the point where I can see the desktop background color and the mouse cursor, only gets this far sometimes.

by ekdahl, 13 years ago

Attachment: serial_output_2_cpus_r20359_regular.txt added

This is where it most often stops

by ekdahl, 13 years ago

Attachment: serial_output_1_cpu_r20359.txt added

As can be seen in this log, booting on one cpu works
