Opened 12 years ago

Closed 12 years ago

#1018 closed bug (fixed)

Booting on Athlon 64 X2 fails with both cpus enabled (vmware)

Reported by: ekdahl
Owned by: marcusoverhagen
Priority: blocker
Milestone: R1
Component: System/Kernel
Version: R1/pre-alpha1
Keywords:
Cc: axeld, geist
Blocked By:
Blocking:
Has a Patch: no
Platform: x86

Description

Booting stops at the boot screen when enabling both cpus. Using only one cpu works. I'm attaching serial logs for one cpu and both cpus. Tested in vmware workstation with hrev20122.

Attachments (11)

serial_output_1_cpu.txt (31.1 KB) - added by ekdahl 12 years ago.
serial_output_2_cpus.txt (4.1 KB) - added by ekdahl 12 years ago.
new_serial_output_2_cpus.txt (5.2 KB) - added by ekdahl 12 years ago.
newer_serial_output_2_cpus.txt (4.4 KB) - added by ekdahl 12 years ago.
r20157.txt (10.3 KB) - added by marcusoverhagen 12 years ago.
r20159.txt (22.6 KB) - added by marcusoverhagen 12 years ago.
r20162_serial_output.txt (5.0 KB) - added by ekdahl 12 years ago.
r20231_scheduler_trace.txt (222.9 KB) - added by marcusoverhagen 12 years ago.
serial_output_2_cpus_r20359_occasional.txt (23.4 KB) - added by ekdahl 12 years ago.
Gets to the point where I can see the desktop background color and the mouse cursor, only gets this far sometimes.
serial_output_2_cpus_r20359_regular.txt (21.9 KB) - added by ekdahl 12 years ago.
This is where it most often stops
serial_output_1_cpu_r20359.txt (27.9 KB) - added by ekdahl 12 years ago.
As can be seen in this log, booting on one cpu works


Change History (36)

Changed 12 years ago by ekdahl

Attachment: serial_output_1_cpu.txt added

Changed 12 years ago by ekdahl

Attachment: serial_output_2_cpus.txt added

comment:1 Changed 12 years ago by jfreeman

This happens on real hardware, as well, for me.

comment:2 Changed 12 years ago by mmu_man

I remember hearing that apm was dangerous on SMP... Still, I used to load the apm driver in R5 on a dual celeron (BP6) and it worked fine for powering down. But since it's the last stuff showing up in the log... did you try disabling it ?

comment:3 in reply to: 2 Changed 12 years ago by ekdahl

Replying to mmu_man:

I remember hearing that apm was dangerous on SMP... Still, I used to load the apm driver in R5 on a dual celeron (BP6) and it worked fine for powering down. But since it's the last stuff showing up in the log... did you try disabling it ?

It makes no difference, debug output is exactly the same, so I'm wondering if it really gets disabled. I tried disabling it both in kernel settings file and boot menu.

comment:4 in reply to:  3 Changed 12 years ago by mt

Replying to ekdahl:

It makes no difference, debug output is exactly the same, so I'm wondering if it really gets disabled. I tried disabling it both in kernel settings file and boot menu.

Does your boot menu show "Disable Hyper-Threading"? My Core 2 Duo machine did (the C2D does not support HT). I changed supports_hyper_threading() in boot/platform/bios_ia32/smp.cpp to always return false, then selected "disable smp" from the boot menu, and now it runs well.

comment:5 Changed 12 years ago by marcusoverhagen

Cc: axeld added
Owner: changed from axeld to marcusoverhagen
Priority: changed from normal to blocker

I have the same problem here with Core 2 Duo E6600. The regression occurred in hrev20072.

I'm going to debug this.

comment:6 Changed 12 years ago by marcusoverhagen

Status: changed from new to assigned

comment:7 Changed 12 years ago by geist

I think I fixed it with change 20154. The new cpuid code was writing to the current cpu structure before it was set up on non-boot cpus. The solution was to change the ordering of initialization a bit on non-boot cpus, which isn't a great solution in general but should work for now. See if it still repros on your machine.
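The ordering problem described above can be illustrated with a small stand-alone C sketch. It is not the Haiku kernel code: `cpu_ent`, `gCPU`, `current_cpu`, `install_cpu_struct()`, `detect_cpu_features()` and `init_secondary_cpu()` are hypothetical names, and the CPUID query is replaced by a constant. It only shows why the per-CPU structure has to be installed before feature detection writes into it.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CPUS 2

/* Hypothetical per-CPU record; the real kernel structure differs. */
typedef struct cpu_ent {
    int      cpu_num;
    unsigned features;     /* filled in by feature detection */
    int      initialized;
} cpu_ent;

static cpu_ent gCPU[MAX_CPUS];

/* Stand-in for "the current CPU structure". A real kernel looks this
 * up per CPU; a single pointer is enough for the illustration. */
static cpu_ent *current_cpu = NULL;

/* Step 1: make the per-CPU structure reachable. */
static void install_cpu_struct(int cpu_num)
{
    gCPU[cpu_num].cpu_num = cpu_num;
    gCPU[cpu_num].initialized = 1;
    current_cpu = &gCPU[cpu_num];
}

/* Step 2: feature detection writes into the current CPU structure.
 * Returns 0 on success, -1 if called before install_cpu_struct(). */
static int detect_cpu_features(void)
{
    if (current_cpu == NULL || !current_cpu->initialized)
        return -1;
    current_cpu->features = 0x1u;  /* placeholder for a CPUID result */
    return 0;
}

/* Bring-up for a non-boot CPU: structure first, detection second. */
static int init_secondary_cpu(int cpu_num)
{
    assert(cpu_num > 0 && cpu_num < MAX_CPUS);
    install_cpu_struct(cpu_num);
    return detect_cpu_features();
}
```

In this sketch, calling `detect_cpu_features()` before `install_cpu_struct()` fails cleanly with -1. In a kernel the equivalent mistake would be a write through an unset pointer, which is consistent with an early hang on the second CPU.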

comment:8 Changed 12 years ago by ekdahl

It gets a little bit further now. I've attached the new serial debug output.

Changed 12 years ago by ekdahl

comment:9 Changed 12 years ago by geist

Cc: geist added

comment:10 Changed 12 years ago by geist

Changed 12 years ago by ekdahl

Changed 12 years ago by marcusoverhagen

Attachment: r20157.txt added

Changed 12 years ago by marcusoverhagen

Attachment: r20159.txt added

comment:11 Changed 12 years ago by geist

got another fix for it in 20160. give it a whirl.

Changed 12 years ago by ekdahl

Attachment: r20162_serial_output.txt added

comment:12 in reply to:  description Changed 12 years ago by tigerdog

Replying to ekdahl:

Booting stops at the boot screen when enabling both cpus. Using only one cpu works. I'm attaching serial logs for one cpu and both cpus. Tested in vmware workstation with hrev20122.

Happens here on real hardware. ECS mobo, Athlon 64x2 4200+. I have no serial debug capability at the moment. With the 20070218 build, booting stops after displaying features of CPU 1. With multiprocessor support disabled from boot menu, system boots normally.

comment:13 Changed 12 years ago by geist

I fixed it after the 0218 build, try it with a newer one.

comment:14 in reply to:  13 Changed 12 years ago by mt

Replying to geist:

I fixed it after the 0218 build, try it with a newer one.

I can't boot my Core2Duo machine with multiprocessor in hrev20177.

comment:15 in reply to:  13 Changed 12 years ago by tigerdog

Replying to geist:

I fixed it after the 0218 build, try it with a newer one.

I tried hrev20182 from haikuhost.com on real hardware. Still fails unless I disable SMP during boot. Console output (copied by hand - no serial debug here):

code32 0xf000, 0x80bc, length 0xc9ea
code16 0xf000, length 0x418c
data 0xfdf0, length 0x0
CPU1: type 0 family 15 model 11 stepping 1 string AuthenticAMD
CPU1: vendor 'AMD' model name "AMD Athlon(tm) 64 x2 Dual Core Processor 4200+'
CPU1: features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clfsh mmx fxsr sse sse2 ntt sse3 syscall mx mmxest ffxsr long 3dnowext 3dnow
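The `features:` line in the console output above is the kind of list a kernel prints after decoding CPUID feature bits. As a rough illustration (not Haiku's actual code), the sketch below maps the bits of the CPUID leaf 1 EDX register to their conventional short names. The function name `format_features()`, the table name and the buffer handling are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Names for CPUID leaf 1, EDX bits 0..31; NULL marks reserved bits.
 * Short names follow common usage and may differ from a given kernel's. */
static const char *const kEdxFeatureNames[32] = {
    "fpu",  "vme",   "de",   "pse",   "tsc",  "msr", "pae",  "mce",
    "cx8",  "apic",  NULL,   "sep",   "mtrr", "pge", "mca",  "cmov",
    "pat",  "pse36", "psn",  "clfsh", NULL,   "ds",  "acpi", "mmx",
    "fxsr", "sse",   "sse2", "ss",    "htt",  "tm",  NULL,   "pbe"
};

/* Writes a space-separated list of the features set in `edx` into `out`.
 * Returns the number of characters written (excluding the terminator),
 * or -1 if the buffer is too small. */
static int format_features(uint32_t edx, char *out, size_t size)
{
    size_t used = 0;

    assert(out != NULL && size > 0);
    out[0] = '\0';

    for (int bit = 0; bit < 32; bit++) {
        const char *name = kEdxFeatureNames[bit];
        if (name == NULL || (edx & (UINT32_C(1) << bit)) == 0)
            continue;

        /* Room needed: optional separator + name + terminator. */
        size_t need = strlen(name) + (used > 0 ? 1 : 0);
        if (used + need + 1 > size)
            return -1;

        if (used > 0)
            out[used++] = ' ';
        memcpy(out + used, name, strlen(name) + 1);
        used += strlen(name);
    }
    return (int)used;
}
```

Only leaf 1 EDX is covered here. Names in the output above such as sse3 (leaf 1 ECX) or syscall, long, 3dnowext and 3dnow (extended leaf 0x80000001) would need additional tables.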

I hope this helps.

comment:16 Changed 12 years ago by marcusoverhagen

http://svn.berlios.de/viewcvs/haiku/haiku/trunk/src/system/kernel/smp.c?rev=20200&r1=20160&r2=20200

After that fix, and with scheduler tracing enabled, it booted all the way until it panicked because no boot volume was found (I'm still working on AHCI support).

Without scheduler tracing, it stops pretty early, though later than before the fix; see http://pastebin.ca/368034

comment:17 Changed 12 years ago by tigerdog

still seeing the same behavior with hrev20208, downloaded on 2007-02-22.

comment:18 Changed 12 years ago by geist

Sorry, I'm having a terrible time reproducing this. I recently fired up an old dual Athlon MP I had lying around to try to reproduce it, but I'm hitting something else that blocks debugging it. I'm pretty sure some of the SMP code has rotted a bit, and I've made it worse before it gets better.

If anyone can reproduce it and figure out precisely what the problem is you'll be my hero. I just can't do it here.

Changed 12 years ago by marcusoverhagen

Attachment: r20231_scheduler_trace.txt added

comment:19 Changed 12 years ago by marcusoverhagen

I applied a small cleanup to arch_smp.c and also added volatile to apic access, but this doesn't help.
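For context on the `volatile` change: the local APIC is accessed through memory-mapped registers, and without `volatile` a compiler is free to cache or drop such reads and writes. A minimal sketch of that access pattern follows. The helper names `apic_read()` and `apic_write()` and the ordinary array standing in for the mapped register page are assumptions for illustration, not the code in arch_smp.c. The offsets used are the architectural ones for the local APIC ID (0x20) and EOI (0xB0) registers.

```c
#include <assert.h>
#include <stdint.h>

#define APIC_ID  0x020  /* local APIC ID register offset */
#define APIC_EOI 0x0B0  /* end-of-interrupt register offset */

/* Stand-in for the mapped 4 KiB APIC register page (hypothetical);
 * a kernel would map the physical APIC address instead. */
static uint32_t sFakeApicPage[0x1000 / sizeof(uint32_t)];

/* All accesses go through a volatile pointer so the compiler must
 * perform every read and write exactly as written. */
static volatile uint32_t *sApic = sFakeApicPage;

static uint32_t apic_read(uint32_t offset)
{
    assert(offset % 4 == 0 && offset < 0x1000);
    return sApic[offset / 4];
}

static void apic_write(uint32_t offset, uint32_t value)
{
    assert(offset % 4 == 0 && offset < 0x1000);
    sApic[offset / 4] = value;
}
```

A typical use would be `apic_write(APIC_EOI, 0)` at the end of an interrupt handler, or `apic_read(APIC_ID) >> 24` to obtain the APIC ID of the executing CPU.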

I enabled TRACE in main.c, scheduler.cpp and arch_smp.c. I did not enable TRACE_TIMER in arch_smp.c. Please have a look at r20231_scheduler_trace.txt.

  • when tracing is not enabled in scheduler.cpp, the system stops at "INIT : main: done... begin idle loop on cpu 1"
  • reschedule is never executed on cpu 1
  • "inter-cpu interrupt on cpu 1" appears frequently. What does it do?
  • when TRACE_TIMER is enabled in arch_smp.c, reschedule is executed on both cpus
  • the apic time function only disables interrupts; is this enough?

comment:20 Changed 12 years ago by marcusoverhagen

This appears to be the same bug: http://axeld.blogspot.com/2005/10/not-yet.html

comment:21 in reply to:  20 Changed 12 years ago by tigerdog

Replying to marcusoverhagen:

This appears to be the same bug: http://axeld.blogspot.com/2005/10/not-yet.html

Maybe yes, maybe no. The real puzzle is that sometime in the recent past (within the last 60 days or so) Haiku did boot correctly and run on both CPUs of my Athlon 64. Then I stopped downloading nightly builds for a while; now it doesn't work. I'll try to research and find out when things stopped by loading older images if I can find them.

comment:22 Changed 12 years ago by marcusoverhagen

Seems to work now, after the recent changes made by geist.

However, I can only test the boot process up to the point where the root partition is supposed to be mounted.

Can anyone else confirm?

comment:23 Changed 12 years ago by tigerdog

Tested here using 1 March image from BuildFactory on real hardware (Athlon64x2 3800+.) Boot no longer stops at the point indicated in my previous comment. Booting dies after appserver starts (thread 47 caused segment violation) but this may be unrelated.

comment:24 Changed 12 years ago by geist

That's what I'm seeing too. Later on the system dies because the first couple of processes get clobbered somehow. I don't think it's SMP related; it may be something we're not doing right on newer cpus, or it could just be a fast-machine problem. I vote to mark this one closed and track the app_server/sh/whatever failures with another bug.

comment:25 Changed 12 years ago by marcusoverhagen

Resolution: fixed
Status: changed from assigned to closed

(I don't know why the font is set to bold)

Closing this bug.

Changed 12 years ago by ekdahl

Gets to the point where I can see the desktop background color and the mouse cursor, only gets this far sometimes.

Changed 12 years ago by ekdahl

This is where it most often stops

Changed 12 years ago by ekdahl

As can be seen in this log, booting on one cpu works
