Opened 10 years ago

Closed 6 years ago

#10509 closed enhancement (fixed)

Stack is not aligned

Reported by: pulkomandy Owned by: nobody
Priority: normal Milestone: Unscheduled
Component: System/Kernel Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: x86-64

Description

Our stack is aligned only on 8-byte. GCC sometimes generates code that assumes a 16-byte alignment, with instructions such as MOVDQA or MOVAPS. This leads to a "General Protection Fault" crash.

This can be worked around with -mtackrealign, but this makes the code slower. Instead, we should fix all places where our code doesn't guarantee stack alignment.

Attachments (5)

HaikuLauncher-43356-debug-06-02-2014-11-14-11.report (14.0 KB ) - added by pulkomandy 10 years ago.
Crash report.
HaikuLauncher-50315-debug-06-02-2014-12-20-21.report (30.1 KB ) - added by pulkomandy 10 years ago.
align-stack.diff (3.5 KB ) - added by bonefish 10 years ago.
Implementation suggestion (untested)
10509_x86_64.patch (3.7 KB ) - added by korli 10 years ago.
x86_64 patch
0002-x86_64-Glue-code-Keep-stack-16-byte-aligned.patch (1.3 KB ) - added by simonsouth 8 years ago.
Glue code should not be un-aligning stack

Download all attachments as: .zip

Change History (36)

by pulkomandy, 10 years ago

Crash report.

comment:1 by korli, 10 years ago

This crash report involves a debugger call. Not a crash?

comment:2 by korli, 10 years ago

Component: System/KernelSystem
Owner: changed from axeld to nobody

comment:3 by bonefish, 10 years ago

Component: SystemSystem/Kernel
Owner: changed from nobody to axeld

The initial stack for a new thread is set up in the architecture specific kernel function arch_thread_enter_userspace() (x86, x86-64). All that has to be done is align the value of stackTop to 16 and offset it as necessary (note, the stack grows downward, so the address must only be decreased).

I considered doing that already, but it hadn't been clear to me what that offset should be. According to the crash report the movaps instruction tries to access 0xffffff48(%ebp), which, I suppose, means that for x86 ebp (the stack frame pointer) should be offset by 8, which means that for the initial stack pointer (esp) esp % 16 == 12 should hold (due to ebp = esp - 4). The main() prologue gcc generates seems to agree:

    1ad5:       8d 4c 24 04             lea    0x4(%esp),%ecx
    1ad9:       83 e4 f0                and    $0xfffffff0,%esp
    1adc:       ff 71 fc                pushl  -0x4(%ecx)
    1adf:       55                      push   %ebp
    1ae0:       89 e5                   mov    %esp,%ebp

I think the best way to implement that is by having arch_randomize_stack_pointer() enforce that property and change all callers (namely arch_thread_enter_userspace() and get_signal_stack()) to supply an address that has already been adjusted to make the needed room on the stack.

comment:4 by kallisti5, 10 years ago

Milestone: UnscheduledR1/alpha5

comment:5 by pulkomandy, 10 years ago

Sorry, I'm not too familiar with this code. Modifying arch_randomize_stack_pointer is ok. However, in arch_thread_enter_userspace, the stackTop is decremented by12 already, as the "args" are pushed on the stack. Wouldn't that break the alignment?

In the current code, it means we get stackTop % 16 == 4, and modifying arch_randomize_stack_pointer to substract 4 from the aligned value would lead to stackTop % 16 == 0 after the push. Still not what we want. If I change arch_randomize_stack_pointer so the alignment gets correct after the push, then get_signal_stack, which doesn't appear to push anything, would be misaligned.

I'm not sure how to adjust the stack size, either. Callers use thread->signal_stack_size and thread->signal_stack_base, and the respective user_ fields. These are allocated in create_thread_user_stack, and may even be provided by the user. Is it needed to increase the stack size there, or is it acceptable to lose some bytes because of the alignment without adjusting it?

comment:6 by bonefish, 10 years ago

The stack size doesn't have to be adjusted. I'm attaching a quick patch. It's untested and possibly broken, but I hope it clarifies what I mean.

by bonefish, 10 years ago

Attachment: align-stack.diff added

Implementation suggestion (untested)

comment:7 by pulkomandy, 10 years ago

Ok, I ran the WebKit testsuite with this, and it solves the JS crashes I was getting. Applied the patch in hrev46886. Thanks for doing it!

I don't know if x86_64 needs the same change, and if the offset is the same. Leaving the ticket open until this is clarified.

comment:8 by bonefish, 10 years ago

Given that the 64 bit code looks very similar to the 32 bit code, I think it needs to be adjusted analogously. I don't know about the stack offset either. Should be verified with code that requires the alignment.

On a related note, we may also want to check whether we do/should align the kernel stacks. While the alignment shouldn't be strictly required, doing that may improve the performance.

comment:9 by pdziepak, 10 years ago

Fortunately, System V ABI for x86_64 states explicitly what the alignment requirement is:

The end of the input argument area shall be aligned on a 16 byte boundary.

Apparently, that's exactly the same requirement we have for i386 (well, that's hardly surprising) and means that we need to subtract 8 from the initial stack pointer returned by arch_randomize_stack_pointer.

in reply to:  8 comment:10 by korli, 10 years ago

Replying to bonefish:

Should be verified with code that requires the alignment.

Mesa has such a requirement on x86_64.

by korli, 10 years ago

Attachment: 10509_x86_64.patch added

x86_64 patch

comment:11 by korli, 10 years ago

patch: 01

comment:12 by korli, 10 years ago

Here is a patch for x86_64. It doesn't seem to break anything, but unfortunately I couldn't check if things are better with webkit. I thought Mesa needed such an alignment but I'm not sure anymore it's the reason for the crash I met.

comment:13 by js, 10 years ago

I applied the patch and tried LLVM again, but LLVM is still crashing due to unaligned stack access using movdqa. ls -l /system/kernel_x86_64 shows a date that is after I applied the patch, so I'm assuming I replaced the kernel correctly.

comment:14 by js, 10 years ago

Oh, forgot to mention: Our debugger shows an %rsp that ends in 0x8 and the movdqa stores its value in (%rsp). When I look at the %rsp in _start, it ends in 0x0, according to the debugger. IMHO it should be the other way around: When you enter a function, the %rsp should end in 0x8. The first thing the prologue will do is pushing %rbq, so that after the push the alignment will end 0x0 and thus be correct.

comment:15 by pulkomandy, 10 years ago

Milestone: R1/alpha5Unscheduled
Platform: x86x86-64

The issue is fixed for x86; this is the only supported arch for R1; moving out of alpha5 milestone.

by simonsouth, 8 years ago

Glue code should not be un-aligning stack

comment:16 by simonsouth, 8 years ago

Trying to use a libroot.so built with debugging information on x86_64 will reliably reproduce this issue with a general protection fault in runtime_loader. The victim is memset_sse, whose first line creates a variable on the stack that must be 16-byte-aligned. (When compiled with -O2 gcc places this variable in a register, so a non-debug build works.)

Applying korli's patch helps but doesn't resolve the issue, since the debugger shows the stack becomes misaligned again once runtime_loader calls the initialization routine of the image it loads from disk.

This is because the glue code for x86_64 upsets the stack alignment by pushing a single register onto the stack before calling initialization code that eventually calls into libroot.

I've added a patch that modifies the glue code to keep the stack 16-byte-aligned. With korli's patch and this one applied the issue seems to be completely resolved; I can produce a stable nightly image (and a fully debuggable libroot.so) though the packages the build system downloads, notably WebKit and the OpenGL renderer, will need to be rebuilt with the new glue code before everything works normally again.

comment:17 by korli, 8 years ago

Good analysis so far, your patch applied in hrev49731. The other patch is a bit dated, I'll have a look again before applying. I'll also try your hint at reproducing the issue.

comment:18 by korli, 8 years ago

I could effectively reproduce with a libroot.so with debug info on a Haiku version without the glue code patch. But once the glue code patch is applied, I get no crash, whereas the kernel patch isn't yet applied. Any ideas on what could trigger the crash without the kernel patch?

comment:19 by simonsouth, 8 years ago

Not off the top of my head. I tested only with both patches applied. Maybe the glue code was the issue all along.

I'll drop your patch from my local commits and build again and see if I can find another way to trigger this.

comment:20 by simonsouth, 8 years ago

So far I've failed to reproduce this problem with (only) the glue-code patch applied.

Building as much of the system as possible with debugging enabled, in the hopes of forcing greater reliance on the stack, didn't have any effect. Neither did compiling libroot with -O3 -msse to try to get gcc to output more SSE instructions.

The result is always the same: Libraries built with the new glue code run fine; libraries built with the old cause a GPF.

I think the glue-code fix might actually have resolved this.

comment:21 by diver, 8 years ago

Could #11920 be related?

comment:22 by pulkomandy, 8 years ago

Probably not, this problem is only on the userland side.

comment:23 by pulkomandy, 7 years ago

I think there still is a problem with 32-bit as well.

Simple test program:

#include <stdio.h>

void pa(void* a)
{
        printf("%x\n", a);
}

int main(void)
{
        int __attribute__((aligned(16))) foo;
        pa(&foo);

        return 0;
}

The address for foo should end in "0" (aligned to 16 bytes). But when built with gcc2, it ends with 4.

The code generated:

main:
        pushl %ebp
        movl %esp,%ebp
        subl $20,%esp
        pushl %ebx
        call .L7
.L7:
        popl %ebx
        addl $_GLOBAL_OFFSET_TABLE_+[.-.L7],%ebx
        addl $-12,%esp
        leal -4(%ebp),%eax
        pushl %eax
        call pa@PLT
        addl $16,%esp
        xorl %eax,%eax
        jmp .L5
        .align 4
.L5:
        movl -24(%ebp),%ebx
        movl %ebp,%esp
        popl %ebp
        ret

With gcc5, the pointer is properly aligned. However, this seems to be gcc own's feature:

main:
        leal    4(%esp), %ecx
        andl    $-16, %esp
        pushl   -4(%ecx)
        pushl   %ebp
        movl    %esp, %ebp
        pushl   %ebx
        pushl   %ecx
        subl    $16, %esp
        call    __x86.get_pc_thunk.ax
        addl    $_GLOBAL_OFFSET_TABLE_, %eax
        subl    $12, %esp
        leal    -24(%ebp), %edx
        pushl   %edx
        movl    %eax, %ebx
        call    pa@PLT
        addl    $16, %esp
        movl    $0, %eax
        leal    -8(%ebp), %esp
        popl    %ecx
        popl    %ebx
        popl    %ebp
        leal    -4(%ecx), %esp
        ret

The second instruction realigns the stack, no matter its initial aligment.

This "attribute aligned" is also what ffmpeg uses to detect if the compiler and runtime properly aligns stack variables. I have modified the stack allocation so that the simple test has a properly aligned result, but then ffmpeg still complained that its variables were misaligned. So there must be something else at play.

Version 0, edited 7 years ago by pulkomandy (next)

comment:24 by axeld, 7 years ago

Owner: changed from axeld to nobody
Status: newassigned

comment:25 by pulkomandy, 6 years ago

We are still not quite there even on 32-bit x86. Here is a revised test program:

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <OS.h>

#define ALIGNED __attribute__((aligned(16)))
//#define ALIGNED

void pa(const char* from, void* a)
{
        if ((uintptr_t)a & 0xF)
                printf("In %s, stack is NOT aligned: %x\n", from, a);
        else
                printf("In %s, stack is aligned!\n", from);
}

int function(void)
{
        int ALIGNED foo;
        pa("function", &foo);

        return 0;
}

status_t thread(void* arg)
{
        int ALIGNED foo;
        pa("thread", &foo);

        return B_OK;
}

void handler(int param)
{
        int ALIGNED foo;
        pa("signal", &foo);
}

int main(void)
{
        // Test from main
        int ALIGNED foo;
        pa("main", &foo);

        // Test from called function
        function();

        // Test from thread
        {
                status_t rv;
                wait_for_thread(spawn_thread(thread, "test", B_NORMAL_PRIORITY, NULL), &rv);
        }

        // Test from signal handler
        {
                stack_t signalStack;
                struct sigaction action;

                signalStack.ss_sp = malloc(SIGSTKSZ);
                signalStack.ss_flags = 0;
                signalStack.ss_size = SIGSTKSZ;
                sigaltstack(&signalStack, NULL);

                action.sa_handler = handler;
                action.sa_mask = 0;
                action.sa_flags = SA_ONSTACK;
                sigaction(SIGUSR1, &action, NULL);

                kill(getpid(), SIGUSR1);
        }

        return 0;
}

Results on a current system (the _noattr ones do not use attribute((aligned))):

[x86_gcc2] ~# ./gcc2_noattr 
In main, stack is NOT aligned: 72873464
In function, stack is NOT aligned: 72873424
In thread, stack is NOT aligned: 71820b44
In signal, stack is NOT aligned: 18502a34
[x86_gcc2] ~# ./gcc2-attr 
In main, stack is NOT aligned: 71b0f314
In function, stack is NOT aligned: 71b0f2d4
In thread, stack is NOT aligned: 722483e4
In signal, stack is NOT aligned: 183ecae4
[x86_gcc2] ~# ./gcc5_noattr 
In main, stack is NOT aligned: 728cdf6c
In function, stack is NOT aligned: 728cdf2c
In thread, stack is NOT aligned: 70eceb1c
In signal, stack is NOT aligned: 185c0d3c
[x86_gcc2] ~# ./gcc5-attr 
In main, stack is aligned!
In function, stack is aligned!
In thread, stack is aligned!
In signal, stack is aligned!

So, this works only for gcc5, and only if asking the compiler to realign the stack.

I then made the following change:

--- a/src/system/kernel/arch/x86/32/thread.cpp
+++ b/src/system/kernel/arch/x86/32/thread.cpp
@@ -114,8 +114,8 @@ arch_randomize_stack_pointer(addr_t value)
 {
        STATIC_ASSERT(MAX_RANDOM_VALUE >= B_PAGE_SIZE - 1);
        value -= random_value() & (B_PAGE_SIZE - 1);
-       return (value & ~addr_t(0xf)) - 4;
-               // This means, result % 16 == 12, which is what esp should adhere to
+       return (value & ~addr_t(0xf)) - 8;
+               // This means, result % 16 == 8, which is what esp should adhere to
                // when a function is entered for the stack to be considered aligned to
                // 16 byte.
 }

Results:

/Haiku/home> ./gcc2-attr 
In main, stack is aligned!
In function, stack is aligned!
In thread, stack is aligned!
In signal, stack is aligned!   
/Haiku/home> ./gcc2_noattr 
In main, stack is aligned!
In function, stack is aligned!
In thread, stack is aligned!
In signal, stack is aligned!
/Haiku/home> ./gcc5-attr 
In main, stack is aligned!
In function, stack is aligned!
In thread, stack is NOT aligned: 71f874ac
In signal, stack is NOT aligned: 187c98fc
/Haiku/home> ./gcc5_noattr 
In main, stack is NOT aligned: 72ad2e2c
In function, stack is NOT aligned: 72ad2dec
In thread, stack is NOT aligned: 70e06e98
In signal, stack is NOT aligned: 18fd6d98

So, this fixed the problem for gcc2 in all cases. However, gcc5 is now broken, even when asking the compiler to align the variables explicitly (for threads and signal stacks).

This seems related to the issue Simon South patched for x86_64: the 32-bit crti.S is pushing only 12 bytes to the stack (EBP, 8(EBP), then return address) before calling the next function. So, it breaks the stack alignment. Trying to fix this by further offseting the initial stack pointer fixes the main thread, but then other threads and sinal handlers become misaligned.

I tried to fix crti.S by pushing an extra value to the stack, but then launch_daemon crashed when I tried to boot the resulting image. I must have missed something?

comment:26 by jua, 6 years ago

There are two separate issues of alignment here:

1) The alignment of local variables on the stack. The default alignment for the various data types is defined by the compiler's ABI. If a greater alignment is required, it must be specified manually with e.g. the "aligned" attribute. Note that the "aligned" attribute is broken on gcc2 for stack variables. That's e.g. why ffmpeg always complains. It was only fixed in later gcc versions, see https://gcc.gnu.org/bugzilla/show_bug.cgi?id=16660 I once looked into backporting that support to gcc2, but gave up quickly, the amount of code difference is large.

The compiler is free to layout the local variables on the stack the way it wants to. There is no determinism where the first local variable of a function will end up, and in any way, when the function prologue was run, things like the previous EBP will already be on the stack anyway (moving the stack pointer). This means that PulkoMandy's test program does not show anything's wrong: it only checks if a local variable is aligned to 16. But this usually won't be the case unless you request it explicitly with the "aligned" attribute - that's not a bug.

Here's a version of the test program that will actually check the stack pointer value, and for me at least (on 32-bit), it reports everything is fine:

#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <OS.h>


uint32 gStackPointer;

#define CHECK_STACK_ALIGN(from) \
	asm volatile ("mov %%esp, %0" : "=r" (gStackPointer) :); \
	if (gStackPointer & 0xF) \
		printf("In %s, stack is NOT aligned: %x\n", from, gStackPointer); \
	else \
		printf("In %s, stack is aligned!\n", from);

int function(void)
{
	CHECK_STACK_ALIGN("function");

	return 0;
}

status_t thread(void* arg)
{
	CHECK_STACK_ALIGN("thread");

	return B_OK;
}

void handler(int param)
{
	CHECK_STACK_ALIGN("signal");
}

int main(void)
{
	CHECK_STACK_ALIGN("main");

	// Test from called function
	function();

	// Test from thread
	{
		status_t rv;
		wait_for_thread(spawn_thread(thread, "test", B_NORMAL_PRIORITY, NULL), &rv);
	}

	// Test from signal handler
	{
		stack_t signalStack;
		struct sigaction action;

		signalStack.ss_sp = malloc(SIGSTKSZ);
		signalStack.ss_flags = 0;
		signalStack.ss_size = SIGSTKSZ;
		sigaltstack(&signalStack, NULL);

		action.sa_handler = handler;
		action.sa_mask = 0;
		action.sa_flags = SA_ONSTACK;
		sigaction(SIGUSR1, &action, NULL);

		kill(getpid(), SIGUSR1);
	}

	return 0;
}

2) The alignment of the stack pointer that is initially passed off to code compiled by gcc. The patch made by Ingo is correct IMO as it creates an alignment just like gcc itself does it in the main() prologue. Changing the "-4" to "-8" would mess up the alignment.

comment:27 by pulkomandy, 6 years ago

The attempt to patch crti.S (does not work, launch daemon dereferences address 0x8 or 0x10 at boot):

--- a/src/system/glue/arch/x86/crti.S
+++ b/src/system/glue/arch/x86/crti.S
@@ -23,6 +23,7 @@
 FUNCTION(_init):
        pushl   %ebp
        movl    %esp, %ebp
+       sub     4,%esp          // Keep stack aligned
        pushl   8(%ebp)         // put image ID on the stack again
        call    __haiku_init_before
        // crtbegin.o stuff comes here

I find that crti.S is worrying because it will perform the following operations:

  • pushl, adds 4 bytes to the stack
  • pushl, adds 4 bytes to the stack
  • call, adds 4 bytes to the stack (return value)

So, this code pushes 12 bytes to the stack, and does not preserve stack alignment. However, gcc5 main prologue (I already listed that in an above comment) will compensate for this, and realign the stack. This would mean the stack is only aligned after entering main().

And indeed we can add this to the test program:

struct Foo {
        Foo()
        {
                CHECK_STACK_ALIGN("init");
        }
};

Foo init;

With this result:

In init, stack is NOT aligned: 72b40ac4
In main, stack is aligned!
In function, stack is aligned!
In thread, stack is aligned!
In signal, stack is aligned!

(with both gcc2 and gcc5).

I conclude that:

  • The alignment in randomize_stack_pointer is indeed correct and should not be changed (it works for signal stack and non-main threads)
  • crti.S breaks the alignment and should be fixed
  • we don't notice this too badly, because gcc will realign the stack when entering the main function, but global constructors (and possibly global destructors) are affected.

comment:28 by pulkomandy, 6 years ago

Made the change to crti.S in hrev51664. I had missed a $ in the code above to mark the 4 as immediate. This fixes alignment for init and fini code.

Note that this code is inlined in compiled executables, so all executables which need an aligned stack in these places should be recompiled.

comment:29 by pulkomandy, 6 years ago

patch: 10

comment:30 by pulkomandy, 6 years ago

patch: 0

Remaining patch migrated to Gerrit: https://review.haiku-os.org/39

comment:31 by pulkomandy, 6 years ago

patch: 0
Resolution: fixed
Status: assignedclosed

Patch is merged in hrev51774. Only the signal stack was still misaligned.

Note: See TracTickets for help on using tickets.