Opened 18 months ago

Closed 18 months ago

Last modified 18 months ago

#13744 closed bug (fixed)

Kernel PANIC stack fault exception from a userland C++ app

Reported by: phoudoin Owned by: nobody
Priority: critical Milestone: Unscheduled
Component: System/Kernel Version: R1/Development
Keywords: kdl ss cpp Cc:
Blocked By: Blocking:
Has a Patch: no Platform: x86-64

Description

Consider stack_fault.c source file:

#include <sys/select.h>

int main(int argc, char* argv[])
{
  fd_set errorSet;

  // let's trigger a stack segment fault
  FD_SET(-1, &errorSet);
  return 0;
}

When compiled as a C module:

gcc -o stack_fault_c stack_fault.c

running stack_fault_c would crash - as expected - on a segmentation fault.

But when compiled as a C++ module:

g++ -o stack_fault_cpp stack_fault.c

running stack_fault_cpp trigger a KDL PANIC, which is neither expected or desirable. Screenshot attached.

Running Haiku x86_64 hrev51474 under VirtualBox, 2 vCPU, 4GB of RAM.

Attachments (2)

VirtualBox_Haiku x86_64_20_10_2017_16_38_24.png (121.9 KB) - added by phoudoin 18 months ago.
KDL Panic dump
handle-stack-fault.diff (1.8 KB) - added by jua 18 months ago.

Download all attachments as: .zip

Change History (18)

Changed 18 months ago by phoudoin

KDL Panic dump

comment:1 Changed 18 months ago by phoudoin

Has a Patch: set

comment:2 Changed 18 months ago by phoudoin

Owner: changed from nobody to korli
Status: newassigned

comment:3 Changed 18 months ago by korli

Owner: changed from korli to nobody

Please don't change the default assignee.

comment:4 Changed 18 months ago by pulkomandy

FD_SET is just accesing an array. So the code is roughly equivalent to:

int main(int argc, char* argv[])
{
  int foo[1];

  // let's trigger a stack segment fault
  foo[-1] |= 1;
  return 0;
}

This will overwrite whatever is on the stack after the foo array (because the stack grows downwards). We can't make this a systematic segmentation fault, because the application is allowed to write to the stack and the array is not necessarily on a page boundary (there may be other local variables after it). What happens once the stack is corrupt is unspecified, but to understand how we end up in KDL we will have to disassemble the code and go through it step by step to find out what gets corrupt.

It seems the kernel does detect the problem but is not sure what to do (crash/kill the team, I guess?). Or maybe some important data structure stored near the stack was erased?

comment:5 Changed 18 months ago by pulkomandy

Has a Patch: unset

comment:6 Changed 18 months ago by jua

Due to varying signedness of the various operands in the macros, it will actually end up accessing a really large memory address, so the "foo[-1]" doesn't work to reproduce.

Here's a minimal example that reliably reproduces this KDL on x64-Haiku for me:

#include <SupportDefs.h>

int main()
{
    int i[1];
    uint64 offset = 0x10000000000;
    i[offset] = 1;
    return 0;
}

comment:7 Changed 18 months ago by pulkomandy

Then it's just writing at essentially a random address. If there is nothing mapped there it will segfault, otherwise it will corrupt some other data. So we need to identify what's at that address and how corrupting it puts us into KDL.

comment:8 Changed 18 months ago by jua

Ok, looking a bit further, this is a problem with our x86-64 exception handling, which does not properly handle the Stack-Fault (#SS) exception.

x86-64 has 64-bit addresses, but in practice, only 48 of these bits are really usable, the rest of the address bits must be a copy of bit 47. This is the "canonical address format". Violating this rule will generate either a #GP or an #SS exception. Quoting from the Intel Architecture manual, vol. 1 ch. 3.3.7.1:

The first implementation of IA-32 processors with Intel 64 architecture supports a 48-bit linear address. This means a canonical address must have bits 63 through 48 set to zeros or ones (depending on whether bit 47 is a zero or one). [...] If a linear-memory reference is not in canonical form, the implementation should generate an exception. In most cases, a general-protection exception (#GP) is generated. However, in the case of explicit or implied stack references, a stack fault (#SS) is generated.

We handle the case of #GP exceptions just fine. However, what the code in this ticket does is referencing the stack (by doing an address calculation involving ebp), so it generates an #SS exception. Currently, we handle that one with x86_fatal_exception(), which triggers KDL.

So what we need is to properly handle the #SS exception to only terminate the application in such cases.

Changed 18 months ago by jua

Attachment: handle-stack-fault.diff added

comment:9 Changed 18 months ago by jua

Has a Patch: set

comment:10 Changed 18 months ago by jua

Has a Patch: unset

Attached a patch. Fixes the bug - comments welcome. I'm not sure if the signal code is the right one, but it seems that none of the POSIX signals really fits here, so I tried to make a reasonable choice.

Last edited 18 months ago by jua (previous) (diff)

comment:11 Changed 18 months ago by korli

Linux seems to use SIGBUS. "Access to an undefined portion of a memory object." according to http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/signal.h.html. Ends with "Abnormal termination of the process".

comment:12 Changed 18 months ago by jua

Ok, FreeBSD uses that too, so that's a better choice.

comment:13 Changed 18 months ago by phoudoin

So, SIGBUS, BUS_ADRERR, then? Otherwise, the patch looks fine.

Last edited 18 months ago by phoudoin (previous) (diff)

comment:14 Changed 18 months ago by jua

Resolution: fixed
Status: assignedclosed

Applied in hrev51511.

p.s. oops, forgot to set git correct user/mail on new machine prior to commit...

comment:15 Changed 18 months ago by phoudoin

Any idea why the same code KDL when compiled as c++ code but not as C code? Something different in the compiler generated code I guess...

comment:16 Changed 18 months ago by jua

Yup, if you look at the disassembly of both, gcc generates quite different code. In C language mode, the crash-inducing instruction doesn't contain a stack reference, so it generates a general protection fault instead of a stack fault.

Note: See TracTickets for help on using tickets.