Opened 4 months ago

Last modified 7 weeks ago

#18947 new bug

shell exec is about 30 times slower than raw exec syscall

Reported by: X512 Owned by: nobody
Priority: normal Milestone: Unscheduled
Component: System/libroot.so Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description (last modified by X512)

This is hrev57820.

Shell (Bash/Zsh etc.) execution time is very slow compared Linux/*BSD that significantly slow down Autotools based projects compilation and other tasks that heavily use shell scripts. Raw exec syscall is significantly faster (30x). I am not sure is it kernel, libroot or shell port problem, it need more investigation.

UPDATE: problem is related to POSIX locale support. Setting LC_ALL=C significantly speed up shell execution.

loop.sh:

#!sh

if (( $1 <= 0 ))
then
    exit
fi

exec loop.sh $(($1 - 1))

loop.c:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>


int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int n = atoi(argv[1]);
    if (n <= 0)
        return 0;
    
    char buf[32];
    sprintf(buf, "%d", n - 1);
    char *newArgs[3] = {argv[0], buf, NULL};
    execvp(newArgs[0], newArgs);
    
    return 0;
}

time ./loop.sh 1000:

real    0m13.775s
user    0m11.713s
sys     0m1.875s

time ./loop 1000:

real    0m0.409s
user    0m0.144s
sys     0m0.240s

export LC_ALL=C time ./loop.sh 1000:

real    0m1.470s
user    0m0.878s
sys     0m0.546s

Change History (15)

comment:1 by X512, 4 months ago

Description: modified (diff)

comment:2 by korli, 4 months ago

Component: System/KernelSystem

comment:3 by waddlesplash, 4 months ago

For reference, here's what I get for both these benchmarks on Linux (in a VM):

time ./loop.sh 1000

real	0m0.930s
user	0m0.111s
sys	0m0.804s
time ./loop 1000

real	0m0.432s
user	0m0.047s
sys	0m0.373s
Last edited 4 months ago by waddlesplash (previous) (diff)

comment:4 by waddlesplash, 4 months ago

And on WSL1 (on the host):

time ./loop.sh 1000

real    0m4.247s
user    0m0.422s
sys     0m2.875s
time ./loop 1000

real    0m1.869s
user    0m0.031s
sys     0m1.031s

comment:5 by waddlesplash, 4 months ago

hrev57827 may help a little bit here.

comment:6 by X512, 4 months ago

Description: modified (diff)

comment:7 by waddlesplash, 4 months ago

After fixing the profiler (though admittedly we may still be missing ticks...), we can now clearly see that most of the time is spent in loading the locale backend, over half of it in runtime_loader ELF symbol resolution and the rest in the locale initialization.

The only critical reason we absolutely must load the locale backend so early is because we need to initialize a lot of static data (like CTYPE and others). If we had a way to share this data between processes, we could avoid loading the locale backend until the first invocation of an actual locale-dependent method, speeding things up a lot.

comment:8 by waddlesplash, 4 months ago

Component: SystemSystem/libroot.so

comment:9 by waddlesplash, 4 months ago

https://review.haiku-os.org/c/haiku/+/7855 does help a bit here but not as much as I'd hoped it would.

comment:10 by waddlesplash, 3 months ago

DT_GNU_HASH implemented in hrev57868. After rebuilding ICU this does improve time somewhat. The change to adopt the optimized SVR4 hash routine also improved things by a few percent even without rebuilding anything.

comment:11 by waddlesplash, 3 months ago

It looks like we could defer loading of the locale until it is actually used. This would be a bit tricky but would have the biggest performance advantage, of course.

comment:12 by waddlesplash, 3 months ago

hrev57870 should improve things a bit more here.

comment:13 by waddlesplash, 7 weeks ago

I switched ICU to build with -Bsymbolic-functions in https://github.com/haikuports/haikuports/commit/1d32ee97ee0830d2bb9d4dda91e8bd49735701a4. This seems to improve times by about ~0.6 seconds for me (down to ~14.5 instead of ~15.1 at best, ~15.5 more typical.)

comment:14 by waddlesplash, 7 weeks ago

over half of it in runtime_loader ELF symbol resolution

After the changes to ICU's compiler flags, this no longer seems to be the case; we now spend < 30% (and maybe < 20%, hard to tell due to profiler flushes occurring in image_relocated) in runtime_loader in total.

comment:15 by waddlesplash, 7 weeks ago

It appears that (at least on my system) about 30% of the profiled time is spent in ICU's ulocimp_getRegionForSupplementalData, which loads a lot of "likely subtags". This method is used whenever the locale ID doesn't contain a region (country), only a language. If we specify a language up-front, this saves a lot of time; I exported "en_US.UTF-8" instead of the default "en.UTF-8", and this saved 0.5-1.0 seconds of the runtime.

Note: See TracTickets for help on using tickets.