Opened 7 months ago

Last modified 2 weeks ago

#18947 new bug

shell exec is about 30 times slower than raw exec syscall

Reported by: X512 Owned by: nobody
Priority: normal Milestone: Unscheduled
Component: System/libroot.so Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description (last modified by X512)

This is hrev57820.

Shell (Bash/Zsh etc.) execution time is very slow compared Linux/*BSD that significantly slow down Autotools based projects compilation and other tasks that heavily use shell scripts. Raw exec syscall is significantly faster (30x). I am not sure is it kernel, libroot or shell port problem, it need more investigation.

UPDATE: problem is related to POSIX locale support. Setting LC_ALL=C significantly speed up shell execution.

loop.sh:

#!sh

if (( $1 <= 0 ))
then
    exit
fi

exec loop.sh $(($1 - 1))

loop.c:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>


int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    int n = atoi(argv[1]);
    if (n <= 0)
        return 0;
    
    char buf[32];
    sprintf(buf, "%d", n - 1);
    char *newArgs[3] = {argv[0], buf, NULL};
    execvp(newArgs[0], newArgs);
    
    return 0;
}

time ./loop.sh 1000:

real    0m13.775s
user    0m11.713s
sys     0m1.875s

time ./loop 1000:

real    0m0.409s
user    0m0.144s
sys     0m0.240s

export LC_ALL=C time ./loop.sh 1000:

real    0m1.470s
user    0m0.878s
sys     0m0.546s

Change History (20)

comment:1 by X512, 7 months ago

Description: modified (diff)

comment:2 by korli, 7 months ago

Component: System/KernelSystem

comment:3 by waddlesplash, 7 months ago

For reference, here's what I get for both these benchmarks on Linux (in a VM):

time ./loop.sh 1000

real	0m0.930s
user	0m0.111s
sys	0m0.804s
time ./loop 1000

real	0m0.432s
user	0m0.047s
sys	0m0.373s
Last edited 7 months ago by waddlesplash (previous) (diff)

comment:4 by waddlesplash, 7 months ago

And on WSL1 (on the host):

time ./loop.sh 1000

real    0m4.247s
user    0m0.422s
sys     0m2.875s
time ./loop 1000

real    0m1.869s
user    0m0.031s
sys     0m1.031s

comment:5 by waddlesplash, 7 months ago

hrev57827 may help a little bit here.

comment:6 by X512, 7 months ago

Description: modified (diff)

comment:7 by waddlesplash, 7 months ago

After fixing the profiler (though admittedly we may still be missing ticks...), we can now clearly see that most of the time is spent in loading the locale backend, over half of it in runtime_loader ELF symbol resolution and the rest in the locale initialization.

The only critical reason we absolutely must load the locale backend so early is because we need to initialize a lot of static data (like CTYPE and others). If we had a way to share this data between processes, we could avoid loading the locale backend until the first invocation of an actual locale-dependent method, speeding things up a lot.

comment:8 by waddlesplash, 7 months ago

Component: SystemSystem/libroot.so

comment:9 by waddlesplash, 7 months ago

https://review.haiku-os.org/c/haiku/+/7855 does help a bit here but not as much as I'd hoped it would.

comment:10 by waddlesplash, 7 months ago

DT_GNU_HASH implemented in hrev57868. After rebuilding ICU this does improve time somewhat. The change to adopt the optimized SVR4 hash routine also improved things by a few percent even without rebuilding anything.

comment:11 by waddlesplash, 7 months ago

It looks like we could defer loading of the locale until it is actually used. This would be a bit tricky but would have the biggest performance advantage, of course.

comment:12 by waddlesplash, 7 months ago

hrev57870 should improve things a bit more here.

comment:13 by waddlesplash, 5 months ago

I switched ICU to build with -Bsymbolic-functions in https://github.com/haikuports/haikuports/commit/1d32ee97ee0830d2bb9d4dda91e8bd49735701a4. This seems to improve times by about ~0.6 seconds for me (down to ~14.5 instead of ~15.1 at best, ~15.5 more typical.)

comment:14 by waddlesplash, 5 months ago

over half of it in runtime_loader ELF symbol resolution

After the changes to ICU's compiler flags, this no longer seems to be the case; we now spend < 30% (and maybe < 20%, hard to tell due to profiler flushes occurring in image_relocated) in runtime_loader in total.

comment:15 by waddlesplash, 5 months ago

It appears that (at least on my system) about 30% of the profiled time is spent in ICU's ulocimp_getRegionForSupplementalData, which loads a lot of "likely subtags". This method is used whenever the locale ID doesn't contain a region (country), only a language. If we specify a language up-front, this saves a lot of time; I exported "en_US.UTF-8" instead of the default "en.UTF-8", and this saved 0.5-1.0 seconds of the runtime.

comment:16 by waddlesplash, 3 weeks ago

We should investigate whether lazily loading the locale data would help here. Whether lazily loading just some kinds of data, or lazily loading the locale backend at all (which would be ideal, but I don't know if these applications are making use if it or not.)

comment:17 by pulkomandy, 2 weeks ago

If I remember correctly, loading of locale data is already delayed, but calling any C library function that depends on the locale will load it. So it's hard to escape that.

With the data being in a shared library, could we save time by sharing that more efficiently? Or, failing tqat, should the data be stored in an area that all applications could share? Then there is no need to load it over and over. But that may require patching icu.

It appears that (at least on my system) about 30% of the profiled time is spent in ICU's ulocimp_getRegionForSupplementalData, which loads a lot of "likely subtags". This method is used whenever the locale ID doesn't contain a region (country), only a language. If we specify a language up-front, this saves a lot of time; I exported "en_US.UTF-8" instead of the default "en.UTF-8", and this saved 0.5-1.0 seconds of the runtime.

This seems like an easy win, the locale preferences could make sure we have a "fully qualified" locale and not a partial one, then?

comment:18 by waddlesplash, 2 weeks ago

In the most common case here, the locale data is being loaded when an application calls setlocale("") on startup. This means "load the locale specified in environment variables." In this particular case I think we might be able to defer loading the locale if the application doesn't end up using anything; but it might be the case that basically all applications which call this (like shells) actually do use things, so maybe that wouldn't help anyway.

comment:19 by X512, 2 weeks ago

With the data being in a shared library, could we save time by sharing that more efficiently?

If data is stored in read-only ELF segment without relocations, it should be already shared efficient enough.

comment:20 by waddlesplash, 2 weeks ago

The problem isn't loading the data, the problem is that ICU does a lot of manipulation of the data on startup. That's why it's so slow. The mmap() calls are quite fast.

Note: See TracTickets for help on using tickets.