Opened 7 months ago
Last modified 2 weeks ago
#18947 new bug
shell exec is about 30 times slower than raw exec syscall
Reported by: | X512 | Owned by: | nobody |
---|---|---|---|
Priority: | normal | Milestone: | Unscheduled |
Component: | System/libroot.so | Version: | R1/Development |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Platform: | All |
Description (last modified by )
This is hrev57820.
Shell (Bash/Zsh etc.) execution time is very slow compared Linux/*BSD that significantly slow down Autotools based projects compilation and other tasks that heavily use shell scripts. Raw exec syscall is significantly faster (30x). I am not sure is it kernel, libroot or shell port problem, it need more investigation.
UPDATE: problem is related to POSIX locale support. Setting LC_ALL=C
significantly speed up shell execution.
loop.sh
:
#!sh if (( $1 <= 0 )) then exit fi exec loop.sh $(($1 - 1))
loop.c
:
#include <stdio.h> #include <stdlib.h> #include <unistd.h> int main(int argc, char **argv) { if (argc < 2) return 1; int n = atoi(argv[1]); if (n <= 0) return 0; char buf[32]; sprintf(buf, "%d", n - 1); char *newArgs[3] = {argv[0], buf, NULL}; execvp(newArgs[0], newArgs); return 0; }
time ./loop.sh 1000
:
real 0m13.775s user 0m11.713s sys 0m1.875s
time ./loop 1000
:
real 0m0.409s user 0m0.144s sys 0m0.240s
export LC_ALL=C
time ./loop.sh 1000
:
real 0m1.470s user 0m0.878s sys 0m0.546s
Change History (20)
comment:1 by , 7 months ago
Description: | modified (diff) |
---|
comment:2 by , 7 months ago
Component: | System/Kernel → System |
---|
comment:4 by , 7 months ago
And on WSL1 (on the host):
time ./loop.sh 1000 real 0m4.247s user 0m0.422s sys 0m2.875s
time ./loop 1000 real 0m1.869s user 0m0.031s sys 0m1.031s
comment:6 by , 7 months ago
Description: | modified (diff) |
---|
comment:7 by , 7 months ago
After fixing the profiler (though admittedly we may still be missing ticks...), we can now clearly see that most of the time is spent in loading the locale backend, over half of it in runtime_loader ELF symbol resolution and the rest in the locale initialization.
The only critical reason we absolutely must load the locale backend so early is because we need to initialize a lot of static data (like CTYPE and others). If we had a way to share this data between processes, we could avoid loading the locale backend until the first invocation of an actual locale-dependent method, speeding things up a lot.
comment:8 by , 7 months ago
Component: | System → System/libroot.so |
---|
comment:9 by , 7 months ago
https://review.haiku-os.org/c/haiku/+/7855 does help a bit here but not as much as I'd hoped it would.
comment:10 by , 7 months ago
DT_GNU_HASH implemented in hrev57868. After rebuilding ICU this does improve time somewhat. The change to adopt the optimized SVR4 hash routine also improved things by a few percent even without rebuilding anything.
comment:11 by , 7 months ago
It looks like we could defer loading of the locale until it is actually used. This would be a bit tricky but would have the biggest performance advantage, of course.
comment:13 by , 5 months ago
I switched ICU to build with -Bsymbolic-functions in https://github.com/haikuports/haikuports/commit/1d32ee97ee0830d2bb9d4dda91e8bd49735701a4. This seems to improve times by about ~0.6 seconds for me (down to ~14.5 instead of ~15.1 at best, ~15.5 more typical.)
comment:14 by , 5 months ago
over half of it in runtime_loader ELF symbol resolution
After the changes to ICU's compiler flags, this no longer seems to be the case; we now spend < 30% (and maybe < 20%, hard to tell due to profiler flushes occurring in image_relocated) in runtime_loader in total.
comment:15 by , 5 months ago
It appears that (at least on my system) about 30% of the profiled time is spent in ICU's ulocimp_getRegionForSupplementalData
, which loads a lot of "likely subtags". This method is used whenever the locale ID doesn't contain a region (country), only a language. If we specify a language up-front, this saves a lot of time; I exported "en_US.UTF-8" instead of the default "en.UTF-8", and this saved 0.5-1.0 seconds of the runtime.
comment:16 by , 3 weeks ago
We should investigate whether lazily loading the locale data would help here. Whether lazily loading just some kinds of data, or lazily loading the locale backend at all (which would be ideal, but I don't know if these applications are making use if it or not.)
comment:17 by , 2 weeks ago
If I remember correctly, loading of locale data is already delayed, but calling any C library function that depends on the locale will load it. So it's hard to escape that.
With the data being in a shared library, could we save time by sharing that more efficiently? Or, failing tqat, should the data be stored in an area that all applications could share? Then there is no need to load it over and over. But that may require patching icu.
It appears that (at least on my system) about 30% of the profiled time is spent in ICU's ulocimp_getRegionForSupplementalData, which loads a lot of "likely subtags". This method is used whenever the locale ID doesn't contain a region (country), only a language. If we specify a language up-front, this saves a lot of time; I exported "en_US.UTF-8" instead of the default "en.UTF-8", and this saved 0.5-1.0 seconds of the runtime.
This seems like an easy win, the locale preferences could make sure we have a "fully qualified" locale and not a partial one, then?
comment:18 by , 2 weeks ago
In the most common case here, the locale data is being loaded when an application calls setlocale("")
on startup. This means "load the locale specified in environment variables." In this particular case I think we might be able to defer loading the locale if the application doesn't end up using anything; but it might be the case that basically all applications which call this (like shells) actually do use things, so maybe that wouldn't help anyway.
comment:19 by , 2 weeks ago
With the data being in a shared library, could we save time by sharing that more efficiently?
If data is stored in read-only ELF segment without relocations, it should be already shared efficient enough.
comment:20 by , 2 weeks ago
The problem isn't loading the data, the problem is that ICU does a lot of manipulation of the data on startup. That's why it's so slow. The mmap() calls are quite fast.
For reference, here's what I get for both these benchmarks on Linux (in a VM):