Opened 4 months ago
Last modified 7 weeks ago
#18947 new bug
shell exec is about 30 times slower than raw exec syscall
Reported by: | X512 | Owned by: | nobody |
---|---|---|---|
Priority: | normal | Milestone: | Unscheduled |
Component: | System/libroot.so | Version: | R1/Development |
Keywords: | Cc: | ||
Blocked By: | Blocking: | ||
Platform: | All |
Description (last modified by )
This is hrev57820.
Shell (Bash/Zsh etc.) execution time is very slow compared Linux/*BSD that significantly slow down Autotools based projects compilation and other tasks that heavily use shell scripts. Raw exec syscall is significantly faster (30x). I am not sure is it kernel, libroot or shell port problem, it need more investigation.
UPDATE: problem is related to POSIX locale support. Setting LC_ALL=C
significantly speed up shell execution.
loop.sh
:
#!sh if (( $1 <= 0 )) then exit fi exec loop.sh $(($1 - 1))
loop.c
:
#include <stdio.h> #include <stdlib.h> #include <unistd.h> int main(int argc, char **argv) { if (argc < 2) return 1; int n = atoi(argv[1]); if (n <= 0) return 0; char buf[32]; sprintf(buf, "%d", n - 1); char *newArgs[3] = {argv[0], buf, NULL}; execvp(newArgs[0], newArgs); return 0; }
time ./loop.sh 1000
:
real 0m13.775s user 0m11.713s sys 0m1.875s
time ./loop 1000
:
real 0m0.409s user 0m0.144s sys 0m0.240s
export LC_ALL=C
time ./loop.sh 1000
:
real 0m1.470s user 0m0.878s sys 0m0.546s
Change History (15)
comment:1 by , 4 months ago
Description: | modified (diff) |
---|
comment:2 by , 4 months ago
Component: | System/Kernel → System |
---|
comment:4 by , 4 months ago
And on WSL1 (on the host):
time ./loop.sh 1000 real 0m4.247s user 0m0.422s sys 0m2.875s
time ./loop 1000 real 0m1.869s user 0m0.031s sys 0m1.031s
comment:6 by , 4 months ago
Description: | modified (diff) |
---|
comment:7 by , 4 months ago
After fixing the profiler (though admittedly we may still be missing ticks...), we can now clearly see that most of the time is spent in loading the locale backend, over half of it in runtime_loader ELF symbol resolution and the rest in the locale initialization.
The only critical reason we absolutely must load the locale backend so early is because we need to initialize a lot of static data (like CTYPE and others). If we had a way to share this data between processes, we could avoid loading the locale backend until the first invocation of an actual locale-dependent method, speeding things up a lot.
comment:8 by , 4 months ago
Component: | System → System/libroot.so |
---|
comment:9 by , 4 months ago
https://review.haiku-os.org/c/haiku/+/7855 does help a bit here but not as much as I'd hoped it would.
comment:10 by , 3 months ago
DT_GNU_HASH implemented in hrev57868. After rebuilding ICU this does improve time somewhat. The change to adopt the optimized SVR4 hash routine also improved things by a few percent even without rebuilding anything.
comment:11 by , 3 months ago
It looks like we could defer loading of the locale until it is actually used. This would be a bit tricky but would have the biggest performance advantage, of course.
comment:13 by , 7 weeks ago
I switched ICU to build with -Bsymbolic-functions in https://github.com/haikuports/haikuports/commit/1d32ee97ee0830d2bb9d4dda91e8bd49735701a4. This seems to improve times by about ~0.6 seconds for me (down to ~14.5 instead of ~15.1 at best, ~15.5 more typical.)
comment:14 by , 7 weeks ago
over half of it in runtime_loader ELF symbol resolution
After the changes to ICU's compiler flags, this no longer seems to be the case; we now spend < 30% (and maybe < 20%, hard to tell due to profiler flushes occurring in image_relocated) in runtime_loader in total.
comment:15 by , 7 weeks ago
It appears that (at least on my system) about 30% of the profiled time is spent in ICU's ulocimp_getRegionForSupplementalData
, which loads a lot of "likely subtags". This method is used whenever the locale ID doesn't contain a region (country), only a language. If we specify a language up-front, this saves a lot of time; I exported "en_US.UTF-8" instead of the default "en.UTF-8", and this saved 0.5-1.0 seconds of the runtime.
For reference, here's what I get for both these benchmarks on Linux (in a VM):