Opened 8 years ago
Closed 11 months ago
#13088 closed enhancement (fixed)
Performance: Investigate network I/O bottlenecks
| Reported by: | kallisti5 | Owned by: | nobody |
|---|---|---|---|
| Priority: | normal | Milestone: | Unscheduled |
| Component: | Network & Internet/TCP | Version: | R1/Development |
| Keywords: | | Cc: | |
| Blocked By: | | Blocking: | |
| Platform: | All | | |
Description
There seem to be some pretty serious performance bottlenecks in the network stack, the disk drivers, or the BFS filesystem.
Change History (17)
comment:1 by , 8 years ago
Linux test - wget https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/linux-4.9-rc6.tar.xz
- Linux - x86_64 - GPT - btrfs - SATA - 60MB/s
comment:2 by , 8 years ago
Haiku test - wget https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/linux-4.9-rc6.tar.xz
Haiku x86_64, hrev50707, HaikuPorts wget 1.18-1 x86_64
device Network controller (Ethernet controller) [2|0|0] vendor 10ec: Realtek Semiconductor Co., Ltd. device 8168: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
- Haiku - x86_64 - GPT - BFS - SATA - 7.8MB/s
- Haiku - x86_64 - GPT - BFS - RAM Disk (200M) - 8.22MB/s
That seems to show the issue is somewhere in the network stack, the NIC driver, or the BFS driver.
comment:3 by , 8 years ago
Looks like it's definitely the network stack or the network card driver.
Haiku x86_64, hrev50707
~> dd if=linux-4.9-rc6.tar.xz of=/RAM/linux-4.9-rc6.tar.xz
181945+1 records in
181945+1 records out
93156312 bytes (93 MB) copied, 0.764035 s, 122 MB/s
comment:5 by , 8 years ago
This is definitely caused by #13089. If I disable TCP window scaling on Linux, I get ~8MB/s downloads.
sudo sysctl -w net.ipv4.tcp_window_scaling=0
linux-4.9-rc6.tar.xz.1 42%[===> ] 37.76M 8.95MB/s eta 6s
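As a rough sanity check (the ~8 ms RTT below is an assumption, not something measured in this ticket): without window scaling the advertised receive window tops out at 64 KiB, so throughput is bounded by one window per round trip, which lands right around the ~8 MB/s seen above.

```cpp
// Back-of-the-envelope sketch; the RTT value is assumed, not taken from this ticket.
#include <cstdio>

int main()
{
	const double windowBytes = 64.0 * 1024;  // largest advertisable window without RFC 1323 scaling
	const double rttSeconds = 0.008;         // assumed ~8 ms round trip to the CDN
	// TCP can have at most one window of unacknowledged data in flight per RTT.
	std::printf("max throughput ~= %.1f MB/s\n", windowBytes / rttSeconds / 1e6);  // ~8.2 MB/s
	return 0;
}
```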
comment:6 by , 8 years ago
Summary: Performance: Investigate bottlenecks → Performance: Investigate network I/O bottlenecks
comment:7 by , 8 years ago
We actually do implement basic RFC 1323 from what I can tell, and Wireshark shows TCP window scaling multipliers in use.
I did another transfer, a wget of a *larger* file from my local machine, and saw some new behavior: the speed ramps up and down.
8MB/s, 7MB/s <10 or so seconds> 60MB/s <5 or so seconds> 2MB/s <10 or so seconds> 100MB/s <5 or so seconds> 2MB/s
So it seems more like a congestion issue? There is no other traffic on the local LAN.
comment:8 by , 8 years ago
Component: - General → Network & Internet/TCP
Owner: changed from … to …
comment:9 by , 8 years ago
Owner: changed from … to …
Status: new → assigned
comment:10 by , 6 years ago
We have iperf in our repositories; assuming there is nothing wrong with our port, I ran a few tests:
iperf -s
iperf -c 127.0.0.1
- Linux x86_64 4.18.10-200.fc28.x86_64:
  - TCP window size: 2.50 MByte (default)
  - [ 3] 0.0-10.0 sec 57.3 GBytes 49.3 Gbits/sec
  - [ 3] 0.0-10.0 sec 57.0 GBytes 48.9 Gbits/sec
  - TCP window size: 2.50 MByte (default)
- Haiku, x86_64 hrev52295-16-g039fdd9ffe (early beta1 branch)
  - TCP window size: 32.0 KByte (default)
  - Uptime 15 days, 22 hours, 46 minutes - exposed to the internet.
  - [ 3] 0.0-10.0 sec 71.2 MBytes 59.7 Mbits/sec
  - [ 3] 0.0-10.0 sec 44.7 MBytes 37.5 Mbits/sec
- Haiku, x86_64 hrev52295-16-g039fdd9ffe (early beta1 branch)
  - TCP window size: 32.0 KByte (default)
  - Uptime 1 minute
  - [ 3] 0.0-10.0 sec 62.2 MBytes 52.2 Mbits/sec
  - [ 3] 0.0-10.0 sec 58.6 MBytes 49.2 Mbits/sec
- Haiku, x86_64 hrev52295-16-g039fdd9ffe (early beta1 branch)
  - TCP window size: 2.50 MByte
  - Uptime 7 minutes
  - [ 3] 0.0-10.0 sec 65.8 MBytes 55.2 Mbits/sec
  - [ 3] 0.0-10.0 sec 33.6 MBytes 28.2 Mbits/sec
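For context, the "TCP window size" iperf prints is just the socket buffer size. A hypothetical sketch of what setting it to 2.50 MByte amounts to (this is not iperf's actual code; the port and address are simply its defaults):

```cpp
// Hypothetical client-side sketch of what iperf's -w option boils down to:
// set SO_RCVBUF/SO_SNDBUF before connecting, then read back what was granted.
#include <cstdio>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main()
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0) {
		std::perror("socket");
		return 1;
	}

	int window = (2 * 1024 + 512) * 1024;  // 2.50 MiB, as in the last test above
	setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &window, sizeof(window));
	setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &window, sizeof(window));

	sockaddr_in addr = {};
	addr.sin_family = AF_INET;
	addr.sin_port = htons(5001);           // iperf's default TCP port
	inet_pton(AF_INET, "127.0.0.1", &addr.sin_addr);
	if (connect(fd, (sockaddr*)&addr, sizeof(addr)) != 0)
		std::perror("connect");

	socklen_t len = sizeof(window);
	getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &window, &len);
	std::printf("effective receive buffer: %d bytes\n", window);

	close(fd);
	return 0;
}
```

The kernel may clamp or adjust the requested buffer size, which is why reading it back with getsockopt is worth doing when comparing runs across systems.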
comment:11 by , 6 years ago
One interesting output is strace -c from iperf:
~> strace -c iperf -c 127.0.0.1
------------------------------------------------------------
Client connecting to 127.0.0.1, TCP port 5001
TCP window size: 32.0 KByte (default)
------------------------------------------------------------
[ 3] local 127.0.0.1 port 40108 connected with 127.0.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[ 3]  0.0-10.0 sec  58.9 MBytes  49.4 Mbits/sec

Time %      Usecs   Calls  Usecs/call  Syscall
------ ---------- ------- ----------- --------------------
100.00   10079874       3     3359958  _kern_mutex_switch_lock
  0.00        128       1         128  _kern_image_relocated
  0.00         23       2          11  _kern_spawn_thread
  0.00         19       6           3  _kern_set_area_protection
  0.00          4       1           4  _kern_create_area
  0.00          3       1           3  _kern_exit_team
  0.00          3       2           1  _kern_get_system_info
  0.00          2       1           2  _kern_reserve_address_range
  0.00          2       2           1  _kern_resume_thread
  0.00          1       1           1  _kern_resize_area
  0.00          0       1           0  _kern_get_next_image_info
  0.00          0       4           0  _kern_sigaction
comment:12 by , 6 years ago
https://github.com/google/benchmark seems like it might be helpful to see where we're at vs Linux.
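For anyone picking this up, a minimal benchmark with that library looks roughly like the following (assumed usage of its public API; BM_memcpy is a placeholder of my own, not one of the benchmarks in the output below):

```cpp
// Minimal google/benchmark sketch: register a function and let the library
// pick the iteration count. Build against libbenchmark and link it in.
#include <benchmark/benchmark.h>
#include <cstring>
#include <vector>

static void BM_memcpy(benchmark::State& state)
{
	std::vector<char> src(4096, 'x'), dst(4096);
	for (auto _ : state) {
		std::memcpy(dst.data(), src.data(), src.size());
		benchmark::ClobberMemory();  // keep the copy from being optimized away
	}
}
BENCHMARK(BM_memcpy);

BENCHMARK_MAIN();
```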
Linux 4.18.10-200.fc28.x86_64:
2018-10-05 09:42:27
Running ./test/basic_test
Run on (8 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------------------
Benchmark                                          Time        CPU   Iterations
-------------------------------------------------------------------------------------
BM_empty                                           3 ns       3 ns    233438098
BM_empty/threads:8                                 1 ns       6 ns    122371744
BM_spin_empty/8                                   46 ns      46 ns     15205421
BM_spin_empty/512                               2695 ns    2692 ns       257997
BM_spin_empty/8192                             43966 ns   43901 ns        16100
BM_spin_empty/8/threads:8                         12 ns      95 ns      7261392
BM_spin_empty/512/threads:8                      698 ns    5488 ns       130448
BM_spin_empty/8192/threads:8                   11046 ns   86815 ns         7984
BM_spin_pause_before/8                            46 ns      46 ns     15192276
BM_spin_pause_before/512                        2693 ns    2689 ns       267952
BM_spin_pause_before/8192                      42290 ns   42232 ns        16134
BM_spin_pause_before/8/threads:8                  12 ns      97 ns      7226928
BM_spin_pause_before/512/threads:8               701 ns    5562 ns       128960
BM_spin_pause_before/8192/threads:8            11219 ns   88271 ns         7840
BM_spin_pause_during/8                           458 ns     459 ns      1525514
BM_spin_pause_during/512                        3326 ns    3325 ns       201580
BM_spin_pause_during/8192                      45514 ns   45407 ns        15909
BM_spin_pause_during/8/threads:8                  75 ns     594 ns      1175000
BM_spin_pause_during/512/threads:8               771 ns    6036 ns       116296
BM_spin_pause_during/8192/threads:8            11327 ns   89091 ns         7984
BM_pause_during                                  437 ns     437 ns      1636400
BM_pause_during/threads:8                         66 ns     517 ns      1353320
BM_pause_during/real_time                        425 ns     427 ns      1620308
BM_pause_during/real_time/threads:8               66 ns     517 ns     10697656
BM_spin_pause_after/8                             54 ns      54 ns     10148460
BM_spin_pause_after/512                         3067 ns    3051 ns       234550
BM_spin_pause_after/8192                       47197 ns   47013 ns        13512
BM_spin_pause_after/8/threads:8                   13 ns      97 ns      8500864
BM_spin_pause_after/512/threads:8                714 ns    5651 ns       127944
BM_spin_pause_after/8192/threads:8             11470 ns   88168 ns         8456
BM_spin_pause_before_and_after/8                  54 ns      54 ns     13917278
BM_spin_pause_before_and_after/512              3030 ns    3016 ns       213215
BM_spin_pause_before_and_after/8192            43801 ns   43707 ns        16402
BM_spin_pause_before_and_after/8/threads:8        12 ns      95 ns      7654360
BM_spin_pause_before_and_after/512/threads:8     719 ns    5040 ns       138824
BM_spin_pause_before_and_after/8192/threads:8  11009 ns   84438 ns         8088
BM_empty_stop_start                                2 ns       2 ns    327797067
BM_empty_stop_start/threads:8                      0 ns       3 ns    275301568
BM_KeepRunning                                     3 ns       3 ns    244697531
BM_KeepRunningBatch                                0 ns       0 ns   1000000091
BM_RangedFor                                       2 ns       2 ns    373357250
comment:13 by , 6 years ago
- Haiku x86_64 hrev52295+100: (beta1 + updates) (Dell i5 Optiplex 3010 desktop)
- rtl81xx NIC
- Linux x86_64 4.18.10-200.fc28.x86_64: (Ryzen 1800X)
- Intel 1Gbit NIC
- Cat5e, 1 Gbit switch
- [ 3] 0.0-10.0 sec 935 MBytes 784 Mbits/sec
- [ 3] 0.0-10.0 sec 935 MBytes 785 Mbits/sec
- [ 3] 0.0-10.0 sec 840 MBytes 704 Mbits/sec
- [ 3] 0.0-10.0 sec 935 MBytes 784 Mbits/sec
- [ 3] 0.0-10.0 sec 840 MBytes 705 Mbits/sec
- [ 3] 0.0-60.0 sec 5.37 GBytes 768 Mbits/sec
All of these seem normal over a local network :-|
comment:15 by , 11 months ago
This is likely improved or solved via hrev57492. We still need to test on real hardware once the nightly image builds, though :-)
comment:16 by , 11 months ago
Here's iperf strace call count on hrev57492. We still spend a lot of time in _kern_mutex_switch_lock.
~> strace -c iperf -c 127.0.0.1
------------------------------------------------------------
Client connecting to 127.0.0.1, TCP port 5001
TCP window size: 32.0 KByte (default)
------------------------------------------------------------
[ 1] local 127.0.0.1 port 40420 connected with 127.0.0.1 port 5001
iperf: ../haiku-git/src/system/libroot/posix/pthread/pthread_mutex.cpp:95:status_t __pthread_mutex_lock(pthread_mutex_t*, uint32, bigtime_t): mutex->owner == -1
Abort

Time %      Usecs   Calls  Usecs/call  Syscall
------ ---------- ------- ----------- --------------------
100.00   10016159       1    10016159  _kern_mutex_switch_lock
  0.00         68       6          11  _kern_set_area_protection
  0.00         62       6          10  _kern_resize_area
  0.00         44       2          22  _kern_spawn_thread
  0.00         40       4          10  _kern_sigaction
  0.00         33       2          16  _kern_mutex_unblock
  0.00         32       1          32  _kern_image_relocated
  0.00         27       2          13  _kern_resume_thread
  0.00         21       2          10  _kern_get_system_info
  0.00         14       1          14  _kern_open
  0.00         12       1          12  _kern_create_area
  0.00         11       1          11  _kern_get_next_image_info
  0.00         11       1          11  _kern_reserve_address_range
  0.00         10       1          10  _kern_close
  0.00         10       1          10  _kern_read
Of note, iperf was crashing both before and after the TCP changes leading up to hrev57492.
iperf3 actually shows better results:
~> strace -c iperf3 -c 127.0.0.1
Connecting to host 127.0.0.1, port 5201
[ 5] local 127.0.0.1 port 40614 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate
[ 5]   0.00-1.00   sec   633 MBytes  5.31 Gbits/sec
[ 5]   1.00-2.00   sec   633 MBytes  5.31 Gbits/sec
[ 5]   2.00-3.00   sec   632 MBytes  5.30 Gbits/sec
[ 5]   3.00-4.00   sec   634 MBytes  5.31 Gbits/sec
[ 5]   4.00-5.00   sec   632 MBytes  5.30 Gbits/sec
[ 5]   5.00-6.00   sec   632 MBytes  5.30 Gbits/sec
[ 5]   6.00-7.00   sec   631 MBytes  5.30 Gbits/sec
[ 5]   7.00-8.00   sec   632 MBytes  5.30 Gbits/sec
[ 5]   8.00-9.00   sec   630 MBytes  5.29 Gbits/sec
[ 5]   9.00-10.00  sec   632 MBytes  5.30 Gbits/sec
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[ 5]   0.00-10.00  sec  6.17 GBytes  5.30 Gbits/sec  sender
[ 5]   0.00-10.01  sec  6.17 GBytes  5.30 Gbits/sec  receiver

iperf Done.

Time %      Usecs   Calls  Usecs/call  Syscall
------ ---------- ------- ----------- --------------------
 98.60    7115479  202334          35  _kern_write
  1.19      85828    5066          16  _kern_select
  0.17      12561      21         598  _kern_read
  0.01        582      55          10  _kern_read_stat
  0.00        239       9          26  _kern_open
  0.00        204      12          17  _kern_map_file
  0.00        131      12          10  _kern_set_area_protection
  0.00        114      10          11  _kern_close
  0.00        113       1         113  _kern_unlink
  0.00         89       8          11  _kern_sigaction
  0.00         81       6          13  _kern_image_relocated
  0.00         78       1          78  _kern_write_stat
  0.00         68       2          34  _kern_connect
  0.00         60       6          10  _kern_reserve_address_range
  0.00         57       5          11  _kern_normalize_path
  0.00         54       5          10  _kern_resize_area
  0.00         53       5          10  _kern_create_area
  0.00         49       5           9  _kern_register_image
  0.00         46       4          11  _kern_fcntl
  0.00         40       4          10  _kern_read_link
  0.00         34       3          11  _kern_getsockopt
  0.00         28       2          14  _kern_socket
  0.00         21       2          10  _kern_getsockname
  0.00         21       2          10  _kern_get_system_info
  0.00         20       2          10  _kern_get_thread_info
  0.00         20       2          10  _kern_get_team_usage_info
  0.00         19       1          19  _kern_unmap_memory
  0.00         11       1          11  _kern_setsockopt
  0.00         10       1          10  _kern_getuid
  0.00         10       1          10  _kern_exit_team
  0.00         10       1          10  _kern_get_next_image_info
  0.00         10       1          10  _kern_ioctl
  0.00         10       1          10  _kern_get_timezone
  0.00         10       1          10  _kern_getpeername
Anyway, even comparing the strace'd calls, I'd say the latest hrev is faster on localhost. Let's call this one closed, since the data in it is slowly becoming irrelevant.
I'm sure more optimizations could be made, but we need to start comparing to Linux or BSD :-)
comment:17 by , 11 months ago
Resolution: → fixed
Status: assigned → closed