Opened 8 years ago

Closed 11 months ago

#13088 closed enhancement (fixed)

Performance: Investigate network I/O bottlenecks

Reported by: kallisti5 Owned by: nobody
Priority: normal Milestone: Unscheduled
Component: Network & Internet/TCP Version: R1/Development
Keywords: Cc:
Blocked By: Blocking:
Platform: All

Description

There seems to be some pretty serious performance bottlenecks around the network stack or the disk drivers, or the BFS filesystem.

Change History (17)

comment:1 by kallisti5, 8 years ago

Linux test - wget https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/linux-4.9-rc6.tar.xz

07:00.0 Ethernet controller [0200]: Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller [10ec:8168] (rev 06)

Linux - x86_64 - GPT - btrfs - SATA - 60MB/s

comment:2 by kallisti5, 8 years ago

Haiku test - wget ​https://cdn.kernel.org/pub/linux/kernel/v4.x/testing/linux-4.9-rc6.tar.xz

Haiku x86_64, hrev50707, HaikuPorts wget 1.18-1 x86_64

device Network controller (Ethernet controller) [2|0|0]
  vendor 10ec: Realtek Semiconductor Co., Ltd.
  device 8168: RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
  • Haiku - x86_64 - GPT - BFS - SATA - 7.8MB/s
  • Haiku - x86_64 - GPT - BFS - RAM Disk (200M) - 8.22MB/s

That seems to show the issue is somewhere either in the network stack, NIC driver, or BFS driver.

comment:3 by kallisti5, 8 years ago

Looks like definitely network stack or network card driver.

Haiku x86_64, hrev50707

~> dd if=linux-4.9-rc6.tar.xz of=/RAM/linux-4.9-rc6.tar.xz
181945+1 records in
181945+1 records out
93156312 bytes (93 MB) copied, 0.764035 s, 122 MB/s

comment:4 by kallisti5, 8 years ago

#13089 might be one cause.

comment:5 by kallisti5, 8 years ago

This is definitely caused by #13089. If I disable TCP window scaling on Linux, I get ~8MB/s downloads.

sudo sysctl -w net.ipv4.tcp_window_scaling=0

linux-4.9-rc6.tar.xz.1  42%[===>     ]  37.76M  8.95MB/s    eta 6s

comment:6 by kallisti5, 8 years ago

Summary: Performance: Investigate bottlenecksPerformance: Investigate network I/O bottlenecks

comment:7 by kallisti5, 8 years ago

we actually do implement basic rfc1323 from what I can tell.. and wireshark shows tcp window scaling multipliers to be in use.

I did another transfer, wget from my local machine of a *larger* file and have some new behavior. The speed ramps up and down.

8MB/s, 7MB/s <10 or so seconds> 60MB/s <5 or so seconds> 2MB/s <10 or so seconds> 100MB/s <5 or so seconds> 2MB/s

So it seems like more of a congestion issue? No traffic on the local lan.

comment:8 by diver, 8 years ago

Component: - GeneralNetwork & Internet/TCP
Owner: changed from nobody to axeld

comment:9 by axeld, 8 years ago

Owner: changed from axeld to nobody
Status: newassigned

comment:10 by kallisti5, 6 years ago

We have iperf in our repositories, assuming there is nothing wrong with our port, I ran a few tests:

iperf -s
iperf -c 127.0.0.1
  • Linux x86_64 4.18.10-200.fc28.x86_64: (Ryzen 1800X)
    • TCP window size: 2.50 MByte (default)
      • [ 3] 0.0-10.0 sec 57.3 GBytes 49.3 Gbits/sec
      • [ 3] 0.0-10.0 sec 57.0 GBytes 48.9 Gbits/sec
  • Linux x86_64 4.14.16-300.fc27.x86_64: (VM @ Vultr)
    • TCP window size: 2.50 MByte (default)
      • [ 3] 0.0-10.0 sec 17.8 GBytes 15.3 Gbits/sec
      • [ 3] 0.0-10.0 sec 16.6 GBytes 14.3 Gbits/sec
  • Haiku, x86_64 hrev52295-16-g039fdd9ffe (early beta1 branch) (VM @ Vultr)
    • TCP window size: 32.0 KByte (default)
    • Uptime 15 days, 22 hours, 46 minutes - exposed to the internet.
      • [ 3] 0.0-10.0 sec 71.2 MBytes 59.7 Mbits/sec
      • [ 3] 0.0-10.0 sec 44.7 MBytes 37.5 Mbits/sec
  • Haiku, x86_64 hrev52295-16-g039fdd9ffe (early beta1 branch) (VM @ Vultr)
    • TCP window size: 32.0 KByte (default)
    • Uptime 1 minute
      • [ 3] 0.0-10.0 sec 62.2 MBytes 52.2 Mbits/sec
      • [ 3] 0.0-10.0 sec 58.6 MBytes 49.2 Mbits/sec
  • Haiku, x86_64 hrev52295-16-g039fdd9ffe (early beta1 branch) (VM @ Vultr)
    • TCP window size: 2.50 MByte
    • Uptime 7 minutes
      • [ 3] 0.0-10.0 sec 65.8 MBytes 55.2 Mbits/sec
      • [ 3] 0.0-10.0 sec 33.6 MBytes 28.2 Mbits/sec
  • Haiku x86_64 hrev52295+100: (beta1 + updates) (Dell i5 Optiplex 3010 desktop)
    • TCP window size: 32Kbyte (default)
      • [ 3] 0.0-10.0 sec 9.35 GBytes 8.03 Gbits/sec
      • [ 3] 0.0-10.0 sec 6.48 GBytes 5.56 Gbits/sec
  • Haiku x86_64 hrev52295+100: (beta1 + updates) (Dell i5 Optiplex 3010 desktop)
    • TCP window size: 2.50 Mbyte
      • [ 3] 0.0-10.0 sec 10.7 GBytes 9.23 Gbits/sec
      • [ 3] 0.0-10.0 sec 18.0 MBytes 15.1 Mbits/sec
Version 1, edited 6 years ago by kallisti5 (previous) (next) (diff)

comment:11 by kallisti5, 6 years ago

One interesting output is strace -c from iperf:

~> strace -c iperf -c 127.0.0.1
------------------------------------------------------------
Client connecting to 127.0.0.1, TCP port 5001
TCP window size: 32.0 KByte (default)
------------------------------------------------------------
[  3] local 127.0.0.1 port 40108 connected with 127.0.0.1 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  58.9 MBytes  49.4 Mbits/sec

Time % Usecs      Calls   Usecs/call Syscall
------ ---------- ------- ---------- --------------------
100.00   10079874       3    3359958 _kern_mutex_switch_lock
  0.00        128       1        128 _kern_image_relocated
  0.00         23       2         11 _kern_spawn_thread
  0.00         19       6          3 _kern_set_area_protection
  0.00          4       1          4 _kern_create_area
  0.00          3       1          3 _kern_exit_team
  0.00          3       2          1 _kern_get_system_info
  0.00          2       1          2 _kern_reserve_address_range
  0.00          2       2          1 _kern_resume_thread
  0.00          1       1          1 _kern_resize_area
  0.00          0       1          0 _kern_get_next_image_info
  0.00          0       4          0 _kern_sigaction

comment:12 by kallisti5, 6 years ago

https://github.com/google/benchmark seems like it might be helpful to see where we're at vs Linux.

Linux 4.18.10-200.fc28.x86_64:

2018-10-05 09:42:27
Running ./test/basic_test
Run on (8 X 3600 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 256K (x4)
  L3 Unified 8192K (x1)
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
***WARNING*** Library was built as DEBUG. Timings may be affected.
-------------------------------------------------------------------------------------
Benchmark                                              Time           CPU Iterations
-------------------------------------------------------------------------------------
BM_empty                                               3 ns          3 ns  233438098
BM_empty/threads:8                                     1 ns          6 ns  122371744
BM_spin_empty/8                                       46 ns         46 ns   15205421
BM_spin_empty/512                                   2695 ns       2692 ns     257997
BM_spin_empty/8192                                 43966 ns      43901 ns      16100
BM_spin_empty/8/threads:8                             12 ns         95 ns    7261392
BM_spin_empty/512/threads:8                          698 ns       5488 ns     130448
BM_spin_empty/8192/threads:8                       11046 ns      86815 ns       7984
BM_spin_pause_before/8                                46 ns         46 ns   15192276
BM_spin_pause_before/512                            2693 ns       2689 ns     267952
BM_spin_pause_before/8192                          42290 ns      42232 ns      16134
BM_spin_pause_before/8/threads:8                      12 ns         97 ns    7226928
BM_spin_pause_before/512/threads:8                   701 ns       5562 ns     128960
BM_spin_pause_before/8192/threads:8                11219 ns      88271 ns       7840
BM_spin_pause_during/8                               458 ns        459 ns    1525514
BM_spin_pause_during/512                            3326 ns       3325 ns     201580
BM_spin_pause_during/8192                          45514 ns      45407 ns      15909
BM_spin_pause_during/8/threads:8                      75 ns        594 ns    1175000
BM_spin_pause_during/512/threads:8                   771 ns       6036 ns     116296
BM_spin_pause_during/8192/threads:8                11327 ns      89091 ns       7984
BM_pause_during                                      437 ns        437 ns    1636400
BM_pause_during/threads:8                             66 ns        517 ns    1353320
BM_pause_during/real_time                            425 ns        427 ns    1620308
BM_pause_during/real_time/threads:8                   66 ns        517 ns   10697656
BM_spin_pause_after/8                                 54 ns         54 ns   10148460
BM_spin_pause_after/512                             3067 ns       3051 ns     234550
BM_spin_pause_after/8192                           47197 ns      47013 ns      13512
BM_spin_pause_after/8/threads:8                       13 ns         97 ns    8500864
BM_spin_pause_after/512/threads:8                    714 ns       5651 ns     127944
BM_spin_pause_after/8192/threads:8                 11470 ns      88168 ns       8456
BM_spin_pause_before_and_after/8                      54 ns         54 ns   13917278
BM_spin_pause_before_and_after/512                  3030 ns       3016 ns     213215
BM_spin_pause_before_and_after/8192                43801 ns      43707 ns      16402
BM_spin_pause_before_and_after/8/threads:8            12 ns         95 ns    7654360
BM_spin_pause_before_and_after/512/threads:8         719 ns       5040 ns     138824
BM_spin_pause_before_and_after/8192/threads:8      11009 ns      84438 ns       8088
BM_empty_stop_start                                    2 ns          2 ns  327797067
BM_empty_stop_start/threads:8                          0 ns          3 ns  275301568
BM_KeepRunning                                         3 ns          3 ns  244697531
BM_KeepRunningBatch                                    0 ns          0 ns 1000000091
BM_RangedFor                                           2 ns          2 ns  373357250

comment:13 by kallisti5, 6 years ago

  • Haiku x86_64 hrev52295+100: (beta1 + updates) (Dell i5 Optiplex 3010 desktop)
    • rtl81xx NIC
  • Linux x86_64 4.18.10-200.fc28.x86_64: (Ryzen 1800X)
    • Intel 1Gbit NIC
  • Cat5e, 1 Gbit switch
  • [ 3] 0.0-10.0 sec 935 MBytes 784 Mbits/sec
  • [ 3] 0.0-10.0 sec 935 MBytes 785 Mbits/sec
  • [ 3] 0.0-10.0 sec 840 MBytes 704 Mbits/sec
  • [ 3] 0.0-10.0 sec 935 MBytes 784 Mbits/sec
  • [ 3] 0.0-10.0 sec 840 MBytes 705 Mbits/sec
  • [ 3] 0.0-60.0 sec 5.37 GBytes 768 Mbits/sec

All of these seem normal over a local network :-|

comment:14 by waddlesplash, 5 years ago

No change required?

comment:15 by kallisti5, 11 months ago

This is likely improved or solved via hrev57492. We need to test though on real hardware once the nightly image builds :-)

comment:16 by kallisti5, 11 months ago

Here's iperf strace call count on hrev57492. We still spend a lot of time in _kern_mutex_switch_lock.

~> strace -c iperf -c 127.0.0.1
------------------------------------------------------------
Client connecting to 127.0.0.1, TCP port 5001
TCP window size: 32.0 KByte (default)
------------------------------------------------------------
[  1] local 127.0.0.1 port 40420 connected with 127.0.0.1 port 5001
iperf: ../haiku-git/src/system/libroot/posix/pthread/pthread_mutex.cpp:95:status_t __pthread_mutex_lock(pthread_mutex_t*, uint32, bigtime_t): mutex->owner == -1
Abort

Time % Usecs      Calls   Usecs/call Syscall
------ ---------- ------- ---------- --------------------
100.00   10016159       1   10016159 _kern_mutex_switch_lock
  0.00         68       6         11 _kern_set_area_protection
  0.00         62       6         10 _kern_resize_area
  0.00         44       2         22 _kern_spawn_thread
  0.00         40       4         10 _kern_sigaction
  0.00         33       2         16 _kern_mutex_unblock
  0.00         32       1         32 _kern_image_relocated
  0.00         27       2         13 _kern_resume_thread
  0.00         21       2         10 _kern_get_system_info
  0.00         14       1         14 _kern_open
  0.00         12       1         12 _kern_create_area
  0.00         11       1         11 _kern_get_next_image_info
  0.00         11       1         11 _kern_reserve_address_range
  0.00         10       1         10 _kern_close
  0.00         10       1         10 _kern_read

Of note, iperf was crashing now before, and after the TCP changes leading up to hrev57492

iperf3 actually shows better results:

~> strace -c iperf3 -c 127.0.0.1
Connecting to host 127.0.0.1, port 5201
[  5] local 127.0.0.1 port 40614 connected to 127.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec   633 MBytes  5.31 Gbits/sec                  
[  5]   1.00-2.00   sec   633 MBytes  5.31 Gbits/sec                  
[  5]   2.00-3.00   sec   632 MBytes  5.30 Gbits/sec                  
[  5]   3.00-4.00   sec   634 MBytes  5.31 Gbits/sec                  
[  5]   4.00-5.00   sec   632 MBytes  5.30 Gbits/sec                  
[  5]   5.00-6.00   sec   632 MBytes  5.30 Gbits/sec                  
[  5]   6.00-7.00   sec   631 MBytes  5.30 Gbits/sec                  
[  5]   7.00-8.00   sec   632 MBytes  5.30 Gbits/sec                  
[  5]   8.00-9.00   sec   630 MBytes  5.29 Gbits/sec                  
[  5]   9.00-10.00  sec   632 MBytes  5.30 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-10.00  sec  6.17 GBytes  5.30 Gbits/sec                  sender
[  5]   0.00-10.01  sec  6.17 GBytes  5.30 Gbits/sec                  receiver

iperf Done.

Time % Usecs      Calls   Usecs/call Syscall
------ ---------- ------- ---------- --------------------
 98.60    7115479  202334         35 _kern_write
  1.19      85828    5066         16 _kern_select
  0.17      12561      21        598 _kern_read
  0.01        582      55         10 _kern_read_stat
  0.00        239       9         26 _kern_open
  0.00        204      12         17 _kern_map_file
  0.00        131      12         10 _kern_set_area_protection
  0.00        114      10         11 _kern_close
  0.00        113       1        113 _kern_unlink
  0.00         89       8         11 _kern_sigaction
  0.00         81       6         13 _kern_image_relocated
  0.00         78       1         78 _kern_write_stat
  0.00         68       2         34 _kern_connect
  0.00         60       6         10 _kern_reserve_address_range
  0.00         57       5         11 _kern_normalize_path
  0.00         54       5         10 _kern_resize_area
  0.00         53       5         10 _kern_create_area
  0.00         49       5          9 _kern_register_image
  0.00         46       4         11 _kern_fcntl
  0.00         40       4         10 _kern_read_link
  0.00         34       3         11 _kern_getsockopt
  0.00         28       2         14 _kern_socket
  0.00         21       2         10 _kern_getsockname
  0.00         21       2         10 _kern_get_system_info
  0.00         20       2         10 _kern_get_thread_info
  0.00         20       2         10 _kern_get_team_usage_info
  0.00         19       1         19 _kern_unmap_memory
  0.00         11       1         11 _kern_setsockopt
  0.00         10       1         10 _kern_getuid
  0.00         10       1         10 _kern_exit_team
  0.00         10       1         10 _kern_get_next_image_info
  0.00         10       1         10 _kern_ioctl
  0.00         10       1         10 _kern_get_timezone
  0.00         10       1         10 _kern_getpeername

Anyway, i'd say comparing even the strace'd calls, the latest hrev is faster on localhost. Lets call this one closed since the data in it is slowly becoming irrelevant.

I'm sure more optimizations could be made, but we need to start comparing to Linux or BSD :-)

comment:17 by kallisti5, 11 months ago

Resolution: fixed
Status: assignedclosed
Note: See TracTickets for help on using tickets.