Opened 16 months ago

Last modified 10 days ago

#17208 new bug

Memory usage constantly increases on network with lots of devices

Reported by: pakyr Owned by: nobody
Priority: critical Milestone: Unscheduled
Component: Network & Internet Version: R1/beta3
Keywords: Cc: korli
Blocked By: Blocking:
Platform: All

Description

On a network with many (thousands of) devices, memory usage constantly increases at a rate of one megabyte every 3-10 seconds, until the system begins swapping to disk and eventually crashes. This seems to be caused by a massive number of mDNS packets (even when totally idle, the system receives ~100 kbps). Attached is a Wireshark capture showing the mDNS packets being received. This issue doesn't appear on networks without lots of devices, and the only difference I could find was all the mDNS packets, which is why I assume they are somehow the cause.

Attachments (4)

capture.pcapng (1.2 MB) - added by pakyr 16 months ago.
Wireshark Capture
syslog (173.3 KB) - added by pakyr 16 months ago.
syslog
syslog_second (239.9 KB) - added by pakyr 5 weeks ago.
Ran slabs, connected to network and waited for usage to go up several MB, then ran it again
syslog_third (317.5 KB) - added by pakyr 5 weeks ago.
My bad; booted from latest nightly live USB (my install suddenly crapped out for some reason), connected, ran slabs, waited for usage to increase by several MB, then ran slabs again


Change History (33)

by pakyr, 16 months ago

Attachment: capture.pcapng added

Wireshark Capture

comment:1 by pulkomandy, 16 months ago

"the memory usage" is a bit vague. Can you at least identify which team/process is leaking memory? Is it kernel? Is it net_server? Is it something else?

comment:2 by pakyr, 16 months ago

Kernel Team.

comment:3 by Coldfirex, 16 months ago

Would there be a way to simulate this?

comment:4 by pakyr, 16 months ago

No clue. All I can say is that it's happened to me on two networks, one at a large business, and the other at a large university. I could upload a syslog, or a video of the behavior, though I doubt either would help much.

comment:5 by Coldfirex, 16 months ago

Let's try a syslog at least. I also wonder about testing an install not behind a router/firewall.

by pakyr, 16 months ago

Attachment: syslog added

syslog

comment:6 by pakyr, 16 months ago

Here you go. I booted the machine, let it sit idle for about a minute with memory usage stable at 376 MB, then connected to the network and let it sit idle for another minute, during which memory usage increased by ~15 MB. That's when I grabbed the syslog. Also, I'm not sure what you mean by "install not behind a router/firewall".

Last edited 16 months ago by pakyr (previous) (diff)

comment:7 by Coldfirex, 16 months ago

Maybe we could test with iperf or something similar to replicate easily? https://taosecurity.blogspot.com/2006/09/generating-multicast-traffic.html
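For reference, a rough sketch of generating multicast traffic with iperf (version 2), following the approach in the linked blog post; the group address, TTL, and durations here are illustrative assumptions, not values from this ticket:

```shell
# Receiver: bind an iperf UDP server to a multicast group address
iperf -s -u -B 226.94.1.1 -i 1

# Sender (on another host on the same LAN): send UDP traffic
# to that multicast group with a TTL of 32, for 60 seconds
iperf -c 226.94.1.1 -u -T 32 -t 60 -i 1
```

This would only approximate the mDNS flood (different port and packet sizes), but it might be enough to trigger the same multicast receive path.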

comment:8 by waddlesplash, 16 months ago

Please try these things (one after the other, if none initially produces any result):

  1. Disabling the WiFi altogether, and see if the memory usage goes back down.
  2. Connecting via ethernet instead of WiFi, and seeing if the memory usage still goes up the same way.
  3. Booting Haiku in a virtual machine with a virtio network adapter, and seeing if the memory usage still goes up the same way.

comment:9 by pakyr, 16 months ago

  1. The increase stopped, but the usage did not go back down at all, even after waiting 20 minutes.
  2. Unable to test this right now.
  3. Booted in VMware on a different PC, and had the same issue, but only when the network adapter was in 'bridged' mode instead of 'NAT' mode.

I also thought it may have been due to something I installed on my main Haiku installation, so I booted my laptop using a beta 3 installer, and had the same issue.

comment:10 by pakyr, 16 months ago

Update: Was able to test with an ethernet cable on the original laptop; no change, the issue manifested in the same way.

comment:11 by waddlesplash, 16 months ago

Component: General → Network & Internet

This sounds like it should be relatively easy to reproduce and track down where the memory is really going, then. I'll see if I can take a look before too long.

comment:12 by waddlesplash, 12 months ago

Milestone: Unscheduled → R1/beta4
Priority: normal → critical

comment:13 by waddlesplash, 12 months ago

Cc: korli added

CC korli: mDNS packets are UDP multicast, which you re-enabled last year.

comment:14 by pulkomandy, 12 months ago

Just to clarify, is the fix in https://review.haiku-os.org/c/haiku/+/4791 related to this?

comment:15 by waddlesplash, 12 months ago

I don't think it is, but I could be mistaken. I didn't test it with mDNS anyway.

comment:16 by waddlesplash, 12 months ago

Actually, the commit that patch fixed was only made in November, while this ticket was opened in August. This problem clearly predates that one, so it isn't related.

comment:17 by korli, 12 months ago

I tried to replay the capture dump locally with tcpreplay; the Haiku host sees the packets coming in slowly (checked with tcpdump). It's difficult to notice any effect on memory usage during the replay. Maybe the replay is too slow to reproduce the issue.

comment:18 by pulkomandy, 12 months ago

It should be possible to change the replay speed with the tcpreplay options -p, -x, or -t.

https://linux.die.net/man/1/tcpreplay
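A sketch of how those options might be used to replay the attached capture faster; the interface name is an illustrative assumption, and older tcpreplay/libpcap versions may need the pcapng file converted to classic pcap first:

```shell
# Replay the capture at 10x the original timing
tcpreplay --intf1=eth0 -x 10 capture.pcapng

# Or replay as fast as the hardware allows ("topspeed")
tcpreplay --intf1=eth0 -t capture.pcapng

# Or replay at a fixed rate of 500 packets per second
tcpreplay --intf1=eth0 -p 500 capture.pcapng
```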

comment:19 by waddlesplash, 9 months ago

I managed to reproduce this, though the rate is much slower for me (0.1 MB every 3-10 seconds). When it is occurring, more and more slab areas are created, e.g.

slab memory manager: created area 0xffffffffb0801000 (240115)
slab memory manager: created area 0xffffffffb1001000 (240117)

So, I dropped into KDL and dumped all object caches (which include net_buffers and the default malloc heap), exited, waited a while (without doing anything), then did it again. Here are just the differences:

            address                   name  objsize    align    usage  empty  usedobj    total    flags
-0xffffffff82006570    block allocator: 48       48        8  6066176      0   124381   124404 80000000
+0xffffffff82006570    block allocator: 48       48        8  6881280      0   141051   141120 80000000
-0xffffffff82006720    block allocator: 64       64       64  2445312      0    37593    37611 80000000
+0xffffffff82006720    block allocator: 64       64       64  2527232      0    36003    38871 80000000
-0xffffffff82006de0   block allocator: 128      128      128 14479360      0   109577   109585 80000000
+0xffffffff82006de0   block allocator: 128      128      128 17960960      0   135930   135935 80000000
-0xffffffff82008510   block allocator: 256      256      256 16846848      0    61686    61695 80000000
+0xffffffff82008510   block allocator: 256      256      256 20885504      0    76482    76485 80000000
-0xffffffff82008a80   block allocator: 448      448        8 21102592      0    46366    46368 80000000
+0xffffffff82008a80   block allocator: 448      448        8 27103232      0    59551    59553 80000000
-0xffffffff8200b800  block allocator: 4096     4096     4096   458752      1       86      112 88000000
+0xffffffff8200b800  block allocator: 4096     4096     4096   458752      1       85      112 88000000
-0xffffffff8200c800  block allocator: 8192     8192     8192   655360      0       76       80 88000000
+0xffffffff8200c800  block allocator: 8192     8192     8192   655360      0       77       80 88000000
-0xffffffff8200de00             cache refs       16        8   794624      0    48815    48888        0
+0xffffffff8200de00             cache refs       16        8  1011712      0    61994    62244        0
-0xffffffff8200d8c0           vnode caches      224        8 10526720      0    46251    46260        0
+0xffffffff8200d8c0           vnode caches      224        8 13529088      0    59451    59454        0
-0xffffffff8200d540            null caches      192        8    81920      0      420      420        0
+0xffffffff8200d540            null caches      192        8    86016      0      429      441        0
-0xffffffff823adc48          cached blocks      104        8  5242880      0    49705    50400 20000000
+0xffffffff823adc48          cached blocks      104        8  6815744      0    63165    65520 20000000
-0xffffffff823bca00    block cache buffers     2048        8 203948032      0    99412    99584 20000000
+0xffffffff823bca00    block cache buffers     2048        8 258998272      0   126330   126464 20000000

comment:20 by waddlesplash, 9 months ago

And of course, now that I've created some testing images, I can't seem to reproduce it.

Has anyone else managed to replicate this reliably?

comment:21 by waddlesplash, 5 weeks ago

Milestone: R1/beta4 → Unscheduled

No reply, deprioritizing.

comment:22 by pakyr, 5 weeks ago

Not sure if the question was targeted at me since I'm the one who reported it, but the issue is still happening.

comment:23 by waddlesplash, 5 weeks ago

It was. If you can reproduce the issue, please drop to KDL, run the slabs command, exit KDL, wait for memory to go up by a significant amount (at least a few MB), then drop to KDL and run the slabs command again; then exit KDL and upload a copy of your syslog here.

comment:24 by pakyr, 5 weeks ago

I tried entering KDL using the Alt-SysReq-D shortcut from this link (https://www.haiku-os.org/documents/dev/welcome_to_kernel_debugging_land/), but it did not work. Is there any other way to enter? If not, I will dig up another keyboard when I get the chance.

comment:25 by waddlesplash, 5 weeks ago

You can enter via /bin/kernel_debugger. However, if you cannot manage to enter it via the keyboard shortcut, odds are your keyboard will not work in KDL either... but it's worth a try.
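For reference, a sketch of the procedure from a Terminal on the affected machine; the KDL prompt lines are shown as comments since they are typed inside the kernel debugger, not the shell:

```shell
# Drop into KDL from userland (no keyboard shortcut needed to enter)
/bin/kernel_debugger

# Once in KDL, keyboard permitting:
#   kdebug> slabs      # dump the slab object caches (goes to syslog)
#   kdebug> continue   # exit KDL and resume the system
```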

comment:26 by waddlesplash, 5 weeks ago

You also noted above that you can reproduce this in VMware; that may be a reliable way to drop into KDL with a working keyboard.

by pakyr, 5 weeks ago

Attachment: syslog_second added

Ran slabs, connected to network and waited for usage to go up several MB, then ran it again

comment:27 by waddlesplash, 5 weeks ago

Sorry if I was not clear, but I meant to run slabs the first time while already connected to the network, i.e. do as little as possible between the runs and just let memory usage go up. Also, please indicate approximately how much memory was in use at the time of the first run and at the time of the second run.

by pakyr, 5 weeks ago

Attachment: syslog_third added

My bad; booted from latest nightly live USB (my install suddenly crapped out for some reason), connected, ran slabs, waited for usage to increase by several MBs, then ran slabs again

comment:28 by waddlesplash, 10 days ago

Significant changes:

                  address                   name  objsize    align    usage  empty  usedobj    total    flags
-KERN: 0xffffffff820067c0    block allocator: 64       64       64   921600      0    14158    14175 80000000
+KERN: 0xffffffff820067c0    block allocator: 64       64       64  1085440      0    16671    16695 80000000

-KERN: 0xffffffff8e17c400       net buffer cache      360        8  7708672      0    20696    20702        0
+KERN: 0xffffffff8e17c400       net buffer cache      360        8 15294464      0    41071    41074        0

-KERN: 0xffffffff8dec4400        data node cache     2048        8 42401792      0    20696    20704        0
+KERN: 0xffffffff8dec4400        data node cache     2048        8 84148224      0    41071    41088        0

-KERN: 0xffffffff8e27ec60                  mbufs      256        8   118784      0      414      435        0
+KERN: 0xffffffff8e27ec60                  mbufs      256        8   315392      0      414     1155        0

-KERN: 0xffffffff8e21fe00 mbuf jumbo chunks          4096        8  1310720      0      305      320        0
+KERN: 0xffffffff8e21fe00 mbuf jumbo chunks          4096        8  2228224      0      309      544        0

The mbufs areas are much more fragmented, but there isn't an actual increase in used objects. On the other hand, the net buffer and data node caches have doubled in object usage. That makes it fairly clear where the leak is and what is leaking.
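As a quick sanity check on the diff above (my addition, not from the original dumps), the usedobj deltas of the two suspect caches can be compared directly; the numbers are copied from the two slabs runs:

```python
# usedobj counts from the two KDL "slabs" dumps above
net_buffer = {"before": 20696, "after": 41071}   # net buffer cache
data_node  = {"before": 20696, "after": 41071}   # data node cache

nb_delta = net_buffer["after"] - net_buffer["before"]
dn_delta = data_node["after"] - data_node["before"]
print(nb_delta, dn_delta)  # 20375 20375

# Identical deltas: each leaked net_buffer appears to hold exactly one
# data node, consistent with whole buffers being leaked rather than
# stray mbufs (whose used-object count did not grow at all).
```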

comment:29 by waddlesplash, 10 days ago

My own knowledge of the net_buffer and data-node system is not very great. I don't know where one would start tracing a leak of buffers (or, possibly, something is just putting them into a queue and never dequeuing them). Do any other developers know how we might pinpoint where the buffers are winding up?
