Opened 3 years ago
Closed 8 months ago
#17208 closed bug (fixed)
Memory usage constantly increases on network with lots of devices
Reported by: | pakyr | Owned by: | nobody |
---|---|---|---|
Priority: | critical | Milestone: | R1/beta5 |
Component: | Network & Internet | Version: | R1/beta3 |
Keywords: | Cc: | korli | |
Blocked By: | Blocking: | #18585 | |
Platform: | All |
Description
On a network with many (thousands of) devices, memory usage constantly increases at a rate of one megabyte every 3-10 seconds, until the system begins swapping to disk and eventually crashes. This seems to be because of a massive number of MDNS packets (even when totally idle, the system receives ~100 kbps). Included is a Wireshark capture showing the MDNS packets being received. This issue doesn't appear on networks without lots of devices, and the only difference was all the MDNS packets, which is why I assume they are somehow the cause.
Attachments (2)
Change History (36)
comment:1 by , 3 years ago
comment:4 by , 3 years ago
No clue. All I can say is that it's happened to me on two networks, one at a large business, and the other at a large university. I could upload a syslog, or a video of the behavior, though I doubt either would help much.
comment:5 by , 3 years ago
Let's try a syslog at least. Wonder about an install not behind a router/firewall?
comment:6 by , 3 years ago
Here you go. I booted the machine, let it sit idle for about a minute with memory usage stable at 376 MB, then connected to the network and let it sit idle for another minute, during which memory usage increased by ~15 MB. That's when I grabbed the syslog. Also, not sure what you mean by "install not behind a router/firewall".
comment:7 by , 3 years ago
Maybe we could test with iperf or something similar to replicate easily? https://taosecurity.blogspot.com/2006/09/generating-multicast-traffic.html
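One way to approximate the traffic without access to a large network is to send synthetic mDNS queries to the standard multicast group (224.0.0.251, UDP port 5353) from another machine on the same segment. The sketch below is only an assumption about what might trigger the problem: the real capture contains traffic from thousands of distinct devices, and a single sender may not exercise the same code path. The function names, the default service name, and the rate parameters are hypothetical.

```python
import socket
import struct
import time

MDNS_GROUP = "224.0.0.251"  # standard mDNS IPv4 multicast group
MDNS_PORT = 5353            # standard mDNS UDP port


def build_mdns_query(name, qtype=12):
    """Build a minimal one-question DNS query (QTYPE 12 = PTR, QCLASS 1 = IN)."""
    # Header: ID=0, flags=0, QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0
    header = struct.pack("!6H", 0, 0, 1, 0, 0, 0)
    # QNAME: length-prefixed labels terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.strip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!2H", qtype, 1)


def flood(packets_per_second, duration_s, name="_services._dns-sd._udp.local"):
    """Send the same query repeatedly to the mDNS group; return packets sent."""
    payload = build_mdns_query(name)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    # TTL 1 keeps the packets on the local segment
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
    interval = 1.0 / packets_per_second
    deadline = time.monotonic() + duration_s
    sent = 0
    try:
        while time.monotonic() < deadline:
            sock.sendto(payload, (MDNS_GROUP, MDNS_PORT))
            sent += 1
            time.sleep(interval)
    finally:
        sock.close()
    return sent
```

Calling `flood(200, 60)` on a second machine while watching memory usage on the Haiku host would show whether plain query volume is enough to reproduce the growth.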
comment:8 by , 3 years ago
Please try these things (one after the other, if none initially produces any result):
- Disabling the WiFi altogether, and see if the memory usage goes back down.
- Connecting via ethernet instead of WiFi, and seeing if the memory usage still goes up the same way.
- Booting Haiku in a virtual machine with a virtio network adapter, and seeing if the memory usage still goes up the same way.
comment:9 by , 3 years ago
- The increase stopped, but the usage did not go back down at all, even after waiting 20 minutes.
- Unable to test this right now.
- Booted in VMWare on a different PC, and had the same issue, but only when the network adapter was in 'bridged' mode instead of 'NAT' mode.
I also thought it may have been due to something I installed on my main Haiku installation, so I booted my laptop using a beta 3 installer, and had the same issue.
comment:10 by , 3 years ago
Update: Was able to test with an ethernet cable on the original laptop; no change, the issue manifested in the same way.
comment:11 by , 3 years ago
Component: | - General → Network & Internet |
---|
This sounds like it should be relatively easy to reproduce and track down where the memory is really going, then. I'll see if I can take a look before too long.
comment:12 by , 3 years ago
Milestone: | Unscheduled → R1/beta4 |
---|---|
Priority: | normal → critical |
comment:13 by , 3 years ago
Cc: | added |
---|
CC korli: MDNS packets are UDP multicast, which you reenabled last year.
comment:14 by , 3 years ago
Just to clarify, is the fix in https://review.haiku-os.org/c/haiku/+/4791 related to this?
comment:15 by , 3 years ago
I don't think it is, but I could be mistaken. I didn't test it with mDNS anyway.
comment:16 by , 3 years ago
Actually, the commit that patch fixed was made only in November, and this ticket was opened in August. This problem clearly predates that one, so it isn't related.
comment:17 by , 3 years ago
I tried to replay the capture dump locally with tcpreplay; the Haiku host sees the packets arriving slowly (checked with tcpdump). It's difficult to notice any change in used memory as a result of the replay. Maybe the replay is too slow to reproduce the issue.
comment:18 by , 3 years ago
It should be possible to change the replay speed with tcpreplay options -p, -x or -t
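As a rough illustration of those options, the sketch below builds a tcpreplay command line with exactly one speed setting: `-x` (multiplier of the captured rate), `-p` (fixed packets per second) or `-t` (top speed). The helper name, the capture file name and the interface name are hypothetical, and the right values depend on the test setup.

```python
def tcpreplay_command(pcap, interface, multiplier=None, pps=None, topspeed=False):
    """Return an argv list for tcpreplay using exactly one speed option."""
    chosen = [multiplier is not None, pps is not None, bool(topspeed)]
    if sum(chosen) != 1:
        raise ValueError("choose exactly one of multiplier, pps, topspeed")
    argv = ["tcpreplay", "-i", interface]
    if multiplier is not None:
        argv += ["-x", str(multiplier)]  # replay at N times the captured rate
    elif pps is not None:
        argv += ["-p", str(pps)]         # replay at a fixed packets-per-second rate
    else:
        argv.append("-t")                # replay as fast as possible
    argv.append(pcap)
    return argv
```

For example, `tcpreplay_command("mdns.pcap", "eth0", multiplier=10)` yields a list that can be passed to `subprocess.run` on the sending machine.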
comment:19 by , 3 years ago
I managed to reproduce this, though the rate is much slower for me (0.1MB every 3-10 seconds.) When it is occurring, more and more slab areas are created, e.g.
slab memory manager: created area 0xffffffffb0801000 (240115)
slab memory manager: created area 0xffffffffb1001000 (240117)
So, I dropped into KDL and dumped all object caches (includes net_buffers and the default malloc heap), exited, waited a while (without doing anything), then did it again. Here are just the differences:
address name objsize align usage empty usedobj total flags
-0xffffffff82006570 block allocator: 48 48 8 6066176 0 124381 124404 80000000
+0xffffffff82006570 block allocator: 48 48 8 6881280 0 141051 141120 80000000
-0xffffffff82006720 block allocator: 64 64 64 2445312 0 37593 37611 80000000
+0xffffffff82006720 block allocator: 64 64 64 2527232 0 36003 38871 80000000
-0xffffffff82006de0 block allocator: 128 128 128 14479360 0 109577 109585 80000000
+0xffffffff82006de0 block allocator: 128 128 128 17960960 0 135930 135935 80000000
-0xffffffff82008510 block allocator: 256 256 256 16846848 0 61686 61695 80000000
+0xffffffff82008510 block allocator: 256 256 256 20885504 0 76482 76485 80000000
-0xffffffff82008a80 block allocator: 448 448 8 21102592 0 46366 46368 80000000
+0xffffffff82008a80 block allocator: 448 448 8 27103232 0 59551 59553 80000000
-0xffffffff8200b800 block allocator: 4096 4096 4096 458752 1 86 112 88000000
+0xffffffff8200b800 block allocator: 4096 4096 4096 458752 1 85 112 88000000
-0xffffffff8200c800 block allocator: 8192 8192 8192 655360 0 76 80 88000000
+0xffffffff8200c800 block allocator: 8192 8192 8192 655360 0 77 80 88000000
-0xffffffff8200de00 cache refs 16 8 794624 0 48815 48888 0
+0xffffffff8200de00 cache refs 16 8 1011712 0 61994 62244 0
-0xffffffff8200d8c0 vnode caches 224 8 10526720 0 46251 46260 0
+0xffffffff8200d8c0 vnode caches 224 8 13529088 0 59451 59454 0
-0xffffffff8200d540 null caches 192 8 81920 0 420 420 0
+0xffffffff8200d540 null caches 192 8 86016 0 429 441 0
-0xffffffff823adc48 cached blocks 104 8 5242880 0 49705 50400 20000000
+0xffffffff823adc48 cached blocks 104 8 6815744 0 63165 65520 20000000
-0xffffffff823bca00 block cache buffers 2048 8 203948032 0 99412 99584 20000000
+0xffffffff823bca00 block cache buffers 2048 8 258998272 0 126330 126464 20000000
comment:20 by , 3 years ago
And of course, now that I've created some testing images, I can't seem to reproduce it.
Has anyone else managed to replicate this repeatedly?
comment:22 by , 2 years ago
Not sure if the question was targeted at me since I'm the one who reported it, but the issue is still happening.
comment:23 by , 2 years ago
It was. If you can reproduce the issue, please drop to KDL, run the slabs command, exit KDL, wait for memory to go up by a significant amount (at least a few MB), then drop to KDL and run the slabs command again; then exit KDL and upload a copy of your syslog here.
comment:24 by , 2 years ago
I tried entering KDL using the alt-sysreq-D shortcut from this link (https://www.haiku-os.org/documents/dev/welcome_to_kernel_debugging_land/), but it did not work. Is there any other way to enter? If not, I will dig up another keyboard when I get the chance.
comment:25 by , 2 years ago
You can enter via /bin/kernel_debugger. However, if you cannot manage to enter it via the keyboard shortcut, odds are your keyboard will not work in KDL... but it's worth a try.
comment:26 by , 2 years ago
You also noted above that you can reproduce this in VMware; that may be a reliable way to drop into KDL and have a working keyboard.
comment:27 by , 2 years ago
Sorry if I was not clear, but I meant to run slabs the first time while already connected to the network, i.e. do as little as possible in between runs, just let memory usage go up. Also please indicate approximately how much memory was used at the time of the first run and then at the time of the second run.
by , 2 years ago
Attachment: | syslog_third added |
---|
My bad; booted from latest nightly live USB (my install suddenly crapped out for some reason), connected, ran slabs, waited for usage to increase by several MBs, then ran slabs again
comment:28 by , 2 years ago
Significant changes:
address name objsize align usage empty usedobj total flags
-KERN: 0xffffffff820067c0 block allocator: 64 64 64 921600 0 14158 14175 80000000
+KERN: 0xffffffff820067c0 block allocator: 64 64 64 1085440 0 16671 16695 80000000
-KERN: 0xffffffff8e17c400 net buffer cache 360 8 7708672 0 20696 20702 0
+KERN: 0xffffffff8e17c400 net buffer cache 360 8 15294464 0 41071 41074 0
-KERN: 0xffffffff8dec4400 data node cache 2048 8 42401792 0 20696 20704 0
+KERN: 0xffffffff8dec4400 data node cache 2048 8 84148224 0 41071 41088 0
-KERN: 0xffffffff8e27ec60 mbufs 256 8 118784 0 414 435 0
+KERN: 0xffffffff8e27ec60 mbufs 256 8 315392 0 414 1155 0
-KERN: 0xffffffff8e21fe00 mbuf jumbo chunks 4096 8 1310720 0 305 320 0
+KERN: 0xffffffff8e21fe00 mbuf jumbo chunks 4096 8 2228224 0 309 544 0
The mbufs areas are much more fragmented, but there isn't an actual increase in the used objects. On the other hand, the net buffer cache and data node cache have roughly doubled in object usage (from 20696 to 41071 used objects each). That seems pretty clear as to where and what the leak is.
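The comparison above was done by eye; a small script can do the same for larger dumps. The sketch below assumes the column layout shown in this ticket (address, name, then seven trailing columns: objsize, align, usage, empty, usedobj, total, flags) and an optional `KERN:` prefix as it appears in the syslog. It expects two raw `slabs` outputs, not a diff. The function names are hypothetical and not part of any Haiku tooling.

```python
def parse_slabs(text):
    """Map cache address -> (name, usage_bytes, used_objects) for each row."""
    caches = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("KERN:"):
            line = line[len("KERN:"):].strip()
        if not line.startswith("0x"):
            continue  # header or unrelated syslog line
        # Cache names may contain spaces, so split the seven columns off the right
        parts = line.rsplit(None, 7)
        if len(parts) != 8:
            continue
        head, _objsize, _align, usage, _empty, usedobj, _total, _flags = parts
        address, _, name = head.partition(" ")
        try:
            caches[address] = (name.strip(), int(usage), int(usedobj))
        except ValueError:
            continue  # not a cache row after all
    return caches


def growth(before, after):
    """Return (name, used-object delta, usage delta in bytes), largest first."""
    rows = []
    for address, (name, usage, used) in after.items():
        if address in before:
            _, old_usage, old_used = before[address]
            rows.append((name, used - old_used, usage - old_usage))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Fed the two dumps from syslog_third, this would put the net buffer cache and data node cache at the top (20375 more used objects each), while mbufs shows a used-object delta of zero despite its larger footprint.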
comment:29 by , 2 years ago
My own knowledge of the net_buffer and data-node system is not very great. I don't know where one would start tracing a leak of buffers (or, possibly, something is just throwing them into a queue and never dequeuing them.) Any other developers know how we might try to pinpoint where the buffers are winding up?
comment:30 by , 22 months ago
Just a quick note - for the first time, I had the opportunity to test this issue on a different machine on bare metal with a different network card, and can confirm that it happens on there as well
comment:31 by , 15 months ago
Blocking: | 18585 added |
---|
comment:34 by , 8 months ago
Milestone: | Unscheduled → R1/beta5 |
---|---|
Resolution: | → fixed |
Status: | new → closed |
Good, thanks for testing!
"the memory usage" is a bit vague. Can you at least identify which team/process is leaking memory? Is it kernel? Is it net_server? Is it something else?