Opened 15 years ago
Closed 15 years ago
#5328 closed bug (fixed)
Had reserved page but threre is none after latest slab/vm changes
Reported by: | mmlr | Owned by: | bonefish |
---|---|---|---|
Priority: | high | Milestone: | R1 |
Component: | System/Kernel | Version: | R1/Development |
Keywords: | | Cc: | Jens.Arm@… |
Blocked By: | | Blocking: | |
Platform: | All | | |
Description
I updated my "server" to hrev35311. It runs a transmission-daemon, so it has quite a bit of network load. I checksummed some large files (a few GB) and after a while the machine KDLed with that message, once inside a network mbuf allocation and once during another kind of allocation that I don't remember anymore. Continuing from the panic would result in a vm_page_fault. I've reverted my kernel to hrev35195, which runs stable. I think I can reproduce it quite reliably, even though it takes a bit of time before panicking, so please tell me if I can grab anything specific from the kernel debugger.
Attachments (1)
Change History (16)
follow-up: 2 comment:1 by , 15 years ago
comment:2 by , 15 years ago
Replying to bonefish:
free count + clear count >= unreserved free pages + system reserved pages
As far as I've seen, we only panic when we actually fail to allocate a page, which hides cases where the rules are violated but we just don't happen to be that short on pages. Couldn't we introduce paranoid checking of the reserve on each allocation and then panic in case of a violation?
The policy for allocating pages is to first reserve as many as needed at maximum, allocate as many as needed, and unreserve the maximum again. The invariant can be violated by allocating more pages than reserved or by unreserving pages one has not reserved. Other than that only a bug in vm_page.cpp can cause that.
The most likely change for a potential bug is hrev35295, where I introduced the system reserve, which made the page reservation code in vm_page.cpp significantly more complicated. I've read through it several times without spotting a problem, though. Anyway, you could try the previous version and verify whether it is to blame.
It's definitely there in hrev35295; I could reproduce it with the transmission + checksum method once. A second attempt wouldn't trigger it, though. I've now put the kernel at hrev35294 and haven't been able to reproduce it yet. I'm leaving it running for a while.
comment:3 by , 15 years ago
Cc: | added |
---|
comment:4 by , 15 years ago
Replying to mmlr:
Replying to bonefish:
free count + clear count >= unreserved free pages + system reserved pages
As far as I've seen, we only panic when we actually fail to allocate a page, which hides cases where the rules are violated but we just don't happen to be that short on pages. Couldn't we introduce paranoid checking of the reserve on each allocation and then panic in case of a violation?
If you mean checking the invariant, then this is not really possible. The variables on the right hand side are accessed atomically without holding any lock. The queue counts are protected by a R/W lock + spinlock combo -- unless one holds the write lock, they can change individually. There's no way one can get a snapshot of all four variables at the same time. Introducing a mutex to make that possible would kill all concurrency and thus probably make the bug disappear, if it is some kind of race condition in vm_page.cpp.
If hrev35295 introduced the bug, I don't think a problem outside of vm_page.cpp is likely. I just read through the patch again and it really doesn't change anything related other than adding passing of the priority argument in various functions and methods. No accidental replacement of vm_page_reserve_pages() by vm_page_unreserve_pages() or changing of the numbers passed to them.
comment:6 by , 15 years ago
Priority: | normal → high |
---|---|
Status: | new → in-progress |
Found a way to reproduce the problem, seemingly reliably, in qemu with 256 MB RAM and a 2 GB image:
mkdir tt; cd tt
for dir in $(seq 100); do mkdir $dir; pushd $dir; for file in $(seq 100); do echo Hello > $file; done; popd; done
sync
cat /dev/zero > tt
[Ctrl-C after ca. 20s]
rm -rf [0-9]* tt
comment:7 by , 15 years ago
Blocking: | 5331 removed |
---|
comment:8 by , 15 years ago
Just FYI, I understand the problem now: The invariant is actually never violated, since unreserved free + system reserve can become negative. The logical flaw is that the pages vm_page_unreserve_pages() unreserves cannot be just added to the system reserve, since that implies that that many pages are actually available, which might not be the case when the unreserved free count is already negative. This can be easily remedied by checking that condition first, but steal_pages() has a similar problem, which is not so easily fixed -- at least in a way that still makes the system reserve work as it should.
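A minimal sketch of the flaw described above, as a toy model only. The struct, its fields, and both functions are hypothetical illustrations and are not taken from vm_page.cpp; in particular, the "checked" variant is just one possible reading of "checking that condition first", not the fix that was actually committed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy model only. All names are hypothetical, not the vm_page.cpp API.
struct ToyReserveState {
    int64_t unreservedFree;      // may go negative in this model
    int64_t systemReserve;       // pages currently held in the system reserve
    int64_t systemReserveTarget; // size the system reserve should have
};

// Flawed variant: returned pages refill the system reserve first, even if
// the unreserved free count is negative, i.e. the pages do not really exist.
void UnreserveNaive(ToyReserveState& s, int64_t count) {
    int64_t missing
        = std::max<int64_t>(0, s.systemReserveTarget - s.systemReserve);
    int64_t toReserve = std::min(count, missing);
    s.systemReserve += toReserve;
    s.unreservedFree += count - toReserve;
}

// Checked variant (an assumption about what "checking that condition first"
// could look like): returned pages first cover a negative unreserved free
// count, and only the remainder may refill the system reserve.
void UnreserveChecked(ToyReserveState& s, int64_t count) {
    if (s.unreservedFree < 0) {
        int64_t cover = std::min(count, -s.unreservedFree);
        s.unreservedFree += cover;
        count -= cover;
    }
    UnreserveNaive(s, count);
}
```

In both variants the sum of the two counters grows by exactly `count`; they differ only in which counter is credited first. With a deficit of 3 pages and 4 pages returned, the naive variant reports 4 pages in the system reserve although only 1 is actually backed by a free page.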
I will instead drop steal_pages() completely and rewrite the page daemon. But that may take a while.
by , 15 years ago
Attachment: | CalcSizePanic.jpg added |
---|
Panic when calculating the size of the Haiku source folder (src).
follow-up: 10 comment:9 by , 15 years ago
Panic when calculating the size of the Haiku source folder (src).
Haiku rev. 35340
comment:10 by , 15 years ago
Replying to damoklas:
Panic when calculating the size of the Haiku source folder (src).
Haiku rev. 35340
The panic only occurs while transmissioncli v0.8.2 is running at the same time (a torrent of about 60 GB on another drive); the Haiku source is on another partition.
comment:11 by , 15 years ago
@damoklas: Thanks, but no need to add more info to this ticket. The problem is understood and is being worked on.
comment:12 by , 15 years ago
Summary: | Had reserved page but threre is none after latest slab/vm changes → Had reserved page but there is none after latest slab/vm changes |
---|
Small correction to the summary, so it can be found more easily. (I encountered the same bug, but initially I didn't find this ticket.)
comment:13 by , 15 years ago
Cc: | added |
---|
comment:14 by , 15 years ago
Blocking: | 5356 added |
---|
(In #5356) Please try to at least look up the panic line in the bug tracker. There's a good chance you'll find the problem this way if it was reported already. Also, this is obviously not the installer crashing, which would be indicated by the normal app-crash-alert + gdb, but a kernel panic. Anyway, duplicate of #5328.
comment:15 by , 15 years ago
Blocking: | 5356 removed |
---|---|
Cc: | removed |
Resolution: | → fixed |
Status: | in-progress → closed |
Summary: | Had reserved page but there is none after latest slab/vm changes → Had reserved page but threre is none after latest slab/vm changes |
Unfortunately there's very little helpful information available at that point. "page_stats" will give you current page statistics. It will show that the clear and free queues are empty and probably also a violation of the page reservation invariant, which is:
free count + clear count >= unreserved free pages + system reserved pages
Most of the time the left hand side and right hand side will be equal; only while pages are being allocated will the left hand side be strictly greater.
The policy for allocating pages is to first reserve as many as needed at maximum, allocate as many as needed, and unreserve the maximum again. The invariant can be violated by allocating more pages than reserved or by unreserving pages one has not reserved. Other than that only a bug in vm_page.cpp can cause that.
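The reserve/allocate/unreserve policy can be sketched with a small toy model. Everything below is hypothetical: the names are not the vm_page.cpp API, the four kernel counters are collapsed into two, and the model assumes that an allocated page is subtracted from both sides so that unreserving the full maximum balances out again.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only. All names here are hypothetical, not the vm_page.cpp API.
struct ToyPagePool {
    int64_t freePages;       // stands in for "free count + clear count"
    int64_t unreservedPages; // stands in for "unreserved free + system reserved"

    explicit ToyPagePool(int64_t total)
        : freePages(total), unreservedPages(total) {}

    // Step 1: reserve the maximum number of pages that might be needed.
    bool Reserve(int64_t count) {
        if (count > unreservedPages)
            return false;
        unreservedPages -= count;
        return true;
    }

    // Step 2: allocate a page. Modelling assumption: an allocated page leaves
    // both counts, so unreserving the full maximum later balances out.
    bool Allocate() {
        if (freePages == 0)
            return false;
        freePages--;
        unreservedPages--;
        return true;
    }

    // Step 3: give back the maximum that was reserved in step 1.
    void Unreserve(int64_t count) { unreservedPages += count; }

    // The page reservation invariant, reduced to the two toy counters.
    bool InvariantHolds() const { return freePages >= unreservedPages; }
};
```

In this model, reserving 5 of 10 pages, allocating 3, and unreserving 5 leaves both counters at 7, so the two sides are equal again. Unreserving even one page that was never reserved makes the right hand side larger and breaks the invariant.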
The slab changes are relatively unsuspicious in that respect -- there's only one place where pages are allocated (MemoryManager::_MapChunk()) and that looks OK. The most likely change for a potential bug is hrev35295, where I introduced the system reserve, which made the page reservation code in vm_page.cpp significantly more complicated. I've read through it several times without spotting a problem, though. Anyway, you could try the previous version and verify whether it is to blame.
Finally, there's PAGE_ALLOCATION_TRACING that could be enabled, but unless one has a concrete point in time when the suspected problem occurred, it is not particularly helpful.