Opened 4 months ago

Last modified 5 weeks ago

#18717 new bug

[regression] Stability issues in block_cache

Reported by: axeld
Owned by: nobody
Priority: normal
Milestone: R1/beta5
Component: System/Kernel
Version: R1/Development
Keywords:
Cc:
Blocked By:
Blocking:
Platform: All

Description

I had hrev56721 and hrev56921 running for a couple of months on my NAS without any issues (it would run for weeks). Since I upgraded to hrev57387, I have had stability issues, although the usage pattern has not changed. It now lasts maybe a week before it crashes, and sometimes it crashes even sooner.

Some KDL screenshots attached.

Attachments (7)

HaikuScreenshots-2023-11-18.png (148.0 KB) - added by axeld 4 months ago.
HaikuScreenshots-2023-11-18-2.png (46.4 KB) - added by axeld 4 months ago.
HaikuScreenshots-2023-12-09.png (37.9 KB) - added by axeld 4 months ago.
HaikuScreenshots-2024-01-05.png (140.8 KB) - added by axeld 4 months ago.
  KDL
HaikuScreenshots-2024-01-05-2.png (58.0 KB) - added by axeld 4 months ago.
  End of syslog before KDL
HaikuScreenshots-2024-01-09.png (137.0 KB) - added by axeld 4 months ago.
HaikuScreenshots-2024-01-19.png (53.0 KB) - added by axeld 3 months ago.


Change History (24)

by axeld, 4 months ago

by axeld, 4 months ago

comment:1 by waddlesplash, 4 months ago

Milestone: Unscheduled → R1/beta5

comment:2 by waddlesplash, 4 months ago

Summary: Stability issues → [regression] Stability issues

comment:3 by waddlesplash, 4 months ago

Possibly some kind of memory corruption. The x86VMTranslationMap assert is covered by other tickets, but those may all just be due to memory issues as well.

comment:4 by waddlesplash, 4 months ago

Any chance you can try to narrow down the problem range further?

comment:5 by waddlesplash, 4 months ago

Do you have any custom kernel add-ons by any chance? I think there were some "minor" ABI changes in there (to condition variable struct sizes, at least.)

Also in that range:

  • clear page queue ordering (hrev57360)
  • VM changes to cut_area (hrev57096 and ~1) and unlock caches before unmapping addresses (hrev57062)
  • user_mutex refactor (multiple hrevs)

(That's just for the kernel itself; I didn't skim through driver changes.)

But as noted above, I highly suspect this is just some kind of memory corruption. What drivers are loaded after booting?

comment:6 by axeld, 4 months ago

Sorry, I didn't get or see any notification mail. I do actually use the DriveEncryption driver which, if I'm not mistaken, does make use of kernel locking facilities. I'll try to recompile that one with a newer kernel, and see if that changes anything.

comment:7 by axeld, 4 months ago

I have switched to hrev57493, and recompiled the driver a couple of days ago.

Today I first got strange problems like "I/O error" or "File not exists error" while I could open the file just fine, and also Tracker could open the directory that caused the latter error. Then later, the system crashed with the attached output.

by axeld, 4 months ago

KDL

by axeld, 4 months ago

End of syslog before KDL

comment:8 by axeld, 4 months ago

It might still be caused by that driver, actually. Do you know offhand what kernel config is being used for the nightlies? Does that end up somewhere on the image (the kernel_debug_config.h file)? And how can I find out?

I used KERNEL_DEBUG 2 to compile that driver, so that might not have been compatible.

comment:9 by waddlesplash, 4 months ago

KDEBUG_LEVEL 2 is correct. But are you not using the exact same headers/sources to compile the driver as the nightly you are running on, including the file where KDEBUG_LEVEL is defined?

A thought occurs to me: within that range, the size of the ConditionVariable class changed, and with it the IORequest classes as they contain ConditionVariable members. The "I/O error" and the like might be caused by those malfunctioning, and then the memory corruptions also.
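To illustrate the concern above, here is a minimal sketch of how a size change in an embedded member shifts the offsets of everything after it. All types and names here are hypothetical stand-ins, not the real Haiku `ConditionVariable` or `IORequest` layouts; the only point is that code compiled against the old layout reads the wrong bytes when the kernel uses the new one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-ins (NOT the real Haiku classes): the layout a stale
// binary was compiled against ("Old") and the layout the kernel uses ("New").
struct OldConditionVariable {
    void* object;
    void* entries;
};

struct NewConditionVariable {
    void* object;
    void* entries;
    void* extra;    // hypothetical added member that grows the struct
};

// A request-like class that embeds the condition variable by value.
struct OldRequest {
    OldConditionVariable finished;
    int32_t status;
};

struct NewRequest {
    NewConditionVariable finished;
    int32_t status;
};

static_assert(sizeof(OldConditionVariable) != sizeof(NewConditionVariable),
    "the embedded member changed size");
static_assert(offsetof(OldRequest, status) != offsetof(NewRequest, status),
    "so every member after it moved");

// Simulates stale code: it reads "status" at the offset it was compiled
// with, although the object was laid out according to the new definition.
inline int32_t ReadStatusWithStaleLayout(const NewRequest& request)
{
    const unsigned char* bytes
        = reinterpret_cast<const unsigned char*>(&request);
    int32_t status;
    std::memcpy(&status, bytes + offsetof(OldRequest, status),
        sizeof(status));
    return status;
}

// Usage sketch: the new-layout side stores an error, the stale reader
// looks at bytes of "finished.extra" instead and misses it.
inline void DemonstrateStaleRead()
{
    NewRequest request{};    // all members zero-initialized
    request.status = -1;     // an error is reported
    assert(ReadStatusWithStaleLayout(request) != request.status);
}
```

Writes through a stale layout would land in the wrong member in the same way, which is one route by which a mismatch like this could turn into memory corruption rather than just a wrong status.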

comment:10 by axeld, 4 months ago

I use the headers as they come with the Haiku package. But this does not define the correct KDEBUG_LEVEL; I use my own kernel_header_config.h file for that. I don't use IORequests in that driver yet, it uses standard read/write hooks.

It does make use of condition variables, though. But any issues there should be fixed by a recompile.

Last edited 4 months ago by axeld

comment:11 by waddlesplash, 4 months ago

I made changes a while ago that were supposed to make KDEBUG_LEVEL 2 binaries ABI compatible with KDEBUG_LEVEL 0 kernels, at least as far as locks go (note that the reverse is not true, however.)

Are you sure that your Haiku _devel package is up to date with the main Haiku package? The ABI change to ConditionVariable was hrev57320.

comment:12 by axeld, 4 months ago

I have not used the driver in this session, but today it crashed again, after about two days of uptime. Screenshot attached.

by axeld, 4 months ago

comment:13 by waddlesplash, 4 months ago

Do you have any other drivers in "non-packaged" that are being loaded/used? (You can check with 'listimage'.)

comment:14 by axeld, 4 months ago

I have no non-packaged drivers at all, and "encrypted_drive" is the only foreign driver.

comment:15 by axeld, 4 months ago

I'm currently at 12 days uptime with the driver in use, but I have to restart the host today.

comment:16 by axeld, 3 months ago

I removed the driver, and just updated to the current revision. Immediately after the reboot, I got greeted by this KDL.

by axeld, 3 months ago

comment:17 by pulkomandy, 5 weeks ago

Summary: [regression] Stability issues → [regression] Stability issues in block_cache

All but one of the attached screenshots show either a NULL pointer dereference in block_cache_get -> object_cache_alloc, or block_cache_put failing to write back some blocks.

I'm adjusting the ticket title, although this could be either a bug in the block cache or possibly one in the underlying disk I/O.
