Opened 40 hours ago

Last modified 37 hours ago

#19213 new bug

utimensat assigns arbitrary subsecond timestamps

Reported by: bhaible Owned by: axeld
Priority: normal Milestone: Unscheduled
Component: File Systems/BFS Version: R1/beta5
Keywords: Cc: axeld
Blocked By: Blocking: #19212, #19214, #19215
Platform: All

Description

The utimensat() function, specified by POSIX https://pubs.opengroup.org/onlinepubs/9799919799/functions/utimensat.html, ought to set the access time and modification time of a file to the specified values. The values consist of an entire seconds part (tv_sec) and a subsecond part (tv_nsec).

On Haiku (I'm using a July 2024 snapshot) the subsecond part is ignored, and instead an arbitrary value is used. If this arbitrary value was the same for all files, it would be OK. But no, the arbitrary value are different!

This has the effect that files that should have the same modification time suddenly have different modification times. And thus, when I have a Makefile rule that generates one of the files from the other one, this Makefile rule will trigger, although it shouldn't.

How to reproduce: Save fu.c (attached).

$ gcc -Wall fu.c
$ ./a.out
$ ls -lrt --full-time dummy*
-rw-r--r-- 1 user root 0 2024-10-30 16:48:30.019922944 +0100 dummy0
-rw-r--r-- 1 user root 0 2024-10-30 16:48:30.020185088 +0100 dummy6
-rw-r--r-- 1 user root 0 2024-10-30 16:48:30.020185088 +0100 dummy5
-rw-r--r-- 1 user root 0 2024-10-30 16:48:30.020185088 +0100 dummy4
-rw-r--r-- 1 user root 0 2024-10-30 16:48:30.020185088 +0100 dummy3
-rw-r--r-- 1 user root 0 2024-10-30 16:48:30.020185088 +0100 dummy2
-rw-r--r-- 1 user root 0 2024-10-30 16:48:30.020185088 +0100 dummy1
-rw-r--r-- 1 user root 0 2024-10-30 16:48:30.020447232 +0100 dummy9
-rw-r--r-- 1 user root 0 2024-10-30 16:48:30.020447232 +0100 dummy8
-rw-r--r-- 1 user root 0 2024-10-30 16:48:30.020447232 +0100 dummy7

You can see that dummy0 and dummy9, for example, have different time stamps.

Attachments (2)

fu.c (531 bytes ) - added by bhaible 40 hours ago.
test case fu.c
testdir2.tar.gz (178.8 KB ) - added by bhaible 38 hours ago.
test case for GNU tar

Download all attachments as: .zip

Change History (13)

by bhaible, 40 hours ago

Attachment: fu.c added

test case fu.c

comment:1 by waddlesplash, 39 hours ago

Blocking: 19212 added

comment:2 by waddlesplash, 39 hours ago

Blocking: 19215 added

comment:3 by waddlesplash, 39 hours ago

Blocking: 19214 added

comment:4 by waddlesplash, 39 hours ago

Component: System/POSIXFile Systems/BFS
Owner: changed from nobody to axeld

This is due to logic in BFS that assigns different nsec values when no nsecs are specified. If you specify a nonzero nsec value, then you should wind up with that value set exactly.

comment:5 by bhaible, 39 hours ago

Source code analysis:

utimensat is defined in src/system/libroot/posix/sys/utimes.c. The essential code for the modification time is:

                if (times[1].tv_nsec != UTIME_OMIT) {
                        mask |= B_STAT_MODIFICATION_TIME;

                        if (times[1].tv_nsec != UTIME_NOW) {
                                if (times[1].tv_nsec < 0 || times[1].tv_nsec > 999999999)
                                        RETURN_AND_SET_ERRNO(EINVAL);
                        }

                        stat.st_mtim = times[1];
                }
        ...
        status = _kern_write_stat(fd, path, (flag & AT_SYMLINK_NOFOLLOW) == 0,
                &stat, sizeof(struct stat), mask);

For the Be file system, bfs_write_stat is defined in src/add-ons/kernel/file_systems/bfs/kernel_interface.cpp. There the code for the modification time is:

                if (!inode->InLastModifiedIndex()) {
                        // directory modification times are not part of the index
                        node.last_modified_time
                                = HOST_ENDIAN_TO_BFS_INT64(bfs_inode::ToInode(stat->st_mtim));
                } else if (!inode->IsDeleted()) {
                        // Index::UpdateLastModified() will set the new time in the inode
                        Index index(volume);
                        index.UpdateLastModified(transaction, inode,
                                bfs_inode::ToInode(stat->st_mtim));
                }

bfs_inode::ToInode is defined in src/add-ons/kernel/file_systems/bfs/bfs.h:

        static int64 ToInode(bigtime_t time)
                { return ((time / 1000000) << INODE_TIME_SHIFT)
                        + unique_from_nsec((time % 1000000) * 1000); }
        static int64 ToInode(const timespec& tv)
                { return ((int64)tv.tv_sec << INODE_TIME_SHIFT)
                        + unique_from_nsec(tv.tv_nsec); }

and unique_from_nsec is in the same file:

/*!     Converts the nano seconds given to the internal 16 bit resolution that
        BFS uses. If \a time is zero, 12 bits will get a monotonically increasing
        number. For all other values, only the lower 4 bits are changed this way.

        This is done to decrease the number of duplicate time values, which speeds
        up the way BFS handles the time indices.
*/
inline uint32
unique_from_nsec(uint32 time)
{
        static vint32 number;
        if (time != 0)
                return (((time + 16383) >> 14) & INODE_TIME_MASK) | (++number & 0xf);

        return ++number & 0xfff;
}

This means that this file system intentionally prevents a program from creating 10 files with the same time stamp.

It intentionally prevents GNU tar (and unzip etc.) from extracting files with correct time stamps.

This is fundamentally broken.

Last edited 38 hours ago by bhaible (previous) (diff)

comment:6 by bhaible, 39 hours ago

A correct way to add information for speed-up is to do so in a field that is not visible through 'ls -l --full-time'. It could be a separate field. Or the file system could store a time stamp with picoseconds, in such a way that the nanoseconds are visible through system call and the sub-nanosecond part (picoseconds modulo 1000) are hidden from the POSIX APIs.

comment:7 by waddlesplash, 39 hours ago

Cc: axeld added

There appears to be some logic in here to try and avoid these randomization values showing back up in userland, but I suppose it only works when the time specified is nonzero in the first place.

We can't change the BFS on-disk format at this point, only what we store in it and how we process that information. So storing picoseconds won't be possible.

axeld, it looks like you introduced this code. Any ideas of how we might solve this?

by bhaible, 38 hours ago

Attachment: testdir2.tar.gz added

test case for GNU tar

comment:8 by bhaible, 38 hours ago

This started with a file 'testdir2.tar.gz', that contains a file named aclocal.m4 and a file named Makefile.in. In the tarball they have the same time stamp. When extracted on Haiku (Be FS), the file aclocal.m4 is newer than Makefile.in. The build then tries to rebuild Makefile.in, by invoking automake-1.17. This should not happen. This Makefile rule is not meant to be triggered if the tarball was simply extracted and no file was modified.

Last edited 38 hours ago by bhaible (previous) (diff)

comment:9 by bhaible, 38 hours ago

Any ideas of how we might solve this?

Change the stat, fstat, lstat etc. system calls so that the lowest 16 bits of the st_crtim, st_mtim, st_ctim are cleared. These are the only bits that are affected by the unique_from_nsec part of

        static int64 ToInode(const timespec& tv)
                { return ((int64)tv.tv_sec << INODE_TIME_SHIFT)
                        + unique_from_nsec(tv.tv_nsec); }
Version 0, edited 38 hours ago by bhaible (next)

comment:10 by waddlesplash, 38 hours ago

The lowest 16 bits are where the nsec/usec are stored, if we clear them then they'll always be 0. This isn't what we want.

It appears the low 4 bits are the part randomized when time is nonzero. Maybe we can get away with only randomizing the lowest 4 bits all the time?

comment:11 by bhaible, 37 hours ago

Are you willing to change the definition of the unique_from_nsec inline function?

If no, then clearing the lowest 16 bits is the only solution. This reduces the file system time stamp resolution to 1 second. Which is what the systems had in the 1980ies and 1990ies. It's not a high resolution, but it is POSIX compliant.

If yes, then you can decide how many of the bits represent actual time, and how many are randomization "for speeding up the time indices". For example 10 bits for time, and 6 bits of randomization. Clear the lowest 6 bits in the stat[at]() system calls. This would give a file system time stamp resolution to 1 millisecond. What is exactly the performance problem with the time indices, and how many bits are needed to mitigate it?

Note: See TracTickets for help on using tickets.