Skip to main content

Operating Systems · 45 min

Filesystems

How bytes on disk become named files: inodes, directories, allocation, page cache, journaling, and the durability contract behind fsync().

Why This Matters

A 4096-byte disk block can hold one page of file data, one small directory fragment, or 1024 four-byte block pointers. If a process writes 17 bytes to a file, the kernel usually copies those bytes into the page cache and returns before the SSD has made the update durable.

That gap is where data loss bugs live. The API looks like write(fd, buf, 17), but the durable object is a sequence of metadata and data writes ordered by the filesystem, the block layer, the device cache, and sometimes a journal. For programs that write checkpoints, model artifacts, SQLite databases, or logs, fsync() is the barrier that turns a buffered update into a crash-surviving update.

Core Definitions

Definition

File

A file is a persistent byte sequence plus metadata. In Unix-style filesystems the metadata is stored in an inode, and the bytes are reached through block pointers or an extent map from logical file offsets to physical storage blocks.

Definition

Inode

An inode is the filesystem object that stores type, permissions, owner IDs, size, timestamps, link count, and pointers to file contents. It does not store the filename. Names live in directories.

Definition

Directory

A directory is a file whose contents encode a mapping from names to inode numbers. Path lookup walks these mappings component by component, for example usr, then bin, then python3.

Definition

Durability barrier

A durability barrier is an operation after which the filesystem promises that selected data and metadata will survive a crash, subject to device honesty. In POSIX-like systems, the file-specific primitive is fsync(fd).

Files as Inodes Plus Blocks

A small Unix inode is a fixed-size metadata record. Real filesystems vary, but the following 128-byte teaching inode is close enough to expose the shape.

OffsetBytesField
02mode, including file type and permissions
22link count
44owner uid
84group gid
128file size in bytes
208access time
288modification time
368inode change time
444812 direct block numbers, 4 bytes each
924single-indirect block number
964double-indirect block number
10028filesystem-specific padding or flags

For a 10,000-byte regular file on a 4096-byte-block filesystem, the inode size is 10,000. The file needs ceil(10000 / 4096) = 3 data blocks. Logical byte offsets map as follows.

File byte rangeLogical blockBytes usedInode pointer
0 to 409504096direct[0]
4096 to 819114096direct[1]
8192 to 999921808direct[2]

With 12 direct pointers and 4 KiB blocks, direct pointers cover 12 * 4096 = 49152 bytes. A single-indirect block holding 1024 four-byte block numbers covers 1024 * 4096 = 4194304 bytes. A double-indirect pointer covers 1024 * 1024 * 4096 = 4294967296 bytes. The inode can address about 4 GiB plus 4 MiB plus 48 KiB in this simplified design.

Many modern filesystems store extents instead of one pointer per block. An extent says that logical blocks 100 through 163 map to 64 consecutive physical blocks starting at block 900000. Extents compress metadata for large contiguous files and reduce pointer-block reads.

Directories, Paths, Links, and Caches

A directory entry binds a byte string name to an inode number. A small directory might contain these records.

OffsetInodeRecord lengthName lengthName
02121.
122122..
2430511165train
4030512166config

Resolving /home/alex/run/train starts at the root inode. The kernel reads the root directory and finds home. It reads that inode, checks that it is a directory and that the caller has search permission, then looks up alex, then run, then train. The final answer is an inode plus open-file state such as current offset and flags.

A hard link is another directory entry pointing at the same inode. If a and b are hard links to inode 30511, both names refer to the same data blocks and the same metadata. Removing one name decrements the inode link count. Storage is freed only when the link count reaches zero and no process still has the file open.

A symbolic link is a distinct inode whose contents are a path string. If latest contains run-042, opening latest/weights.bin substitutes that path during lookup. Symlinks can cross filesystems because they name paths, not inode numbers. They can also dangle.

The dentry cache stores recent name-to-inode lookup results, including misses. If a training loop opens the same shard names repeatedly, lookup may avoid disk reads even when the file data is not cached. The inode cache stores inode metadata. The page cache stores file contents.

Block Allocation and Fragmentation

The allocator chooses physical blocks for new file data. Suppose a filesystem has 16 data blocks and each character below marks one block.

0 1 2 3 4 5 6 7 8 9 A B C D E F
. . X X . . . X . . X . . . . X

A four-block append can be placed at blocks 4,5,6,8, which fits but fragments the file. A contiguous choice at B,C,D,E gives better sequential read behavior. If a later file needs five blocks, that choice was too greedy.

Classical allocators use locality groups, cylinder groups, bitmaps, free lists, and preallocation. Extent filesystems try to allocate runs of blocks. Fragmentation has two costs. Sequential IO turns into multiple smaller IOs, and metadata grows because more extents or indirect blocks are needed.

Internal fragmentation is different. A 1-byte file still occupies at least one data block unless the filesystem stores tiny files inline in the inode or in a tail-packing area. With 4096-byte blocks, one million 1-byte files consume about 4 GiB of data-block space, before inode tables and directories.

On SSDs, logical block adjacency is not the same as physical flash adjacency. The flash translation layer remaps writes internally. Still, larger sequential requests reduce command overhead and give the device more room to schedule work.

Page Cache and Write-Back

The page cache is RAM owned by the kernel and indexed by (inode, page_index). Reads fill it. Writes usually modify it. The disk write happens later.

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    int fd = open("x.dat", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    char buf[4096];
    memset(buf, 'A', sizeof(buf));

    write(fd, buf, sizeof(buf));   /* copies into kernel page cache */
    close(fd);                     /* does not imply durable storage */
    return 0;
}

After write(), the process buffer may be reused. The kernel has copied the data. The page is now dirty. A background flusher, memory pressure, or explicit sync operation later sends it to storage. If power fails between write() and the actual device write, the new bytes can be lost.

mmap() uses the same cache. A store to a memory-mapped file dirties a cached page. msync() requests write-back for a mapping, but database engines often prefer pwrite() plus fsync() because it makes the durability boundary easier to audit.

The page cache also explains why repeated reads are fast. The first read of a 1 GiB file may issue storage IO. The second read may copy from RAM. Benchmark code that does not control cache state often measures memory bandwidth rather than filesystem throughput.

Durability With fsync, sync, and O_DSYNC

fsync(fd) asks the filesystem to make the file's dirty data and required metadata durable. For a newly created file, the directory entry is separate metadata. The common crash-safe replace pattern syncs both the temporary file and the containing directory.

#include <fcntl.h>
#include <unistd.h>
#include <string.h>

int write_replace(const char *dir, const char *tmp, const char *dst) {
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    int fd = openat(dfd, tmp, O_CREAT | O_TRUNC | O_WRONLY, 0644);

    const char payload[] = "epoch=17\nloss=0.834\n";
    if (write(fd, payload, strlen(payload)) < 0) return -1;
    if (fsync(fd) < 0) return -1;          /* file data and file metadata */
    if (close(fd) < 0) return -1;

    if (renameat(dfd, tmp, dfd, dst) < 0) return -1;
    if (fsync(dfd) < 0) return -1;         /* directory entry for dst */
    close(dfd);
    return 0;
}

sync() schedules write-back of all dirty filesystem data. It is system-wide and not a precise per-file commit point for application protocols. O_DSYNC makes each write wait for data and the metadata needed to retrieve it. It can be useful for append-only logs, but it changes every write's cost rather than placing barriers at selected transaction boundaries.

A storage device with a volatile write cache can still lie unless barriers, flush commands, or forced-unit-access writes are honored. Filesystems issue those commands when they implement fsync(), but the promise depends on the block layer and device behavior.

Crash Consistency Designs

A crash can split an operation. Imagine appending one block to a file. The filesystem must write the new data block, update the inode size, update a block bitmap, and perhaps update an indirect block. If the bitmap is durable but the inode is not, a block may be leaked. If the inode points to a block whose old contents remain, a reader may see stale data.

Journaling filesystems write a log record before modifying home locations. In metadata journaling, the journal might contain the new bitmap block, inode block, and directory block, but not the user data block. In data journaling, user data is journaled too, with higher write traffic. ext4 and NTFS use journaling, with details controlled by mount options and write ordering.

A simplified transaction is:

journal block 1000: BEGIN tx=77
journal block 1001: inode 30511 new size=8192
journal block 1002: bitmap block marks data block 9007 used
journal block 1003: COMMIT tx=77
home inode block: updated later
home bitmap block: updated later

After a crash, recovery scans the journal. If it sees BEGIN without COMMIT, it ignores the transaction. If it sees a committed transaction, it replays the logged metadata to the home blocks. Replaying twice must be harmless.

Copy-on-write filesystems, such as ZFS and btrfs, never overwrite some live metadata in place. They write new blocks, build new parent metadata pointing at them, and finally switch a root pointer. The tree before the switch remains a valid old version. The tree after the switch is the new version.

Log-structured filesystems write most updates sequentially to a log. Sprite LFS introduced this design for disks, and F2FS adapts it for flash storage. Cleaning reclaims old segments by copying live data elsewhere. The tradeoff is write coalescing against cleaner overhead.

VFS, Pseudo-Filesystems, and SSDs

The virtual filesystem layer gives user programs one API. open, read, write, rename, and fsync dispatch through per-filesystem operations. A process can call read() on ext4, XFS, tmpfs, procfs, or sysfs and still cross the same syscall boundary.

procfs and sysfs are not ordinary disk formats. Reading /proc/self/status asks kernel code to format process state as bytes. Writing to selected sysfs files invokes kernel handlers that change device or driver settings. The same pathname and file-descriptor interface is used, but persistence and durability are not the point.

SSDs add another layer. Flash pages are programmed in pages but erased in larger blocks. Overwriting a logical block usually means writing a new physical page and invalidating the old one. TRIM, exposed as discard in Linux, tells the device that logical blocks no longer contain live filesystem data. This gives the flash translation layer more freedom during garbage collection.

The Model

Two invariants are enough to debug many filesystem surprises.

First, a pathname is not a file identity. The identity is closer to (device, inode). A directory maps a name to that identity. rename() changes mappings atomically within one filesystem, while open file descriptors continue to refer to the old inode object.

Second, write() completion is not durability. A useful state machine for a newly written block is:

process buffer
  copy by write()
kernel page cache, dirty
  write-back submit
block device queue
  device accepts write
device nonvolatile media

The only application-level barrier in this chain is fsync(fd) for a file, plus fsync(dirfd) when the directory mapping itself must survive. close(fd) reports some delayed write errors on some systems, but it is not a durability primitive. sync() is too broad to mark one application transaction. O_DSYNC changes the write path, but each write still needs a carefully defined object whose durability matters.

For capacity planning, the addressability formula for the teaching inode is:

bytes=B(D+P+P2)\text{bytes} = B \cdot (D + P + P^2)

Here BB is block size, DD is the count of direct pointers, and P=B/4P = B / 4 for four-byte block numbers. With B=4096B = 4096 and D=12D = 12, this gives 4096 * (12 + 1024 + 1048576) = 4299210752 bytes.

Common Confusions

Watch Out

The filename is stored in the inode

The inode stores metadata and block mappings. The filename is stored in a directory entry. One inode can have several names through hard links, and an open file can keep existing after all names have been removed.

Watch Out

close() makes my write durable

close() releases a file descriptor. It can trigger write-back and can report an error, but portable crash recovery code must use fsync() before relying on durability. For create-and-rename protocols, sync the containing directory too.

Watch Out

A symlink is the same as a hard link

A hard link adds another directory entry to the same inode. A symlink is a separate file containing a path string. The symlink can cross a mount point and can point to a nonexistent target.

Watch Out

Journaling means user data is always safe

Metadata journaling protects filesystem structure. It does not always journal user data. After a crash, the file may have a valid size and block mapping while the last written bytes are old or zero-filled, depending on the mode and ordering.

Exercises

ExerciseCore

Problem

A teaching filesystem uses 4096-byte blocks, 10 direct pointers, one single-indirect pointer, and one double-indirect pointer. Block numbers are 4 bytes. Compute the maximum file size addressable by this inode layout.

ExerciseCore

Problem

A program creates model.tmp, writes 8192 bytes, calls rename("model.tmp", "model.bin"), and exits. No fsync() calls are made. List two crash outcomes allowed by the missing barriers. Then give a corrected sequence.

ExerciseAdvanced

Problem

A directory has two entries: a points to inode 70 and b points to inode 70. The inode link count is 2. A process opens a, then another process unlinks a and b. What happens to the inode and data blocks before and after the first process closes its file descriptor?

References

Canonical:

  • Arpaci-Dusseau and Arpaci-Dusseau, Operating Systems: Three Easy Pieces (2018), ch. 39-42 and ch. 48, file and directory abstractions, FFS, journaling, LFS
  • Silberschatz, Galvin, and Gagne, Operating System Concepts (10th ed., 2018), ch. 13-15, file-system interface, implementation, and internals
  • Tanenbaum and Bos, Modern Operating Systems (4th ed., 2015), ch. 4, files, directories, allocation, reliability, and performance
  • McKusick, Neville-Neil, and Watson, The Design and Implementation of the FreeBSD Operating System (2nd ed., 2014), ch. 8, local filesystems and vnode layer
  • Bach, The Design of the UNIX Operating System (1986), ch. 4-5, classic Unix inodes, buffers, and file-system calls
  • Love, Linux Kernel Development (3rd ed., 2010), ch. 12-13, VFS, page cache, and block IO

Accessible:

  • Arpaci-Dusseau and Arpaci-Dusseau, Operating Systems: Three Easy Pieces, open-access chapters on persistence
  • McKusick, "A Fast File System for UNIX", classic paper and talk material on locality and allocation
  • Linux man-pages project, open(2), fsync(2), rename(2), sync(2), and mount(8)

Next Topics

  • /computationpath/virtual-memory
  • /computationpath/storage-devices
  • /computationpath/databases-transactions
  • /computationpath/distributed-filesystems