File System Implementation

To think about file systems, we usually suggest thinking about two different aspects of them.

The first is the data structures of the file system. The second aspect of a file system is its access methods

Overall Organization

截屏2023-06-18 15.49.37.png

S, superblock, contains information about this particular file system
i, inode bitmap
d, data bitmap
I, inode table, metadata
D, data

截屏2023-06-18 15.53.42.png

Truths of FS:

Most files are small
Average file size is growing
Most bytes are stored in large files
File systems contains lots of files
File systems are roughly half full
Directories are typically small

Directory Organization

A directory basically just contains a list of (entry name, inode number) pairs.

Often, file systems treat directories as a special type of file. Thus, a directory has an inode, somewhere in the inode table

Access Paths: Reading and Writing

截屏2023-06-18 15.59.42.png

截屏2023-06-18 16.00.25.png

Reading and writing files can be expensive, incurring many I/Os to the (slow) disk. To remedy what would clearly be a huge performance problem, most file systems aggressively use system memory (DRAM) to cache important blocks.

Write buffering, by delaying writes, the file system can batch some updates into a smaller set of I/Os.

Locality and The Fast File System

Simple FS pain point: fragmentation

Fast File System (FFS) design the file system structures and allocation policies to be “disk aware” and thus improve performance

FFS divides the disk into a number of cylinder groups and places files in the same group based on locality (say, in the same directory) which may be calculated using algorithm

If the chunk size is large enough, the file system will spend most of its time transferring data from disk and just a (relatively) little time seeking between chunks of the block. This process of reducing an overhead by doing more work per overhead paid is called amortization and is a common technique in computer systems.

Crash Consistency: FSCK and Journaling

How to deal with data inconsistency when crashing?

FSCK (file system checker, validate FS data based on certain rules)
Journaling (also known as write-ahead logging)

Data Journaling

Just like database transaction

transaction begin → inode logging →bitmap logging → data block logging → transaction end

截屏2023-06-18 16.25.32.png

Journal write: Write the contents of the transaction (containing TxB and the contents of the update) to the log; wait for these writes to complete.
Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete; the transaction is now committed.
Checkpoint: Write the contents of the update to their final locations within the file system.
Free: Some time later, mark the transaction free in the journal by updating the journal superblock.

Metadata Journaling

A simpler (and more common) form of journaling is sometimes called ordered journaling (or just metadata journaling), and it is nearly the same, except that user data is not written to the journal.

Data write: Write data to the final location; wait for completion (the wait is optional; see below for details).
Journal metadata write: Write the begin block and metadata to the log; wait for writes to complete.
Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete; the transaction (including data) is now committed.
Checkpoint metadata: Write the contents of the metadata update to their final locations within the file system.
Free: Later, mark the transaction free in the journal superblock.

Other Approaches

Copy-on-write (COW) never overwrites files or directories in place; rather, it places new updates to previously unused locations on disk.

Log-structured File Systems

This basic idea, of simply writing all updates (such as data blocks, inodes, etc.) to the disk sequentially, sits at the heart of LFS. The large chunk of updates LFS writes at one time is referred to by the name of a segment.

CR checkpoint region contains pointers to imap
imap inodes map maps inode to its address
inode points to data blocks

截屏2023-06-18 16.37.43.png

Dir works the same way as other FS, just (name,inode) pair

截屏2023-06-18 16.37.54.png

LFS introduces a new approach to updating the disk. Instead of overwriting files in places, LFS always writes to an unused portion of the disk, and then later reclaims that old space through cleaning. This approach, which in database systems is called shadow paging and in file-system-speak is sometimes called copy-on-write, enables highly efficient writing, as LFS can gather all updates into an in-memory segment and then write them out together sequentially.

Flash-based SSDs

solid-state storage device, flash-based

A flash chip consists of many banks, each of which is organized into erase blocks (sometimes just called blocks). Each block is further subdivided into some number of pages. Blocks are large (128KB–2MB) and contain many pages, which are relatively small (1KB–8KB).

Basic Flash Operations: The read (a page) command is used to read a page from the flash; erase (a block) and program (a page) are used in tandem to write

截屏2023-06-18 16.52.14.png

Wear out: after a limited number of erase and program, ssd cell may be broken

Wear leveling: load balance write requests to all flash blocks so they have roughly same life span

Data Integrity and Protection

Detecting Corruption: The Checksum

The primary mechanism used by modern storage systems to preserve data integrity is called the checksum. A checksum is simply the result of a function that takes a chunk of data (say a 4KB block) as input and computes a function over said data, producing a small summary of the contents of the data (say 4 or 8 bytes) . This summary is referred to as the checksum.

Common functions: XOR-based, addition, cyclic redundancy check (CRC, binary modulo operation)

目录CONTENT

OSTEP-notes-persisitence-chap40-to-46