File System Implementation
To think about file systems, we usually suggest thinking about two different aspects of them.
The first is the data structures of the file system. The second aspect of a file system is its access methods
- S, superblock, contains information about this particular file system
- i, inode bitmap
- d, data bitmap
- I, inode table, metadata
- D, data
Truths of FS:
- Most files are small
- Average file size is growing
- Most bytes are stored in large files
- File systems contains lots of files
- File systems are roughly half full
- Directories are typically small
A directory basically just contains a list of (entry name, inode number) pairs.
Often, file systems treat directories as a special type of file. Thus, a directory has an inode, somewhere in the inode table
Access Paths: Reading and Writing
Reading and writing files can be expensive, incurring many I/Os to the (slow) disk. To remedy what would clearly be a huge performance problem, most file systems aggressively use system memory (DRAM) to cache important blocks.
Write buffering, by delaying writes, the file system can batch some updates into a smaller set of I/Os.
Locality and The Fast File System
Simple FS pain point: fragmentation
Fast File System (FFS) design the file system structures and allocation policies to be “disk aware” and thus improve performance
FFS divides the disk into a number of cylinder groups and places files in the same group based on locality (say, in the same directory) which may be calculated using algorithm
If the chunk size is large enough, the file system will spend most of its time transferring data from disk and just a (relatively) little time seeking between chunks of the block. This process of reducing an overhead by doing more work per overhead paid is called amortization and is a common technique in computer systems.
Crash Consistency: FSCK and Journaling
How to deal with data inconsistency when crashing?
- FSCK (file system checker, validate FS data based on certain rules)
- Journaling (also known as write-ahead logging)
Just like database transaction
transaction begin → inode logging →bitmap logging → data block logging → transaction end
- Journal write: Write the contents of the transaction (containing TxB and the contents of the update) to the log; wait for these writes to complete.
- Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete; the transaction is now committed.
- Checkpoint: Write the contents of the update to their final locations within the file system.
- Free: Some time later, mark the transaction free in the journal by updating the journal superblock.
A simpler (and more common) form of journaling is sometimes called ordered journaling (or just metadata journaling), and it is nearly the same, except that user data is not written to the journal.
- Data write: Write data to the final location; wait for completion (the wait is optional; see below for details).
- Journal metadata write: Write the begin block and metadata to the log; wait for writes to complete.
- Journal commit: Write the transaction commit block (containing TxE) to the log; wait for the write to complete; the transaction (including data) is now committed.
- Checkpoint metadata: Write the contents of the metadata update to their final locations within the file system.
- Free: Later, mark the transaction free in the journal superblock.
Copy-on-write (COW) never overwrites files or directories in place; rather, it places new updates to previously unused locations on disk.
Log-structured File Systems
This basic idea, of simply writing all updates (such as data blocks, inodes, etc.) to the disk sequentially, sits at the heart of LFS. The large chunk of updates LFS writes at one time is referred to by the name of a segment.
- CR checkpoint region contains pointers to imap
- imap inodes map maps inode to its address
- inode points to data blocks
Dir works the same way as other FS, just (name,inode) pair
LFS introduces a new approach to updating the disk. Instead of overwriting files in places, LFS always writes to an unused portion of the disk, and then later reclaims that old space through cleaning. This approach, which in database systems is called shadow paging and in file-system-speak is sometimes called copy-on-write, enables highly efficient writing, as LFS can gather all updates into an in-memory segment and then write them out together sequentially.
solid-state storage device, flash-based
A flash chip consists of many banks, each of which is organized into erase blocks (sometimes just called blocks). Each block is further subdivided into some number of pages. Blocks are large (128KB–2MB) and contain many pages, which are relatively small (1KB–8KB).
Basic Flash Operations: The read (a page) command is used to read a page from the flash; erase (a block) and program (a page) are used in tandem to write
Wear out: after a limited number of erase and program, ssd cell may be broken
Wear leveling: load balance write requests to all flash blocks so they have roughly same life span
Data Integrity and Protection
Detecting Corruption: The Checksum
The primary mechanism used by modern storage systems to preserve data integrity is called the checksum. A checksum is simply the result of a function that takes a chunk of data (say a 4KB block) as input and computes a function over said data, producing a small summary of the contents of the data (say 4 or 8 bytes) . This summary is referred to as the checksum.
Common functions: XOR-based, addition, cyclic redundancy check (CRC, binary modulo operation)