I/O Devices

System Architecture

Screenshot 2023-06-11 11.44.51.png

The picture shows a single CPU attached to the main memory of the system via some kind of memory bus or interconnect. Some devices are connected to the system via a general I/O bus, which in many modern systems would be PCI (or one of its many derivatives); graphics and some other higher-performance I/O devices might be found here. Finally, even lower down are one or more peripheral buses, such as SCSI, SATA, or USB. These connect slow devices to the system, including disks, mice, and keyboards.

One question you might ask is: why do we need a hierarchical structure like this? Put simply: physics, and cost. The faster a bus is, the shorter it must be; thus, a high-performance memory bus does not have much room to plug devices and such into it. In addition, engineering a bus for high performance is quite costly. Thus, system designers have adopted this hierarchical approach, where components that demand high performance (such as the graphics card) are nearer the CPU. Lower performance components are further away. The benefits of placing disks and other slow devices on a peripheral bus are manifold; in particular, you can place a large number of devices on it.

Screenshot 2023-06-11 11.51.11.png

A Canonical Device

A device has two important components. The first is the hardware interface it presents to the rest of the system. Just like a piece of software, hardware must also present some kind of interface that allows the system software to control its operation. Thus, all devices have some specified interface and protocol for typical interaction.

The second part of any device is its internal structure. This part of the device is implementation specific and is responsible for implementing the abstraction the device presents to the system. Very simple devices will have one or a few hardware chips to implement their functionality; more complex devices will include a simple CPU, some general purpose memory, and other device-specific chips to get their job done.

Screenshot 2023-06-11 11.53.35.png

The Canonical Protocol

In the picture above, the (simplified) device interface consists of three registers:

a status register, which can be read to see the current status of the device;
a command register, to tell the device to perform a certain task;
a data register to pass data to the device, or get data from the device.

By reading and writing these registers, the operating system can control device behavior.

A typical interaction that the OS might have with the device:

While (STATUS == BUSY)
    ; // wait until device is not busy
Write data to DATA register
Write command to COMMAND register
    (starts the device and executes the command)
While (STATUS == BUSY)
    ; // wait until device is done with your request
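The protocol above can be sketched in C against a simulated device. This is only an illustration: the register layout, the STATUS_* values, and the busy_polls countdown are assumptions made up for the simulation, and a real driver would read and write hardware registers at addresses given by the device's documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { STATUS_READY = 0, STATUS_BUSY = 1 };   /* hypothetical status values */

/* Hypothetical register block of the canonical device. */
struct device {
    volatile uint8_t status;
    volatile uint8_t command;
    volatile uint8_t data;
    int busy_polls;   /* simulation only: STATUS reads left until the device is idle */
};

/* Simulation only: each read of STATUS lets the fake device make progress. */
static uint8_t read_status(struct device *dev) {
    if (dev->busy_polls > 0)
        dev->busy_polls--;
    else
        dev->status = STATUS_READY;
    return dev->status;
}

/* Simulation only: writing COMMAND keeps the device busy for three polls. */
static void write_command(struct device *dev, uint8_t cmd) {
    dev->command = cmd;
    dev->status = STATUS_BUSY;
    dev->busy_polls = 3;
}

/* Spin until the device is idle; returns how many times STATUS was read. */
static int wait_until_idle(struct device *dev) {
    int polls = 1;
    while (read_status(dev) == STATUS_BUSY)
        polls++;
    return polls;
}

/* Canonical protocol: poll, write DATA, write COMMAND, poll again.
   Returns the total number of STATUS reads, i.e. the CPU work spent polling. */
int device_request(struct device *dev, uint8_t data, uint8_t cmd) {
    assert(dev != NULL);
    int polls = wait_until_idle(dev);   /* wait until device is not busy */
    dev->data = data;                   /* write data to DATA register */
    write_command(dev, cmd);            /* write command; device starts working */
    polls += wait_until_idle(dev);      /* wait until the request is done */
    return polls;
}
```

The returned poll count makes the cost of this approach visible: every poll is CPU time spent waiting rather than running another process.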

Fitting Into The OS: The Device Driver

Every device has its own specific interface, yet we would like to keep most of the OS device-neutral. The problem is solved through the age-old technique of abstraction. At the lowest level, a piece of software in the OS must know in detail how a device works. We call this piece of software a device driver, and any specifics of device interaction are encapsulated within.

Screenshot 2023-06-11 12.04.30.png

Note that the encapsulation seen above can have its downside as well. For example, if there is a device that has many special capabilities, but has to present a generic interface to the rest of the kernel, those special capabilities will go unused.

Interestingly, because device drivers are needed for any device you might plug into your system, over time they have come to represent a huge percentage of kernel code. Studies of the Linux kernel reveal that over 70% of OS code is found in device drivers.

Hard Disk Drives

Latency

Single-track Latency: The Rotational Delay
Multiple Tracks: Seek Time
Data transfer time

T_{I/O} = T_{seek} + T_{rotation} + T_{transfer}
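A small worked example may make the formula concrete. The sketch below computes T_{I/O} and the resulting transfer rate for a hypothetical drive; the struct and function names are made up for illustration, the average rotational delay is assumed to be half a rotation, and 1 MB is taken as 1024 KB.

```c
#include <assert.h>

/* Illustrative drive parameters (assumptions, not measurements). */
struct disk {
    double avg_seek_ms;    /* average seek time in milliseconds */
    double rpm;            /* rotations per minute */
    double transfer_mb_s;  /* peak transfer rate in MB/s */
};

/* T_I/O = T_seek + T_rotation + T_transfer, in milliseconds. */
double io_time_ms(const struct disk *d, double request_kb) {
    assert(d->rpm > 0.0 && d->transfer_mb_s > 0.0);
    double rotation_ms = 0.5 * 60000.0 / d->rpm;   /* half a rotation on average */
    double transfer_ms = request_kb / 1024.0 / d->transfer_mb_s * 1000.0;
    return d->avg_seek_ms + rotation_ms + transfer_ms;
}

/* Effective rate: request size divided by T_I/O, in MB/s. */
double io_rate_mb_s(const struct disk *d, double request_kb) {
    return (request_kb / 1024.0) / (io_time_ms(d, request_kb) / 1000.0);
}
```

With a 4 ms seek, 15000 RPM, and 125 MB/s, a random 4 KB request takes about 6 ms, almost all of it seek and rotation, so the effective rate stays well under 1 MB/s. A single 100 MB sequential transfer on the same drive runs close to the peak rate.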

TIP: USE DISKS SEQUENTIALLY

When at all possible, transfer data to and from disks in a sequential manner. If sequential is not possible, at least think about transferring data in large chunks: the bigger, the better. If I/O is done in little random pieces, I/O performance will suffer dramatically.

Disk Scheduling

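Given a set of I/O requests, the disk scheduler examines the requests and decides which one to schedule next. Because the cost of a disk request (seek plus rotation) can be estimated reasonably well, the scheduler can try to follow the principle of SJF (shortest job first). Common policies:

SSTF: Shortest Seek Time First
Elevator (a.k.a. SCAN or C-SCAN)
SPTF: Shortest Positioning Time First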
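As one illustration of these policies, here is a minimal SSTF sketch. It assumes each request is identified only by its track number and uses track distance as a stand-in for seek time; the function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Pick the pending request whose track is closest to the head.
   Returns an index into tracks[], or -1 if nothing is pending. */
int sstf_pick(const int *tracks, const int *done, int n, int head) {
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (done[i])
            continue;
        if (best < 0 || abs(tracks[i] - head) < abs(tracks[best] - head))
            best = i;
    }
    return best;
}

/* Service all n requests in SSTF order, starting with the head at `head`.
   Stores the service order in order[] and returns the total distance
   travelled, in tracks. */
int sstf_schedule(const int *tracks, int n, int head, int *order) {
    assert(n >= 0 && order != NULL);
    int *done = calloc((size_t)(n > 0 ? n : 1), sizeof *done);
    assert(done != NULL);
    int total = 0;
    for (int k = 0; k < n; k++) {
        int i = sstf_pick(tracks, done, n, head);
        total += abs(tracks[i] - head);
        head = tracks[i];       /* the head is now at the serviced track */
        done[i] = 1;
        order[k] = tracks[i];
    }
    free(done);
    return total;
}
```

Because SSTF always services the nearest request, a steady stream of requests near the head can starve requests for distant tracks; the elevator family of policies addresses this by sweeping across the disk.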

Redundant Arrays of Inexpensive Disks (RAIDs)

RAID is a technique that uses multiple disks in concert to build a faster, bigger, and more reliable disk system.

Externally, a RAID looks like a disk: a group of blocks one can read or write. Internally, the RAID is a complex beast, consisting of multiple disks, memory (both volatile and non-), and one or more processors to manage the system.

RAIDs offer a number of advantages over a single disk.

One advantage is performance. Using multiple disks in parallel can greatly speed up I/O times.
Another benefit is capacity. Large data sets demand large disks.
Finally, RAIDs can improve reliability; spreading data across multiple disks (without RAID techniques) makes the data vulnerable to the loss of a single disk; with some form of redundancy, RAIDs can tolerate the loss of a disk and keep operating as if nothing were wrong.

Types:

RAID Level 0: Striping
RAID Level 1: Mirroring
RAID Level 4: Saving Space With Parity
RAID Level 5: Rotating Parity

Screenshot 2023-06-11 12.56.14.png
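Two of the ideas behind these levels can be sketched briefly: how striping maps a logical block to a disk, and how XOR parity allows a lost block to be rebuilt. The sketch assumes a chunk size of one block and uses hypothetical names; real arrays differ in chunk size and parity placement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Where a logical block lives in a striped (RAID-0) array. */
struct location {
    int disk;      /* which disk holds the block */
    long offset;   /* block number within that disk */
};

/* Striping with a chunk size of one block (an assumption). */
struct location raid0_map(long logical_block, int ndisks) {
    assert(ndisks > 0 && logical_block >= 0);
    struct location loc = { (int)(logical_block % ndisks), logical_block / ndisks };
    return loc;
}

/* XOR nblocks blocks of len bytes each into out.
   Given the data blocks of a stripe this computes the parity block; given
   the parity block and the surviving data blocks it rebuilds the lost one. */
void xor_blocks(uint8_t *out, const uint8_t *const *blocks, int nblocks, size_t len) {
    for (size_t i = 0; i < len; i++) {
        uint8_t x = 0;
        for (int b = 0; b < nblocks; b++)
            x ^= blocks[b][i];
        out[i] = x;
    }
}
```

The same XOR routine serves both directions because XOR-ing a value in twice cancels it out, which is why one parity block per stripe is enough to tolerate the loss of a single disk.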

Interlude: Files and Directories

In this section, we add one more critical piece to the virtualization puzzle: persistent storage. A persistent-storage device, such as a classic hard disk drive or a more modern solid-state storage device, stores information permanently (or at least, for a long time).

Files And Directories

A file is simply a linear array of bytes, each of which you can read or write. Each file has some kind of low-level name, usually a number of some kind; for historical reasons, the low-level name of a file is often referred to as its inode number.

In most systems, the OS does not know much about the structure of the file (e.g., whether it is a picture, or a text file, or C code); rather, the responsibility of the file system is simply to store such data persistently on disk and make sure that when you request the data again, you get what you put there in the first place. Doing so is not as simple as it seems!

The second abstraction is that of a directory. A directory, like a file, also has a low-level name (i.e., an inode number), but its contents are quite specific: it contains a list of (user-readable name, low-level name) pairs.

The File System Interface

create or open

This can be accomplished with the open system call; by calling open() and passing it the O_CREAT flag, a program can create a new file.

int fd = open("foo", O_CREAT|O_WRONLY|O_TRUNC, S_IRUSR|S_IWUSR);

The strace tool (dtruss on a Mac) traces every system call a program makes while it runs and prints the trace to the screen, which makes it easy to see how the program interacts with the operating system.

strace accepts arguments that customize its behavior. Some useful ones:

-f: follow any child processes created by the traced program and trace their system calls as well.
-t: prefix each system call with the time of day.
-e trace=open,close,read,write: trace only the listed system calls (open, close, read, write) and ignore all others.

Many more flags are available; see the man page (man strace) for the full list.

read or write

cat makes a convenient example because it first reads the file and then writes its contents to standard output.

prompt> strace cat foo
-----------------------------
open("foo", O_RDONLY|O_LARGEFILE) = 3
read(3, "hello\n", 4096) = 6
write(1, "hello\n", 6) = 6
hello
read(3, "", 4096) = 0
close(3) = 0
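The trace above corresponds to a simple read/write loop. A minimal sketch of such a loop follows; cat_file is a hypothetical helper written for these notes, not the actual source of cat.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Copy the file at `path` to the open descriptor `out_fd`.
   Returns the number of bytes copied, or -1 on error. */
long cat_file(const char *path, int out_fd) {
    assert(path != NULL);
    char buf[4096];
    long total = 0;
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {   /* read() returns 0 at end of file */
        ssize_t off = 0;
        while (off < n) {                           /* write() may write fewer bytes than asked */
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                close(fd);
                return -1;
            }
            off += w;
        }
        total += n;
    }
    close(fd);
    return n < 0 ? -1 : total;
}
```

Calling cat_file("foo", STDOUT_FILENO) would produce the same pattern of calls as the trace: one open, a read that returns data, a write to descriptor 1, a final read that returns 0, and a close.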

lseek

Sometimes, however, it is useful to be able to read or write to a specific offset within a file; for example, if you build an index over a text document and use it to look up a specific word, you may end up reading from random offsets within the document. To do so, we use the lseek() system call.

off_t lseek(int fildes, off_t offset, int whence);

For each file a process opens, the OS tracks a “current” offset, which determines where the next read or write will begin reading from or writing to within the file. Thus, part of the abstraction of an open file is that it has a current offset, which is updated in one of two ways.

The first is when a read or write of N bytes takes place, N is added to the current offset; thus each read or write implicitly updates the offset.

The second is explicitly with lseek, which changes the offset as specified above. The offset, as you might have guessed, is kept in that struct file we saw earlier, as referenced from the struct proc.
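Both ways of updating the offset can be observed from user space. The sketch below (read_at is a hypothetical helper) positions the offset explicitly with lseek(), lets read() advance it implicitly, and then queries it with lseek(fd, 0, SEEK_CUR).

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Read up to `len` bytes starting at byte `offset` of the file at `path`.
   Returns the number of bytes read, or -1 on error. */
ssize_t read_at(const char *path, off_t offset, char *buf, size_t len) {
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1) {   /* explicit update */
        close(fd);
        return -1;
    }
    ssize_t n = read(fd, buf, len);                   /* implicit update: offset += n */
    if (n >= 0) {
        off_t now = lseek(fd, 0, SEEK_CUR);           /* query the offset without moving it */
        assert(now == offset + n);
    }
    close(fd);
    return n;
}
```

Note that lseek() only changes a variable in OS memory; it does not by itself move the disk arm. A disk seek happens only if a later read or write touches a different track than the previous request.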

struct file:

struct file {
    int ref;
    char readable;
    char writable;
    struct inode *ip;
    uint off;
};

File structures represent all of the currently opened files in the system; together, they are sometimes referred to as the open file table. The xv6 kernel keeps these in an array as well, guarded by a single lock for the whole table:

struct {
  struct spinlock lock;
  struct file file[NFILE];
} ftable;

Opening the same file twice creates two distinct entries in the open file table, each with its own current offset!
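This independence is easy to check. The sketch below opens one file twice, reads through the first descriptor only, and reports both offsets; two_opens is a hypothetical helper.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Open `path` twice, read `nbytes` through the first descriptor, and report
   the current offset of each descriptor. Returns 0 on success, -1 on error. */
int two_opens(const char *path, size_t nbytes, off_t *off1, off_t *off2) {
    char buf[64];
    assert(nbytes <= sizeof buf);
    int fd1 = open(path, O_RDONLY);
    int fd2 = open(path, O_RDONLY);
    int rc = -1;
    if (fd1 >= 0 && fd2 >= 0 && read(fd1, buf, nbytes) >= 0) {
        *off1 = lseek(fd1, 0, SEEK_CUR);   /* advanced by the read */
        *off2 = lseek(fd2, 0, SEEK_CUR);   /* untouched: separate open file table entry */
        rc = 0;
    }
    if (fd1 >= 0)
        close(fd1);
    if (fd2 >= 0)
        close(fd2);
    return rc;
}
```

After reading 4 bytes through the first descriptor of a file that is at least 4 bytes long, the first offset is 4 while the second is still 0.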

fork() And dup()

Shared File Table Entries

Screenshot 2023-06-11 13.19.58.png
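Sharing can be observed through the file offset: a descriptor created by dup(), or inherited across fork(), refers to the same open file table entry as the original, so a seek through one is visible through the other. A minimal sketch with hypothetical helper names:

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Seek through a dup()'d descriptor, then read the offset via the original.
   Returns the offset seen through the original descriptor, or -1 on error. */
off_t offset_after_dup_seek(const char *path, off_t target) {
    assert(target >= 0);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int fd2 = dup(fd);                     /* second descriptor, same table entry */
    if (fd2 < 0) {
        close(fd);
        return -1;
    }
    lseek(fd2, target, SEEK_SET);
    off_t seen = lseek(fd, 0, SEEK_CUR);
    close(fd2);
    close(fd);
    return seen;
}

/* Let a forked child seek, then read the offset in the parent.
   Returns the offset seen by the parent, or -1 on error. */
off_t offset_after_child_seek(const char *path, off_t target) {
    assert(target >= 0);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(fd);
        return -1;
    }
    if (pid == 0) {                        /* child: move the shared offset and exit */
        lseek(fd, target, SEEK_SET);
        _exit(0);
    }
    waitpid(pid, NULL, 0);                 /* parent: wait for the child to finish */
    off_t seen = lseek(fd, 0, SEEK_CUR);
    close(fd);
    return seen;
}
```

Both functions return the target offset, because in each case the two descriptors share one table entry. The entry is removed only when its reference count drops to zero, that is, once every descriptor referring to it has been closed.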

fsync()

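By default, write() only buffers data in memory, and the file system writes it to disk some time later. Calling fsync(fd) forces all dirty (not yet written) data for the file referred to by fd to disk and returns once those writes are complete.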

Renaming Files

Using strace, we can see that mv uses the system call rename(char *old, char *new), which takes precisely two arguments: the original name of the file (old) and the new name (new).

prompt> mv foo bar

One interesting guarantee provided by the rename() call is that it is (usually) implemented as an atomic call with respect to system crashes; if the system crashes during the renaming, the file will either be named the old name or the new name, and no odd in-between state can arise.

This guarantee makes atomic file updates possible. For example, an editor that changes foo.txt might write the new version to a temporary file, force it to disk with fsync(), and then rename the temporary file over the original:

int fd = open("foo.txt.tmp", O_WRONLY|O_CREAT|O_TRUNC, S_IRUSR|S_IWUSR);
write(fd, buffer, size); // write out new version of file
fsync(fd);
close(fd);
rename("foo.txt.tmp", "foo.txt");

The last step atomically swaps the new file into place, while concurrently deleting the old version of the file, and thus an atomic file update is achieved.
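The write-then-rename pattern can be wrapped in a function with error checks. This is a sketch under assumptions: atomic_update is a hypothetical helper, and the ".tmp" suffix and the 4096-byte path buffer are arbitrary choices.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Replace the contents of `path` with buf[0..size) so that a crash leaves
   either the old version or the new one. Returns 0 on success, -1 on error. */
int atomic_update(const char *path, const void *buf, size_t size) {
    assert(path != NULL && (buf != NULL || size == 0));
    char tmp[4096];
    int len = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (len < 0 || (size_t)len >= sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd < 0)
        return -1;

    const char *p = buf;
    size_t left = size;
    while (left > 0) {                      /* write() may be partial */
        ssize_t w = write(fd, p, left);
        if (w < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        p += w;
        left -= (size_t)w;
    }
    if (fsync(fd) < 0) {                    /* force the new contents to disk first */
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp);
        return -1;
    }
    if (rename(tmp, path) < 0) {            /* atomically swap the new file into place */
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

The order matters: the data must be on disk before the rename, otherwise a crash could leave the new name pointing at incomplete contents. Depending on the file system, making the rename itself durable may also require an fsync() on the containing directory, which this sketch omits.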

Getting Information About Files

Beyond file access, we expect the file system to keep a fair amount of information about each file it is storing. We generally call such data about files metadata. To see the metadata for a certain file, we can use the stat() or fstat() system calls. These calls take a pathname (or file descriptor) to a file and fill in a stat structure.

struct stat {
  dev_t st_dev; // ID of device containing file
  ino_t st_ino; // inode number
  mode_t st_mode; // protection
  nlink_t st_nlink; // number of hard links
  uid_t st_uid; // user ID of owner
  gid_t st_gid; // group ID of owner
  dev_t st_rdev; // device ID (if special file)
  off_t st_size; // total size, in bytes
  blksize_t st_blksize; // blocksize for filesystem I/O
  blkcnt_t st_blocks; // number of blocks allocated
  time_t st_atime; // time of last access
  time_t st_mtime; // time of last modification
  time_t st_ctime; // time of last status change
};
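A short sketch of how a program might fill in and print a few of these fields; print_metadata is a hypothetical helper, and the casts are there because the exact integer types behind ino_t, off_t, and friends vary between systems.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std= modes */
#include <assert.h>
#include <stdio.h>
#include <sys/stat.h>

/* Print selected metadata for `path`. Returns 0 on success, -1 on error. */
int print_metadata(const char *path) {
    assert(path != NULL);
    struct stat st;
    if (stat(path, &st) < 0) {
        perror("stat");
        return -1;
    }
    printf("inode:  %llu\n", (unsigned long long)st.st_ino);
    printf("size:   %lld bytes\n", (long long)st.st_size);
    printf("links:  %lu\n", (unsigned long)st.st_nlink);
    printf("blocks: %lld\n", (long long)st.st_blocks);
    printf("mode:   %o\n", (unsigned)(st.st_mode & 07777));   /* permission bits only */
    printf("type:   %s\n", S_ISDIR(st.st_mode) ? "directory"
                          : S_ISREG(st.st_mode) ? "regular file"
                          : "other");
    return 0;
}
```

The command-line tool stat prints the same kind of information for a file.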

Removing Files

prompt> strace rm foo
---------------------
unlink("foo")

Making Directories

To create a directory, a single system call, mkdir(), is available.

Reading Directories

opendir(), readdir(), and closedir()

directory entry struct:

struct dirent {
  char d_name[256]; // filename
  ino_t d_ino; // inode number
  off_t d_off; // offset to the next dirent
  unsigned short d_reclen; // length of this record
  unsigned char d_type; // type of file
};

Because directories are light on information (basically, just mapping the name to the inode number, along with a few other details), a program may want to call stat() on each file to get more information on each, such as its length or other detailed information. Indeed, this is exactly what ls does when you pass it the -l flag; try strace on ls with and without that flag to see for yourself.
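The combination of readdir() and stat() can be sketched as follows. list_dir is a hypothetical helper that prints the inode number, size, and name of each entry; the 4096-byte path buffer is an arbitrary limit.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std= modes */
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <sys/stat.h>

/* Print "inode size name" for every entry in `dirpath`.
   Returns the number of entries seen, or -1 if the directory cannot be opened. */
int list_dir(const char *dirpath) {
    assert(dirpath != NULL);
    DIR *dp = opendir(dirpath);
    if (dp == NULL)
        return -1;
    int count = 0;
    struct dirent *d;
    while ((d = readdir(dp)) != NULL) {
        char full[4096];
        struct stat st;
        long long size = -1;                /* printed as -1 if stat() fails */
        int len = snprintf(full, sizeof full, "%s/%s", dirpath, d->d_name);
        if (len > 0 && (size_t)len < sizeof full && stat(full, &st) == 0)
            size = (long long)st.st_size;   /* the extra detail the directory lacks */
        printf("%llu %lld %s\n", (unsigned long long)d->d_ino, size, d->d_name);
        count++;
    }
    closedir(dp);
    return count;
}
```

The directory entry alone supplies the name and inode number; the size comes from the extra stat() call on each entry.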

Deleting Directories

rmdir()

Links

Hard Links

A hard link is another directory entry (name) that refers to the same inode.

When you create a file, you are really doing two things.

First, you are making a structure (the inode) that will track virtually all relevant information about the file, including its size, where its blocks are on disk, and so forth.

Second, you are linking a human-readable name to that file, and putting that link into a directory.

After a hard link to a file is created, the file system sees no difference between the original file name (file) and the newly created file name (file2); indeed, they are both just links to the underlying metadata about the file.

ln file file2

The link() system call takes two arguments, an old pathname and a new one; when you “link” a new file name to an old one, you essentially create another way to refer to the same file.

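Thus, to remove a file from the file system, we call unlink(). When the file system unlinks a file, it checks a reference count (the link count) within the inode; only when that count reaches zero does the file system free the inode and its data blocks, truly deleting the file. You can see the reference count of a file using stat().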
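The link count can be watched as names are added and removed. In this sketch, link_count and hard_link_demo are hypothetical helpers; the demo creates a file, adds a second name with link(), removes the first name with unlink(), and records st_nlink after each step.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std= modes */
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Return the link count of `path`, or -1 if stat() fails. */
long link_count(const char *path) {
    assert(path != NULL);
    struct stat st;
    if (stat(path, &st) < 0)
        return -1;
    return (long)st.st_nlink;
}

/* Create `path`, hard-link it as `alias`, then unlink `path`.
   counts[] receives the link count after each of the three steps.
   Returns 0 on success, -1 on error. */
int hard_link_demo(const char *path, const char *alias, long counts[3]) {
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fd < 0)
        return -1;
    close(fd);
    counts[0] = link_count(path);    /* one name refers to the inode */
    if (link(path, alias) < 0)
        return -1;
    counts[1] = link_count(path);    /* two names, same inode */
    if (unlink(path) < 0)
        return -1;
    counts[2] = link_count(alias);   /* the inode survives under the remaining name */
    return 0;
}
```

For a freshly created file the recorded counts are 1, 2, and 1. The file's data stays reachable through the alias until that last name is unlinked as well.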

Symbolic Links

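A symbolic (soft) link is a file in its own right whose contents are the pathname of the linked-to file, much like a shortcut. Unlike a hard link, it can be left dangling if the target is removed.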

prompt> echo hello > file
prompt> ln -s file file2
prompt> cat file2
hello
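Programmatically, the same link can be created with symlink() and inspected with readlink() and lstat(). The helper names below are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L  /* request POSIX declarations under strict -std= modes */
#include <assert.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create the symbolic link `linkpath` pointing at `target`, then read the
   stored pathname back into buf. Returns its length, or -1 on error. */
ssize_t make_and_read_symlink(const char *target, const char *linkpath,
                              char *buf, size_t bufsize) {
    assert(buf != NULL && bufsize > 0);
    if (symlink(target, linkpath) < 0)
        return -1;
    ssize_t n = readlink(linkpath, buf, bufsize - 1);
    if (n >= 0)
        buf[n] = '\0';               /* readlink() does not NUL-terminate */
    return n;
}

/* Return 1 if `path` is itself a symbolic link, 0 otherwise.
   lstat() examines the link; stat() would follow it to the target. */
int is_symlink(const char *path) {
    struct stat st;
    return lstat(path, &st) == 0 && S_ISLNK(st.st_mode);
}
```

symlink() does not check that the target exists, so the link is just a stored pathname: creating a link to "file" succeeds even when no such file is present, and readlink() then returns the 4 bytes of that name.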

Making And Mounting A File System

mkfs

To make a file system, most file systems provide a tool, usually referred to as mkfs (pronounced ‘make fs’), that performs exactly this task.

The idea is as follows: give the tool, as input, a device (such as a disk partition, e.g., /dev/sda1) and a file system type (e.g., ext3), and it simply writes an empty file system, starting with a root directory, onto that disk partition. And mkfs said, let there be a file system!

mount

Once such a file system is created, it needs to be made accessible within the uniform file-system tree. This task is achieved via the mount program (which makes the underlying system call mount() to do the real work).

What mount does, quite simply, is take an existing directory as a target mount point and essentially paste a new file system onto the directory tree at that point.

Run mount with no arguments to see what is mounted on the system; the output typically shows a whole number of different file systems.


OSTEP-notes-persisitence-chap35-to-39