Designing Data-Intensive Applications (DDIA) is my book of 2023. I originally chose it simply because it was the subject of our internal book-sharing session, and it is also one of the most famous books in the industry. 2023 was a year of pain and growth for me as a new graduate. Having gone through times of doubt, both personal and professional, I now come back fresh and strong.
My personal plan for 2024 is:
- In the first half of 2024, I will focus on reading. The key objective is finishing all the suggested CS courses; databases and distributed systems are the major ones left. Maybe I'll play with some toy projects if I have time left.
- The second half will be about investigating and applying some popular technologies. The key objective is building something interesting and new on my own and sharing it on GitHub.
To begin with, let me summarize what I have learned in DDIA.
CHAPTER 1 Reliable, Scalable, and Maintainable Applications
Many applications today are data-intensive, as opposed to compute-intensive. Raw CPU power is rarely a limiting factor for these applications—bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing.
In this book, we focus on three concerns that are important in most software systems:
- Reliability: The system should continue to work correctly (performing the correct function at the desired level of performance) even in the face of adversity (hardware or software faults, and even human error).
- Scalability: As the system grows (in data volume, traffic volume, or complexity), there should be reasonable ways of dealing with that growth.
First, we need to succinctly describe the current load on the system. Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system.
Once you have described the load on your system, you can investigate what happens when the load increases. You can look at it in two ways:
- When you increase a load parameter and keep the system resources (CPU, memory, network bandwidth, etc.) unchanged, how is the performance of your system affected?
- When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
In a batch processing system such as Hadoop, we usually care about throughput—the number of records we can process per second, or the total time it takes to run a job on a dataset of a certain size.
In online systems, what’s usually more important is the service’s response time—that is, the time between a client sending a request and receiving a response.
Usually it is better to use percentiles. In order to figure out how bad your outliers are, you can look at higher percentiles: the 95th, 99th, and 99.9th percentiles are common (abbreviated p95, p99, and p999). They are the response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular threshold. Percentiles are often used in service level objectives (SLOs) and service level agreements (SLAs), contracts that define the expected performance and availability of a service.
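As a quick sketch of how these percentiles are computed from measured response times (using the nearest-rank method; the response times below are made-up numbers, and a real monitoring system would use a streaming approximation such as a histogram rather than sorting everything):

```python
def percentile(response_times_ms, p):
    """Nearest-rank percentile: the smallest response time such that
    at least p% of all requests are at or below it."""
    ordered = sorted(response_times_ms)
    # 1-based rank ceil(p/100 * n), computed with ceiling division.
    rank = max(1, -(-len(ordered) * p // 100))
    return ordered[rank - 1]

# Hypothetical response times for 20 requests, in milliseconds.
# Note the two outliers at the end dominating the high percentiles.
times = [12, 14, 15, 15, 16, 17, 18, 18, 19, 20,
         21, 22, 23, 25, 28, 30, 35, 45, 120, 1500]

print(percentile(times, 50))  # 20   (median: a typical request)
print(percentile(times, 95))  # 120  (p95: the outliers start to show)
print(percentile(times, 99))  # 1500 (p99: dominated by the worst request)
```

This also shows why averages are misleading here: the mean of these times is pulled far above the median by the two slow requests, while p95 and p99 tell you directly what your slowest users experience.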
Approaches for Coping with Load
People often talk of a dichotomy between scaling up (vertical scaling, moving to a more powerful machine) and scaling out (horizontal scaling, distributing the load across multiple smaller machines). Distributing load across multiple machines is also known as a shared-nothing architecture.
Some systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises.
Over time, many different people will work on the system (engineering and operations, both maintaining current behavior and adapting the system to new use cases), and they should all be able to work on it productively.
It is well known that the majority of the cost of software is not in its initial development, but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures, adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding new features.
DDIA gives three design principles for software systems:
Operability, Making Life Easy for Operations
Make it easy for operations teams to keep the system running smoothly. Data systems can do various things to make routine tasks easy, including:
- Providing visibility into the runtime behavior and internals of the system, with good monitoring
- Providing good support for automation and integration with standard tools
- Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted)
- Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
- Providing good default behavior, but also giving administrators the freedom to override defaults when needed
- Self-healing where appropriate, but also giving administrators manual control over the system state when needed
- Exhibiting predictable behavior, minimizing surprises
Simplicity, Managing Complexity
Make it easy for new engineers to understand the system, by removing as much complexity as possible from the system.
One of the best tools we have for removing accidental complexity is abstraction. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade.
Evolvability, Making Change Easy
Make it easy for engineers to make changes to the system in the future, adapting it for unanticipated use cases as requirements change. Also known as extensibility, modifiability, or plasticity.
CHAPTER 2 Data Models and Query Languages
The JSON representation has better locality than the multi-table schema.
Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database.
A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it).
The difference between the approaches is particularly noticeable in situations where an application wants to change the format of its data. In a document database, you would just start writing new documents with the new fields and have code in the application that handles the case when old documents are read. In a relational database, the typical practice when the schema changes is to run a migration, which may be slow and may require downtime.
Query Languages for Data
When the relational model was introduced, it included a new way of querying data: SQL is a declarative query language, whereas IMS and CODASYL queried the database using imperative code.
A declarative query language is attractive because it is typically more concise and easier to work with than an imperative API. But more importantly, it also hides implementation details of the database engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries.
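The contrast can be made concrete with the book's sharks example: the declarative query states *what* result is wanted and leaves the access path to the engine, while the imperative version spells out *how* to find it, step by step (a sketch using Python's built-in sqlite3):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animals (name TEXT, family TEXT)")
conn.executemany("INSERT INTO animals VALUES (?, ?)",
                 [("shark", "Sharks"),
                  ("lion", "Felidae"),
                  ("tiger", "Felidae")])

# Declarative: describe the result; the engine picks the execution plan
# (index lookup, full scan, parallel scan...) and can change it later.
sharks_sql = conn.execute(
    "SELECT name FROM animals WHERE family = 'Sharks'").fetchall()

# Imperative: iterate over every row and filter by hand; the traversal
# order and strategy are now baked into the application code.
sharks_loop = []
for name, family in conn.execute("SELECT name, family FROM animals"):
    if family == "Sharks":
        sharks_loop.append(name)

print(sharks_sql)   # [('shark',)]
print(sharks_loop)  # ['shark']
```

Both produce the same answer, but only the declarative version lets the database add an index on `family` tomorrow and speed the query up without touching this code.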
Graph-Like Data Models
A graph consists of two kinds of objects: vertices (also known as nodes or entities) and edges (also known as relationships or arcs). You can think of a graph store as consisting of two relational tables, one for vertices and one for edges.
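The two-table view can be sketched directly. The column names follow the book's relational schema for a property graph (tail and head vertex, plus a label); the Lucy/Idaho data is the book's example, and this is an illustration rather than a production design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vertices (
    vertex_id   INTEGER PRIMARY KEY,
    properties  TEXT                -- e.g. a JSON blob of key-value pairs
);
CREATE TABLE edges (
    edge_id     INTEGER PRIMARY KEY,
    tail_vertex INTEGER REFERENCES vertices(vertex_id),  -- edge starts here
    head_vertex INTEGER REFERENCES vertices(vertex_id),  -- edge ends here
    label       TEXT                -- kind of relationship
);
""")

conn.execute("INSERT INTO vertices VALUES (1, '{\"name\": \"Lucy\"}')")
conn.execute("INSERT INTO vertices VALUES (2, '{\"name\": \"Idaho\"}')")
conn.execute("INSERT INTO edges VALUES (1, 1, 2, 'born_in')")

# Traverse all outgoing edges of vertex 1 (Lucy) by joining edges
# back to the vertices table on the head of each edge.
rows = conn.execute("""
    SELECT e.label, v.properties
    FROM edges e JOIN vertices v ON v.vertex_id = e.head_vertex
    WHERE e.tail_vertex = 1
""").fetchall()
print(rows)  # [('born_in', '{"name": "Idaho"}')]
```

Because any vertex can have an edge to any other vertex, and you can index both `tail_vertex` and `head_vertex`, this layout supports traversing relationships in either direction with no schema restrictions on what can connect to what.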
The triple-store model is mostly equivalent to the property graph model, using different words to describe the same ideas. In a triple-store, all information is stored in the form of very simple three-part statements: (subject, predicate, object). For example, in the triple (Jim, likes, bananas), Jim is the subject, likes is the predicate (verb), and bananas is the object.
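In code, a triple-store is just a collection of three-part statements, and a query is a pattern matched against them. A toy sketch starting from the book's (Jim, likes, bananas) triple, with a couple of extra made-up triples in the same style:

```python
# Each fact is one (subject, predicate, object) statement.
triples = [
    ("Jim",  "likes",    "bananas"),
    ("Jim",  "lives_in", "Idaho"),
    ("Lucy", "likes",    "bananas"),
]

# "Query": bind the subject, fixing the predicate and object —
# roughly what `?who likes bananas` would express in a pattern language.
banana_fans = [s for (s, p, o) in triples
               if p == "likes" and o == "bananas"]
print(banana_fans)  # ['Jim', 'Lucy']
```

A real triple-store indexes the statements (e.g. by subject, by predicate, and by object) so such patterns don't require scanning every triple.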
SPARQL is a query language for triple-stores using the RDF data model. It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, queries in the two languages look quite similar.
Historically, data started out being represented as one big tree (the hierarchical model), but that wasn’t good for representing many-to-many relationships, so the relational model was invented to solve that problem. More recently, developers found that some applications don’t fit well in the relational model either. New nonrelational “NoSQL” datastores have diverged in two main directions:
- Document databases target use cases where data comes in self-contained documents and relationships between one document and another are rare.
- Graph databases go in the opposite direction, targeting use cases where anything is potentially related to everything.
All three models (document, relational, and graph) are widely used today, and each is good in its respective domain.
One thing that document and graph databases have in common is that they typically don’t enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements.