Skip to main content

Designing Data-Intensive Applications

by Martin Kleppmann

Chapter 1: Reliable, Scalable, and Maintainable Applications

Many applications are data-intensive and not compute intensive.

Thinking About Data Systems

This book focuses on three main concerns: reliability, scalability, and maintainability.

Reliability

Reliability means the system continues to work correctly, even when things go wrong.

Hardware Faults

TODO

Software Errors

TODO

Human Errors

TODO

Scalability

Scalability is the term we use to describe a system's ability to cope with increased load.

Describing Load

TODO

Describing Performance

Look at what happens when load increases.
- When you increase a load parameter and keep system resources fixed, how is the performance of your system affected?
- When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
Response time what the client sees.
Latency.
Tail latencies are important because they directly affect users' experience of the service.
Even if you call multiple backend services in parallel, the end-user request must wait for the slowest of the parallel calls to complete.
The right way of aggregating response time data is to add the histograms.

Approaches for Coping with Load

Vertical scaling is moving to a more powerful machine. Horizontal scaling distributes the load across multiple smaller machines.
In an early-stage startup it's more important to be able to iterate quickly on product features than to scale beyond some hypothetical future load.

Maintainability

The majority of the cost of software is not in its initial development, but its ongoing maintenance.
Three important design principles for maintainability are:
- Operability: Make it easy for the operations teams to keep the system running smoothly.
- Simplicity: Make it easy for new engineers to understand the system by removing complexity.
- Evolvability (or extensibility): Make it easy for engineers to adapt the system to unanticipated use cases as requirements change.

Operability: Making Life Easy for Operations

TODO

Simplicity: Managing Complexity

Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity.
Accidental complexity is complexity that is not inherent in the problem that the software solves, but arises only from the implementation.
Good abstractions are one of the best tools for removing accidental complexity by hiding implementation details, but finding good abstractions is very hard.

Evolvability: Making Change Easy

Simple and easy-to-understand systems are usually easier to modify than complex ones.

Summary

Functional requirements are what an application should do.
Nonfunctional requirements are general properties like security, reliability, compliance, scalability, compatibility, and maintainability.

Chapter 2: Data Models and Query Languages

Data models are the most important part of developing software, because they affect not only how the software is written, but how we think about the problem that we're solving.
TODO

Relational Model Versus Document Model

The Birth of NoSQL

TODO

The Object-Relational Mismatch

Impedance mismatch is the disconnect between objects in the application code and the database model of tables, rows, and columns.
A JSON representation of a document-oriented database has better locality than an equivalent multi-table schema.

Many-to-One and Many-to-Many Relationships

TODO

Are Document Databases Repeating History?

The relational model

TODO

Comparison to document databases

TODO

Relational Versus Document Databases Today

TODO

Which data model leads to simpler application code?

TODO

Schema flexibility in the document model

TODO

Data locality for queries

TODO

Query Languages for Data

TODO

MapReduce Querying

TODO

Graph-Like Data Models

Property Graphs

TODO

Graph Queries in SQL

TODO

Triple-Stores and SPARQL

TODO

The Foundation: Datalog

TODO

Chapter 3: Storage and Retrieval

There is a big difference between storage engines optimized for transactional workloads and those optimized for analytics.

Data Structures That Power Your Database

Log is defined as an append-only sequence of records. It doesn't have to be human-readable, and instead might be binary.
Well-chosen indexes speed up read queries, but every index slows down writes, because the index also needs to be updated every time data is written.

Hash Indexes

TODO

SSTables and LSM-Trees

TODO

Constructing and maintaining SSTables

TODO

Making an LSM-tree out of SSTables

Storage engines based on the principle of merging and compacting sorted files are called LSM (Log-Structured Merge) storage engines.

Performance optimizations

TODO

B-Trees

B-trees remain the standard index implementation in almost all relational databases, and many non-relational databases use them too.
B-trees break down the database into fixed-size blocks or pages (usually 4KB) and read or write one page at a time. This maps closely to the underlying disk hardware.
TODO

Making B-trees reliable

TODO

B-tree optimizations

TODO

Comparing B-Trees and LSM-Trees

LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads.

Advantages of LSM-trees

TODO

Downsides of LSM-trees

TODO

Other Indexing Structures

Storing values within the index

TODO

Multi-column indexes

TODO

Full-text search and fuzzy indexes

In Lucene, the in-memory index is a finite state automaton over the characters in the keys, similar to a trie.
This automaton can be transformed into a Levenshtein automaton, which supports efficient search for words within a given edit distance.

Keeping everything in memory

In-memory databases are faster because they avoid the overheads of encoding in-memory data structures in a form that can be written to disk.
In-memory databases are not faster because they don't need to read from disk – the operating system caches recently used disk blocks in memory anyway.

Transaction Processing or Analytics?

TODO

Data Warehousing

TODO

The divergence between OLTP databases and data warehouses

TODO

Stars and Snowflakes: Schemas for Analytics

TODO

Column-Oriented Storage

Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4 or 5 of them at one time.
Column-oriented storage stores all the values from each column together, instead of all the values from each row together.
The column-oriented storage layout relies on each column file containing the rows in the same order.

Column Compression

TODO

Memory bandwidth and vectorized processing

TODO

Sort Order in Column Storage

TODO

Several different sort orders

TODO

Writing to Column-Oriented Storage

TODO

Aggregation: Data Cubes and Materialized Views

TODO

Summary

TODO

Chapter 4: Encoding and Evolution

TODO

Formats for Encoding Data

Language-Specific Formats

TODO

JSON, XML, and Binary Variants

TODO

Binary encoding

TODO

Thrift and Protocol Buffers

TODO

Field tags and schema evolution

TODO

Datatypes and schema evolution

TODO

Avro

TODO

The writer's schema and the reader's schema

TODO

Schema evolution rules

TODO

Dynamically generated schemas

TODO

The Merits of Schemas

TODO

Modes of Dataflow

Dataflow through Databases

TODO

Different values written at different times

TODO

Dataflow Through Services: REST and RPC

TODO

Web services

TODO

The problems with remote procedure calls (RPCs)

TODO

Current directions for RPC

TODO

Data encoding and evolution for RPC

TODO

Message-Passing Dataflow

TODO

Message brokers

TODO

Distributed actor frameworks

TODO

Chapter 5: Replication

Reasons to replicate data include reducing access latency by moving data geographically close to users, increasing availability, and increasing read throughput.
All the difficulty in replication lies in handling changes to replicated data.

Leaders and Followers

Leader-based replication, or active/passive replication, is the most common solution to ensuring that data is persisted to all replicas.
Whenever the leader writes new data to its local storage, it also sends the data change to each of its followers via a replication log or change stream.

Synchronous Versus Asynchronous Replication

TODO

Setting Up New Followers

TODO

Handling Node Outages

Follower failure: Catch-up recovery

The follower can connect to the leader and, via the replication log, consume all data changes that occurred while the follower was disconnected.

Leader failure: Failover

A failover requires promoting one follower as the new leader, reconfiguring clients to send data to the new leader, and reconfiguring other followers to consume data changes from the new leader.
TODO

Implementation of Replication Logs

Statement-based replication

Replicating every write request (statement) from a leader to its followers has many edge cases (e.g. non-determinism from NOW() or RAND()) and is not preferred.

Write-ahead log (WAL) shipping

The WAL details which bytes were changed in which blocks. It's thus a poor choice for a replication log, e.g. you cannot run different versions of the database on leaders and followers.

Logical (row-based) log replication

A logical log is a sequence of records describing writes to database tables with row granularity. This is the approach MySQL's binlog uses.
Logical logs are decoupled from storage engine internals and therefore backward compatible, and are also easier for external applications to parse.

Trigger-based replication

TODO

Problems with Replication Lag

TODO

Reading Your Own Writes

TODO

Monotonic Reads

TODO

Consistent Prefix Reads

TODO

Solutions for Replication Lag

TODO

Multi-Leader Replication

TODO

Use Cases for Multi-Leader Replication

Multi-datacenter operation

TODO

Clients with offline operation

TODO

Handling Write Conflicts

Synchronous versus asynchronous conflict detection

TODO

Converging toward a consistent state

TODO

Custom conflict resolution logic

Conflict-free replicated data types (CDRTs) are data structures (e.g. sets, maps, ordered lists, etc) that automatically resolve conflicts in sensible ways.
Mergeable persistent data structures track history explicitly and use a three-way merge function (similar to Git).
Operational transformations are for concurrent editing of an ordered list of items, and are used by Etherpad and Google Docs.

Multi-Leader Replication Topologies

TODO

Leaderless Replication

TODO

Writing to the Database When a Node is Down

TODO

Read repair and anti-entropy

TODO

Quorums for reading and writing

TODO

Limitations of Quorum Consistency

TODO

Monitoring staleness

TODO

Sloppy Quorums and Hinted Handoff

TODO

Detecting Concurrent Writes

TODO

Last write wins (discarding concurrent writes)

TODO

The "happens-before" relationship and concurrency

TODO

Capturing the happens-before relationship

TODO

Merging concurrently written values

TODO

Version vectors

TODO

Chapter 6: Partitioning

TODO

Partitioning and Replication

TODO

Partitioning of Key-Value Data

TODO

Partitioning by Key Range

TODO

Partitioning by Hash of Key

TODO

Skewed Workloads and Relieving Hot Spots

TODO

Partitioning and Secondary Indexes

TODO

Partitioning Secondary Indexes by Document

TODO

Partitioning Secondary Indexes by Term

TODO

Rebalancing Partitions

TODO

Strategies for Rebalancing

How not to do it: hash mod N

TODO

Fixed number of partitions

TODO

Dynamic partitioning

TODO

Partitioning proportionally to nodes

TODO

Operations: Automatic or Manual Rebalancing

TODO

Request Routing

TODO

Chapter 7: Transactions

TODO

The Slippery Concept of a Transaction

TODO

The Meaning of ACID

TODO

Atomicity

TODO

Consistency

TODO

Isolation

TODO

Durability

TODO

Single-Object and Multi-Object Operations

TODO

Single-object writes

TODO

The need for multi-object transactions

TODO

Handling errors and aborts

TODO

Weak Isolation Levels

TODO

Read Committed

TODO

No dirty reads

TODO

No dirty writes

TODO

Implementing read committed

TODO

Snapshot Isolation and Repeatable Read

TODO

Implementing snapshot isolation

TODO

Visibility rules for observing a consistent snapshot

TODO

Indexes and snapshot isolation

TODO

Repeatable read and naming confusion

TODO

Preventing Lost Updates

TODO

Atomic write operations

TODO

Explicit locking

TODO

Automatically detecting lost updates

TODO

Compare-and-set

TODO

Conflict resolution and replication

TODO

Write Skew and Phantoms

TODO

Phantoms causing write skew

TODO

Materializing conflicts

TODO

Serializability

TODO

Actual Serial Execution

TODO

Encapsulating transactions in stored procedures

TODO

Pros and cons of stored procedures

TODO

Partitioning

TODO

Two-Phase Locking (2PL)

TODO

Implementation of two-phase locking

TODO

Performance of two-phase locking

TODO

Predicate locks

TODO

Index-range locks

TODO

Serializable Snapshot Isolation (SSI)

TODO

Pessimistic versus optimistic concurrency control

TODO

Decisions based on an outdated premise

TODO

Detecting stale MVCC reads

TODO

Detecting writes that affect prior reads

TODO

Performance of serializable snapshot isolation

TODO

Summary

TODO

Chapter 8: The Trouble with Distributed Systems

Faults and Partial Failures

TODO

Cloud Computing and Supercomputing

TODO

Unreliable Networks

TODO

Network Faults in Practice

TODO

Detecting Faults

TODO

Timeouts and Unbounded Delays

TODO

Network congestion and queueing

TODO

Synchronous Versus Asynchronous Networks

TODO

Can we not simply make network delays predictable?

TODO

Unreliable Clocks

TODO

Monotonic Versus Time-of-Day Clocks

TODO

Time-of-day clocks

TODO

Monotonic clocks

TODO

Clock Synchronization and Accuracy

TODO

Relying on Synchronized Clocks

TODO

Timestamps for ordering events

TODO

Clock readings have a confidence interval

TODO

Synchronized clocks for global snapshots

TODO

Process Pauses

TODO

Response time guarantees

TODO

Limiting the impact of garbage collection

TODO

Knowledge, Truth, and Lies

TODO

The Truth is Defined by the Majority

TODO

The leader and the clock

TODO

Fencing tokens

TODO

Byzantine Faults

TODO

System Model and Reality

TODO

Correctness of an algorithm

TODO

Safety and liveness

TODO

Mapping system models to the real world

TODO

Summary

TODO

Chapter 9: Consistency and Consensus

TODO

Consistency Guarantees

TODO

Linearizability

TODO

What Makes a System Linearizable?

TODO

Relying on Linearizability

Locking and leader election

TODO

Constraints and uniqueness guarantees

TODO

Cross-channel timing dependencies

TODO

Implementing Linearizable Systems

TODO

The Cost of Linearizability

TODO

Linearizability and network delays

TODO

Ordering Guarantees

TODO

Ordering and Causality

TODO

The causal order is not a total order

TODO

Linearizability is stronger than causal consistency

TODO

Capturing causal dependencies

TODO

Sequence Number Ordering

TODO

Noncausal sequence number generators

TODO

Lamport timestamps

TODO

Timestamp ordering is not sufficient

TODO

Total Order Broadcast

TODO

Using total order broadcast

TODO

Implementing linearizable storage using total order broadcast

TODO

Implementing total order broadcast using linearizable storage

TODO

Distributed Transactions and Consensus

TODO

Atomic Commit and Two-Phase Commit (2PC)

TODO

From single-node to distributed atomic commit

TODO

Introduction to two-phase commit

TODO

A system of promises

TODO

Coordinator failure

TODO

Distributed Transactions in Practice

TODO

XA transactions

TODO

Holding locks while in doubt

TODO

Recovering from coordinator failure

TODO

Limitations of distributed transactions

TODO

Fault-Tolerant Consensus

TODO

Consensus algorithms and total order broadcast

TODO

Epoch numbering and quorums

TODO

Limitations of consensus

TODO

Membership and Coordination Services

TODO

Allocating work to nodes

TODO

Service discovery

TODO

Chapter 10: Batch Processing

TODO

Batch Processing with Unix Tools

Simple Log Analysis

Sorting versus in-memory aggregation

TODO

The Unix Philosophy

TODO

A uniform interface

TODO

Separation of logic and wiring

TODO

Transparency and experimentation

TODO

MapReduce and Distributed Filesystems

TODO

MapReduce Job Execution

TODO

Distributed execution of MapReduce

TODO

MapReduce workflows

TODO

Reduce-Side Joins and Grouping

TODO

Example: analysis of user activity events

TODO

Sort-merge joins

TODO

TODO

GROUP BY

TODO

Handling skew

TODO

Map-Side Joins

TODO

Broadcast hash joins

TODO

Partitioned hash joins

TODO

Map-side merge joins

TODO

MapReduce workflows with map-side joins

TODO

The Output of Batch Workflows

TODO

Key-value stores as batch process output

TODO

Philosophy of batch process outputs

TODO

Comparing Hadoop to Distributed Databases

TODO

Diversity of storage

TODO

Diversity of processing models

TODO

Designing for frequent faults

TODO

Beyond MapReduce

TODO

Materialization of Intermediate State

TODO

Dataflow engines

TODO

Fault tolerance

TODO

Discussion of materialization

TODO

Graphs and Iterative Processing

TODO

The Pregel processing model

TODO

Fault tolerance

TODO

Parallel execution

TODO

High Level APIs and Languages

TODO

The move toward declarative query languages

TODO

Specialization for different domains

TODO

Summary

TODO

Chapter 11: Stream Processing

TODO

Transmitting Event Streams

TODO

Messaging Systems

TODO

Direct messaging from producers to consumers

TODO

Message brokers

TODO

Message brokers compared to databases

TODO

Multiple consumers

TODO

Acknowledgments and redelivery

TODO

Partitioned Logs

TODO

Using logs for message storage

TODO

Logs compared to traditional messaging

TODO

Consumer offsets

TODO

Disk space usage

TODO

When consumers cannot keep up with producers

TODO

Databases and Streams

TODO

Keeping Streams in Sync

TODO

Change Data Capture

TODO

Implementing change data capture

TODO

Initial snapshot

TODO

Log compaction

TODO

Event Sourcing

TODO

Deriving current state from the event log

TODO

Commands and events

TODO

State, Streams, and Immutability

TODO

Advantages of immutable events

TODO

Deriving several views from the same event log

TODO

Concurrency control

TODO

Limitations to immutability

TODO

Processing Streams

TODO

Uses of Stream Processing

Complex event processing

TODO

Stream analytics

TODO

Search on streams

TODO

Reasoning About Time

TODO

Event time versus processing time

TODO

Knowing when you're ready

TODO

Whose clock are you using, anyway?

TODO

Types of windows

TODO

Stream Joins

TODO

Stream-stream join (window join)

TODO

Stream-table join (stream enrichment)

TODO

Table-table join (materialized view maintenance)

TODO

Time-dependence of joins

TODO

Fault Tolerance

TODO

Microbatching and checkpointing

TODO

Atomic commit revisited

TODO

Idempotence

TODO

Rebuilding state after a failure

TODO

Summary

TODO

Chapter 12: The Future of Data Systems

Data Integration

TODO

Combining Specialized Tools by Deriving Data

TODO

Reasoning about dataflows

TODO

Derived data versus distributed transactions

TODO

The limits of total ordering

TODO

Ordering events to capture causality

TODO

Batch and Stream Processing

TODO

Maintaining derived state

TODO

Reprocessing data for application evolution

TODO

The lambda architecture

TODO

Unbundling Databases

TODO

Composing Data Storage Technologies

The meta-database of everything

TODO

Making unbundling work

TODO

Unbundling versus integrated systems

TODO

Designing Applications Around Dataflow

TODO

Application code as a derivation function

TODO

Separation of application code and state

TODO

Dataflow: Interplay between state changes and application code

TODO

Stream processors and services

TODO

Observed Derived State

TODO

Materialized views and caching

TODO

Stateful, offline-capable clients

TODO

Pushing state changes to clients

TODO

End-to-end event streams

TODO

Reads are events too

TODO

Multi-partition data processing

TODO

Aiming for Correctness

TODO

The End-to-End Argument for Databases

Exactly-once execution of an operation

TODO

Duplicate suppression

TODO

Uniquely identifying requests

TODO

The end-to-end argument

TODO

Applying end-to-end thinking in data systems

TODO

Enforcing Constraints

Uniqueness constraints require consensus

TODO

Uniqueness in log-based messaging

TODO

Multi-partition request processing

TODO

Timeliness and Integrity

TODO

Correctness of dataflow systems

TODO

Loosely interpreted constraints

TODO

Coordinating-avoiding data systems

TODO

Trust, but Verify

TODO

Maintaining integrity in the face of software bugs

TODO

Don't blindly trust what they promise

TODO

Designing for auditability

TODO

The end-to-end argument again

TODO

Tools for auditable data systems

TODO

Doing the Right Thing

TODO

Predictive Analytics

TODO

Bits and discrimination

TODO

Responsibility and accountability

TODO

Feedback loops

TODO

Privacy and Tracking

TODO

Surveillance

TODO

TODO

Privacy and use of data

TODO

Data as assets and power

TODO

Remembering the Industrial Revolution

TODO

Chapter 1: Reliable, Scalable, and Maintainable Applications
Chapter 2: Data Models and Query Languages
Chapter 3: Storage and Retrieval
Chapter 4: Encoding and Evolution
- Formats for Encoding Data
- Modes of Dataflow
Chapter 5: Replication
Chapter 6: Partitioning
Chapter 7: Transactions
Chapter 8: The Trouble with Distributed Systems
Chapter 9: Consistency and Consensus
Chapter 10: Batch Processing
Chapter 11: Stream Processing
Chapter 12: The Future of Data Systems