Skip to main content

Designing Data-Intensive Applications

by Martin Kleppmann

Chapter 1: Reliable, Scalable, and Maintainable Applications

  • Many applications are data-intensive and not compute intensive.

Thinking About Data Systems

  • This book focuses on three main concerns: reliability, scalability, and maintainability.

Reliability

  • Reliability means the system continues to work correctly, even when things go wrong.

Hardware Faults

  • TODO

Software Errors

  • TODO

Human Errors

  • TODO

Scalability

  • Scalability is the term we use to describe a system's ability to cope with increased load.

Describing Load

  • TODO

Describing Performance

  • Look at what happens when load increases.
    • When you increase a load parameter and keep system resources fixed, how is the performance of your system affected?
    • When you increase a load parameter, how much do you need to increase the resources if you want to keep performance unchanged?
  • Response time what the client sees.
  • Latency.
  • Tail latencies are important because they directly affect users' experience of the service.
  • Even if you call multiple backend services in parallel, the end-user request must wait for the slowest of the parallel calls to complete.
  • The right way of aggregating response time data is to add the histograms.

Approaches for Coping with Load

  • Vertical scaling is moving to a more powerful machine. Horizontal scaling distributes the load across multiple smaller machines.
  • In an early-stage startup it's more important to be able to iterate quickly on product features than to scale beyond some hypothetical future load.

Maintainability

  • The majority of the cost of software is not in its initial development, but its ongoing maintenance.
  • Three important design principles for maintainability are:
    • Operability: Make it easy for the operations teams to keep the system running smoothly.
    • Simplicity: Make it easy for new engineers to understand the system by removing complexity.
    • Evolvability (or extensibility): Make it easy for engineers to adapt the system to unanticipated use cases as requirements change.

Operability: Making Life Easy for Operations

  • TODO

Simplicity: Managing Complexity

  • Making a system simpler does not necessarily mean reducing its functionality; it can also mean removing accidental complexity.
  • Accidental complexity is complexity that is not inherent in the problem that the software solves, but arises only from the implementation.
  • Good abstractions are one of the best tools for removing accidental complexity by hiding implementation details, but finding good abstractions is very hard.

Evolvability: Making Change Easy

  • Simple and easy-to-understand systems are usually easier to modify than complex ones.

Summary

  • Functional requirements are what an application should do.
  • Nonfunctional requirements are general properties like security, reliability, compliance, scalability, compatibility, and maintainability.

Chapter 2: Data Models and Query Languages

  • Data models are the most important part of developing software, because they affect not only how the software is written, but how we think about the problem that we're solving.
  • TODO

Relational Model Versus Document Model

The Birth of NoSQL

  • TODO

The Object-Relational Mismatch

  • Impedance mismatch is the disconnect between objects in the application code and the database model of tables, rows, and columns.
  • A JSON representation of a document-oriented database has better locality than an equivalent multi-table schema.

Many-to-One and Many-to-Many Relationships

  • TODO

Are Document Databases Repeating History?

The relational model
  • TODO
Comparison to document databases
  • TODO

Relational Versus Document Databases Today

  • TODO
Which data model leads to simpler application code?
  • TODO
Schema flexibility in the document model
  • TODO
Data locality for queries
  • TODO

Query Languages for Data

  • TODO

MapReduce Querying

  • TODO

Graph-Like Data Models

Property Graphs

  • TODO

Graph Queries in SQL

  • TODO

Triple-Stores and SPARQL

  • TODO

The Foundation: Datalog

  • TODO

Chapter 3: Storage and Retrieval

  • There is a big difference between storage engines optimized for transactional workloads and those optimized for analytics.

Data Structures That Power Your Database

  • Log is defined as an append-only sequence of records. It doesn't have to be human-readable, and instead might be binary.
  • Well-chosen indexes speed up read queries, but every index slows down writes, because the index also needs to be updated every time data is written.

Hash Indexes

  • TODO

SSTables and LSM-Trees

  • TODO
Constructing and maintaining SSTables
  • TODO
Making an LSM-tree out of SSTables
  • Storage engines based on the principle of merging and compacting sorted files are called LSM (Log-Structured Merge) storage engines.
Performance optimizations
  • TODO

B-Trees

  • B-trees remain the standard index implementation in almost all relational databases, and many non-relational databases use them too.
  • B-trees break down the database into fixed-size blocks or pages (usually 4KB) and read or write one page at a time. This maps closely to the underlying disk hardware.
  • TODO
Making B-trees reliable
  • TODO
B-tree optimizations
  • TODO

Comparing B-Trees and LSM-Trees

  • LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads.
Advantages of LSM-trees
  • TODO
Downsides of LSM-trees
  • TODO

Other Indexing Structures

Storing values within the index
  • TODO
Multi-column indexes
  • TODO
Full-text search and fuzzy indexes
  • In Lucene, the in-memory index is a finite state automaton over the characters in the keys, similar to a trie.
  • This automaton can be transformed into a Levenshtein automaton, which supports efficient search for words within a given edit distance.
Keeping everything in memory
  • In-memory databases are faster because they avoid the overheads of encoding in-memory data structures in a form that can be written to disk.
  • In-memory databases are not faster because they don't need to read from disk – the operating system caches recently used disk blocks in memory anyway.

Transaction Processing or Analytics?

  • TODO

Data Warehousing

  • TODO
The divergence between OLTP databases and data warehouses
  • TODO

Stars and Snowflakes: Schemas for Analytics

  • TODO

Column-Oriented Storage

  • Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4 or 5 of them at one time.
  • Column-oriented storage stores all the values from each column together, instead of all the values from each row together.
  • The column-oriented storage layout relies on each column file containing the rows in the same order.

Column Compression

  • TODO
Memory bandwidth and vectorized processing
  • TODO

Sort Order in Column Storage

  • TODO
Several different sort orders
  • TODO

Writing to Column-Oriented Storage

  • TODO

Aggregation: Data Cubes and Materialized Views

  • TODO

Summary

  • TODO

Chapter 4: Encoding and Evolution

  • TODO

Formats for Encoding Data

Language-Specific Formats

  • TODO

JSON, XML, and Binary Variants

  • TODO
Binary encoding
  • TODO

Thrift and Protocol Buffers

  • TODO
Field tags and schema evolution
  • TODO
Datatypes and schema evolution
  • TODO

Avro

  • TODO
The writer's schema and the reader's schema
  • TODO
Schema evolution rules
  • TODO
Dynamically generated schemas
  • TODO

The Merits of Schemas

  • TODO

Modes of Dataflow

Dataflow through Databases

  • TODO
Different values written at different times
  • TODO

Dataflow Through Services: REST and RPC

  • TODO
Web services
  • TODO
The problems with remote procedure calls (RPCs)
  • TODO
Current directions for RPC
  • TODO
Data encoding and evolution for RPC
  • TODO

Message-Passing Dataflow

  • TODO
Message brokers
  • TODO
Distributed actor frameworks
  • TODO

Chapter 5: Replication

  • Reasons to replicate data include reducing access latency by moving data geographically close to users, increasing availability, and increasing read throughput.
  • All the difficulty in replication lies in handling changes to replicated data.

Leaders and Followers

  • Leader-based replication, or active/passive replication, is the most common solution to ensuring that data is persisted to all replicas.
  • Whenever the leader writes new data to its local storage, it also sends the data change to each of its followers via a replication log or change stream.

Synchronous Versus Asynchronous Replication

  • TODO

Setting Up New Followers

  • TODO

Handling Node Outages

Follower failure: Catch-up recovery
  • The follower can connect to the leader and, via the replication log, consume all data changes that occurred while the follower was disconnected.
Leader failure: Failover
  • A failover requires promoting one follower as the new leader, reconfiguring clients to send data to the new leader, and reconfiguring other followers to consume data changes from the new leader.
  • TODO

Implementation of Replication Logs

Statement-based replication
  • Replicating every write request (statement) from a leader to its followers has many edge cases (e.g. non-determinism from NOW() or RAND()) and is not preferred.
Write-ahead log (WAL) shipping
  • The WAL details which bytes were changed in which blocks. It's thus a poor choice for a replication log, e.g. you cannot run different versions of the database on leaders and followers.
Logical (row-based) log replication
  • A logical log is a sequence of records describing writes to database tables with row granularity. This is the approach MySQL's binlog uses.
  • Logical logs are decoupled from storage engine internals and therefore backward compatible, and are also easier for external applications to parse.
Trigger-based replication
  • TODO

Problems with Replication Lag

  • TODO

Reading Your Own Writes

  • TODO

Monotonic Reads

  • TODO

Consistent Prefix Reads

  • TODO

Solutions for Replication Lag

  • TODO

Multi-Leader Replication

  • TODO

Use Cases for Multi-Leader Replication

Multi-datacenter operation
  • TODO
Clients with offline operation
  • TODO

Handling Write Conflicts

Synchronous versus asynchronous conflict detection
  • TODO
Converging toward a consistent state
  • TODO
Custom conflict resolution logic
  • Conflict-free replicated data types (CDRTs) are data structures (e.g. sets, maps, ordered lists, etc) that automatically resolve conflicts in sensible ways.
  • Mergeable persistent data structures track history explicitly and use a three-way merge function (similar to Git).
  • Operational transformations are for concurrent editing of an ordered list of items, and are used by Etherpad and Google Docs.

Multi-Leader Replication Topologies

  • TODO

Leaderless Replication

  • TODO

Writing to the Database When a Node is Down

  • TODO
Read repair and anti-entropy
  • TODO
Quorums for reading and writing
  • TODO

Limitations of Quorum Consistency

  • TODO
Monitoring staleness
  • TODO

Sloppy Quorums and Hinted Handoff

  • TODO

Detecting Concurrent Writes

  • TODO
Last write wins (discarding concurrent writes)
  • TODO
The "happens-before" relationship and concurrency
  • TODO
Capturing the happens-before relationship
  • TODO
Merging concurrently written values
  • TODO
Version vectors
  • TODO

Chapter 6: Partitioning

  • TODO

Partitioning and Replication

  • TODO

Partitioning of Key-Value Data

  • TODO

Partitioning by Key Range

  • TODO

Partitioning by Hash of Key

  • TODO

Skewed Workloads and Relieving Hot Spots

  • TODO

Partitioning and Secondary Indexes

  • TODO

Partitioning Secondary Indexes by Document

  • TODO

Partitioning Secondary Indexes by Term

  • TODO

Rebalancing Partitions

  • TODO

Strategies for Rebalancing

How not to do it: hash mod N
  • TODO
Fixed number of partitions
  • TODO
Dynamic partitioning
  • TODO
Partitioning proportionally to nodes
  • TODO

Operations: Automatic or Manual Rebalancing

  • TODO

Request Routing

  • TODO

Chapter 7: Transactions

  • TODO

The Slippery Concept of a Transaction

  • TODO

The Meaning of ACID

  • TODO
Atomicity
  • TODO
Consistency
  • TODO
Isolation
  • TODO
Durability
  • TODO

Single-Object and Multi-Object Operations

  • TODO
Single-object writes
  • TODO
The need for multi-object transactions
  • TODO
Handling errors and aborts
  • TODO

Weak Isolation Levels

  • TODO

Read Committed

  • TODO
No dirty reads
  • TODO
No dirty writes
  • TODO
Implementing read committed
  • TODO

Snapshot Isolation and Repeatable Read

  • TODO
Implementing snapshot isolation
  • TODO
Visibility rules for observing a consistent snapshot
  • TODO
Indexes and snapshot isolation
  • TODO
Repeatable read and naming confusion
  • TODO

Preventing Lost Updates

  • TODO
Atomic write operations
  • TODO
Explicit locking
  • TODO
Automatically detecting lost updates
  • TODO
Compare-and-set
  • TODO
Conflict resolution and replication
  • TODO

Write Skew and Phantoms

  • TODO
Phantoms causing write skew
  • TODO
Materializing conflicts
  • TODO

Serializability

  • TODO

Actual Serial Execution

  • TODO
Encapsulating transactions in stored procedures
  • TODO
Pros and cons of stored procedures
  • TODO
Partitioning
  • TODO

Two-Phase Locking (2PL)

  • TODO
Implementation of two-phase locking
  • TODO
Performance of two-phase locking
  • TODO
Predicate locks
  • TODO
Index-range locks
  • TODO

Serializable Snapshot Isolation (SSI)

  • TODO
Pessimistic versus optimistic concurrency control
  • TODO
Decisions based on an outdated premise
  • TODO
Detecting stale MVCC reads
  • TODO
Detecting writes that affect prior reads
  • TODO
Performance of serializable snapshot isolation
  • TODO

Summary

  • TODO

Chapter 8: The Trouble with Distributed Systems

Faults and Partial Failures

  • TODO

Cloud Computing and Supercomputing

  • TODO

Unreliable Networks

  • TODO

Network Faults in Practice

  • TODO

Detecting Faults

  • TODO

Timeouts and Unbounded Delays

  • TODO
Network congestion and queueing
  • TODO

Synchronous Versus Asynchronous Networks

  • TODO
Can we not simply make network delays predictable?
  • TODO

Unreliable Clocks

  • TODO

Monotonic Versus Time-of-Day Clocks

  • TODO
Time-of-day clocks
  • TODO
Monotonic clocks
  • TODO

Clock Synchronization and Accuracy

  • TODO

Relying on Synchronized Clocks

  • TODO
Timestamps for ordering events
  • TODO
Clock readings have a confidence interval
  • TODO
Synchronized clocks for global snapshots
  • TODO

Process Pauses

  • TODO
Response time guarantees
  • TODO
Limiting the impact of garbage collection
  • TODO

Knowledge, Truth, and Lies

  • TODO

The Truth is Defined by the Majority

  • TODO
The leader and the clock
  • TODO
Fencing tokens
  • TODO

Byzantine Faults

  • TODO

System Model and Reality

  • TODO
Correctness of an algorithm
  • TODO
Safety and liveness
  • TODO
Mapping system models to the real world
  • TODO

Summary

  • TODO

Chapter 9: Consistency and Consensus

  • TODO

Consistency Guarantees

  • TODO

Linearizability

  • TODO

What Makes a System Linearizable?

  • TODO

Relying on Linearizability

Locking and leader election
  • TODO
Constraints and uniqueness guarantees
  • TODO
Cross-channel timing dependencies
  • TODO

Implementing Linearizable Systems

  • TODO

The Cost of Linearizability

  • TODO
Linearizability and network delays
  • TODO

Ordering Guarantees

  • TODO

Ordering and Causality

  • TODO
The causal order is not a total order
  • TODO
Linearizability is stronger than causal consistency
  • TODO
Capturing causal dependencies
  • TODO

Sequence Number Ordering

  • TODO
Noncausal sequence number generators
  • TODO
Lamport timestamps
  • TODO
Timestamp ordering is not sufficient
  • TODO

Total Order Broadcast

  • TODO
Using total order broadcast
  • TODO
Implementing linearizable storage using total order broadcast
  • TODO
Implementing total order broadcast using linearizable storage
  • TODO

Distributed Transactions and Consensus

  • TODO

Atomic Commit and Two-Phase Commit (2PC)

  • TODO
From single-node to distributed atomic commit
  • TODO
Introduction to two-phase commit
  • TODO
A system of promises
  • TODO
Coordinator failure
  • TODO

Distributed Transactions in Practice

  • TODO
XA transactions
  • TODO
Holding locks while in doubt
  • TODO
Recovering from coordinator failure
  • TODO
Limitations of distributed transactions
  • TODO

Fault-Tolerant Consensus

  • TODO
Consensus algorithms and total order broadcast
  • TODO
Epoch numbering and quorums
  • TODO
Limitations of consensus
  • TODO

Membership and Coordination Services

  • TODO
Allocating work to nodes
  • TODO
Service discovery
  • TODO

Chapter 10: Batch Processing

  • TODO

Batch Processing with Unix Tools

Simple Log Analysis

Sorting versus in-memory aggregation
  • TODO

The Unix Philosophy

  • TODO
A uniform interface
  • TODO
Separation of logic and wiring
  • TODO
Transparency and experimentation
  • TODO

MapReduce and Distributed Filesystems

  • TODO

MapReduce Job Execution

  • TODO
Distributed execution of MapReduce
  • TODO
MapReduce workflows
  • TODO

Reduce-Side Joins and Grouping

  • TODO
Example: analysis of user activity events
  • TODO
Sort-merge joins
  • TODO
  • TODO
GROUP BY
  • TODO
Handling skew
  • TODO

Map-Side Joins

  • TODO
Broadcast hash joins
  • TODO
Partitioned hash joins
  • TODO
Map-side merge joins
  • TODO
MapReduce workflows with map-side joins
  • TODO

The Output of Batch Workflows

  • TODO
Key-value stores as batch process output
  • TODO
Philosophy of batch process outputs
  • TODO

Comparing Hadoop to Distributed Databases

  • TODO
Diversity of storage
  • TODO
Diversity of processing models
  • TODO
Designing for frequent faults
  • TODO

Beyond MapReduce

  • TODO

Materialization of Intermediate State

  • TODO
Dataflow engines
  • TODO
Fault tolerance
  • TODO
Discussion of materialization
  • TODO

Graphs and Iterative Processing

  • TODO
The Pregel processing model
  • TODO
Fault tolerance
  • TODO
Parallel execution
  • TODO

High Level APIs and Languages

  • TODO
The move toward declarative query languages
  • TODO
Specialization for different domains
  • TODO

Summary

  • TODO

Chapter 11: Stream Processing

  • TODO

Transmitting Event Streams

  • TODO

Messaging Systems

  • TODO
Direct messaging from producers to consumers
  • TODO
Message brokers
  • TODO
Message brokers compared to databases
  • TODO
Multiple consumers
  • TODO
Acknowledgments and redelivery
  • TODO

Partitioned Logs

  • TODO
Using logs for message storage
  • TODO
Logs compared to traditional messaging
  • TODO
Consumer offsets
  • TODO
Disk space usage
  • TODO
When consumers cannot keep up with producers
  • TODO

Databases and Streams

  • TODO

Keeping Streams in Sync

  • TODO

Change Data Capture

  • TODO
Implementing change data capture
  • TODO
Initial snapshot
  • TODO
Log compaction
  • TODO

Event Sourcing

  • TODO
Deriving current state from the event log
  • TODO
Commands and events
  • TODO

State, Streams, and Immutability

  • TODO
Advantages of immutable events
  • TODO
Deriving several views from the same event log
  • TODO
Concurrency control
  • TODO
Limitations to immutability
  • TODO

Processing Streams

  • TODO

Uses of Stream Processing

Complex event processing
  • TODO
Stream analytics
  • TODO
Search on streams
  • TODO

Reasoning About Time

  • TODO
Event time versus processing time
  • TODO
Knowing when you're ready
  • TODO
Whose clock are you using, anyway?
  • TODO
Types of windows
  • TODO

Stream Joins

  • TODO
Stream-stream join (window join)
  • TODO
Stream-table join (stream enrichment)
  • TODO
Table-table join (materialized view maintenance)
  • TODO
Time-dependence of joins
  • TODO

Fault Tolerance

  • TODO
Microbatching and checkpointing
  • TODO
Atomic commit revisited
  • TODO
Idempotence
  • TODO
Rebuilding state after a failure
  • TODO

Summary

  • TODO

Chapter 12: The Future of Data Systems

Data Integration

  • TODO

Combining Specialized Tools by Deriving Data

  • TODO
Reasoning about dataflows
  • TODO
Derived data versus distributed transactions
  • TODO
The limits of total ordering
  • TODO
Ordering events to capture causality
  • TODO

Batch and Stream Processing

  • TODO
Maintaining derived state
  • TODO
Reprocessing data for application evolution
  • TODO
The lambda architecture
  • TODO

Unbundling Databases

  • TODO

Composing Data Storage Technologies

The meta-database of everything
  • TODO
Making unbundling work
  • TODO
Unbundling versus integrated systems
  • TODO

Designing Applications Around Dataflow

  • TODO
Application code as a derivation function
  • TODO
Separation of application code and state
  • TODO
Dataflow: Interplay between state changes and application code
  • TODO
Stream processors and services
  • TODO

Observed Derived State

  • TODO
Materialized views and caching
  • TODO
Stateful, offline-capable clients
  • TODO
Pushing state changes to clients
  • TODO
End-to-end event streams
  • TODO
Reads are events too
  • TODO
Multi-partition data processing
  • TODO

Aiming for Correctness

  • TODO

The End-to-End Argument for Databases

Exactly-once execution of an operation
  • TODO
Duplicate suppression
  • TODO
Uniquely identifying requests
  • TODO
The end-to-end argument
  • TODO
Applying end-to-end thinking in data systems
  • TODO

Enforcing Constraints

Uniqueness constraints require consensus
  • TODO
Uniqueness in log-based messaging
  • TODO
Multi-partition request processing
  • TODO

Timeliness and Integrity

  • TODO
Correctness of dataflow systems
  • TODO
Loosely interpreted constraints
  • TODO
Coordinating-avoiding data systems
  • TODO

Trust, but Verify

  • TODO
Maintaining integrity in the face of software bugs
  • TODO
Don't blindly trust what they promise
  • TODO
Designing for auditability
  • TODO
The end-to-end argument again
  • TODO
Tools for auditable data systems
  • TODO

Doing the Right Thing

  • TODO

Predictive Analytics

  • TODO
Bits and discrimination
  • TODO
Responsibility and accountability
  • TODO
Feedback loops
  • TODO

Privacy and Tracking

  • TODO
Surveillance
  • TODO
  • TODO
Privacy and use of data
  • TODO
Data as assets and power
  • TODO
Remembering the Industrial Revolution
  • TODO