Index
A
- aborts (transactions), Transactions, Atomicity
- abstraction, Simplicity: Managing Complexity, Data Models and Query Languages, Transactions, Summary, Consistency and Consensus
- access path (in network model), The network model, The SPARQL query language
- accidental complexity, removing, Simplicity: Managing Complexity
- accountability, Responsibility and accountability
- ACID properties (transactions), Transaction Processing or Analytics?, The Meaning of ACID
- acknowledgments (messaging), Acknowledgments and redelivery
- active/active replication (see multi-leader replication)
- active/passive replication (see leader-based replication)
- ActiveMQ (messaging), Message brokers, Message brokers compared to databases
- ActiveRecord (object-relational mapper), The Object-Relational Mismatch, Handling errors and aborts
- actor model, Distributed actor frameworks
- Advanced Message Queuing Protocol (see AMQP)
- aerospace systems, Reliability, Human Errors, Byzantine Faults, Membership services
- aggregation
- aggregation pipeline query language, MapReduce Querying
- Agile, Evolvability: Making Change Easy
- agreement, Fault-Tolerant Consensus
- Airflow (workflow scheduler), MapReduce workflows
- Ajax, Dataflow Through Services: REST and RPC
- Akka (actor framework), Distributed actor frameworks
- algorithms
- all-to-all replication topologies, Multi-Leader Replication Topologies
- AllegroGraph (database), Graph-Like Data Models
- ALTER TABLE statement (SQL), Schema flexibility in the document model, Encoding and Evolution
- Amazon
- Amazon Web Services (AWS), Hardware Faults
- amplification
- AMQP (Advanced Message Queuing Protocol), Message brokers compared to databases
- analytics, Transaction Processing or Analytics?
- anti-caching (in-memory databases), Keeping everything in memory
- anti-entropy, Read repair and anti-entropy
- Apache ActiveMQ (see ActiveMQ)
- Apache Avro (see Avro)
- Apache Beam (see Beam)
- Apache BookKeeper (see BookKeeper)
- Apache Cassandra (see Cassandra)
- Apache CouchDB (see CouchDB)
- Apache Curator (see Curator)
- Apache Drill (see Drill)
- Apache Flink (see Flink)
- Apache Giraph (see Giraph)
- Apache Hadoop (see Hadoop)
- Apache HAWQ (see HAWQ)
- Apache HBase (see HBase)
- Apache Helix (see Helix)
- Apache Hive (see Hive)
- Apache Impala (see Impala)
- Apache Jena (see Jena)
- Apache Kafka (see Kafka)
- Apache Lucene (see Lucene)
- Apache MADlib (see MADlib)
- Apache Mahout (see Mahout)
- Apache Oozie (see Oozie)
- Apache Parquet (see Parquet)
- Apache Qpid (see Qpid)
- Apache Samza (see Samza)
- Apache Solr (see Solr)
- Apache Spark (see Spark)
- Apache Storm (see Storm)
- Apache Tajo (see Tajo)
- Apache Tez (see Tez)
- Apache Thrift (see Thrift)
- Apache ZooKeeper (see ZooKeeper)
- Apama (stream analytics), Complex event processing
- append-only B-trees, B-tree optimizations, Indexes and snapshot isolation
- append-only files (see logs)
- Application Programming Interfaces (APIs), Thinking About Data Systems, Data Models and Query Languages
- application state (see state)
- approximate search (see similarity search)
- archival storage, data from databases, Archival storage
- arcs (see edges)
- arithmetic mean, Describing Performance
- ASCII text, Thrift and Protocol Buffers, A uniform interface
- ASN.1 (schema language), The Merits of Schemas
- asynchronous networks, Unreliable Networks, Glossary
- asynchronous replication, Synchronous Versus Asynchronous Replication, Glossary
- Asynchronous Transfer Mode (ATM), Can we not simply make network delays predictable?
- atomic broadcast (see total order broadcast)
- atomic clocks (caesium clocks), Clock readings have a confidence interval, Synchronized clocks for global snapshots
- atomicity (concurrency), Glossary
- atomicity (transactions), Atomicity, Single-Object and Multi-Object Operations, Glossary
- auditability, Trust, but Verify-Tools for auditable data systems
- availability, Hardware Faults
- Avro (data format), Avro-Code generation and dynamically typed languages
- awk (Unix tool), Simple Log Analysis
- AWS (see Amazon Web Services)
- Azure (see Microsoft)
B
- B-trees (indexes), B-Trees-B-tree optimizations
- backpressure, Messaging Systems, Glossary
- backups
- backward compatibility, Encoding and Evolution
- BASE, contrast to ACID, The Meaning of ACID
- bash shell (Unix), Data Structures That Power Your Database, The Unix Philosophy, What’s missing?
- batch processing, Relational Model Versus Document Model, Batch Processing-Summary, Glossary
- combining with stream processing
- comparison to MPP databases, Comparing Hadoop to Distributed Databases-Designing for frequent faults
- comparison to stream processing, Processing Streams
- comparison to Unix, Philosophy of batch process outputs
- dataflow engines, Dataflow engines-Discussion of materialization
- fault tolerance, Bringing related data together in the same place, Philosophy of batch process outputs, Fault tolerance, Messaging Systems
- for data integration, Batch and Stream Processing-Unifying batch and stream processing
- graphs and iterative processing, Graphs and Iterative Processing-Parallel execution
- high-level APIs and languages, MapReduce workflows, High-Level APIs and Languages-Specialization for different domains
- log-based messaging and, Replaying old messages
- maintaining derived state, Maintaining derived state
- MapReduce and distributed filesystems, MapReduce and Distributed Filesystems-Key-value stores as batch process output
- measuring performance, Describing Performance, Batch Processing
- outputs, The Output of Batch Workflows-Key-value stores as batch process output
- using Unix tools (example), Batch Processing with Unix Tools-Sorting versus in-memory aggregation
- Bayou (database), Uniqueness in log-based messaging
- Beam (dataflow library), Unifying batch and stream processing
- bias, Bias and discrimination
- big ball of mud, Simplicity: Managing Complexity
- Bigtable data model, Data locality for queries, Column Compression
- binary data encodings, Binary encoding-The Merits of Schemas
- binary encoding
- binary strings, lack of support in JSON and XML, JSON, XML, and Binary Variants
- BinaryProtocol encoding (Thrift), Thrift and Protocol Buffers
- Bitcask (storage engine), Hash Indexes
- Bitcoin (cryptocurrency), Tools for auditable data systems
- bitmap indexes, Column Compression
- blockchains, Tools for auditable data systems
- blocking atomic commit, Three-phase commit
- Bloom (programming language), Designing Applications Around Dataflow
- Bloom filter (algorithm), Performance optimizations, Stream analytics
- BookKeeper (replicated log), Allocating work to nodes
- Bottled Water (change data capture), Implementing change data capture
- bounded datasets, Summary, Stream Processing, Glossary (see also batch processing)
- bounded delays, Glossary
- broadcast hash joins, Broadcast hash joins
- brokerless messaging, Direct messaging from producers to consumers
- Brubeck (metrics aggregator), Direct messaging from producers to consumers
- BTM (transaction coordinator), Introduction to two-phase commit
- bulk synchronous parallel (BSP) model, The Pregel processing model
- bursty network traffic patterns, Can we not simply make network delays predictable?
- business data processing, Relational Model Versus Document Model, Transaction Processing or Analytics?, Batch Processing
- byte sequence, encoding data in, Formats for Encoding Data
- Byzantine faults, Byzantine Faults-Weak forms of lying, System Model and Reality, Glossary
C
- caches, Keeping everything in memory, Glossary
- and materialized views, Aggregation: Data Cubes and Materialized Views
- as derived data, Derived Data, Composing Data Storage Technologies-What’s missing?
- database as cache of transaction log, State, Streams, and Immutability
- in CPUs, Memory bandwidth and vectorized processing, Linearizability and network delays, The move toward declarative query languages
- invalidation and maintenance, Keeping Systems in Sync, Maintaining materialized views
- linearizability, Linearizability
- CAP theorem, The CAP theorem, Glossary
- Cascading (batch processing), Beyond MapReduce, High-Level APIs and Languages
- cascading failures, Software Errors, Operations: Automatic or Manual Rebalancing, Timeouts and Unbounded Delays
- Cascalog (batch processing), The Foundation: Datalog
- Cassandra (database)
- column-family data model, Data locality for queries, Column Compression
- compaction strategy, Performance optimizations
- compound primary key, Partitioning by Hash of Key
- gossip protocol, Request Routing
- hash partitioning, Partitioning by Hash of Key
- last-write-wins conflict resolution, Last write wins (discarding concurrent writes), Timestamps for ordering events
- leaderless replication, Leaderless Replication
- linearizability, lack of, Linearizability and quorums
- log-structured storage, Making an LSM-tree out of SSTables
- multi-datacenter support, Multi-datacenter operation
- partitioning scheme, Partitioning proportionally to nodes
- secondary indexes, Partitioning Secondary Indexes by Document
- sloppy quorums, Sloppy Quorums and Hinted Handoff
- cat (Unix tool), Simple Log Analysis
- causal context, Version vectors (see also causal dependencies)
- causal dependencies, The “happens-before” relationship and concurrency-Version vectors
- causality, Glossary
- causal ordering, Ordering and Causality-Capturing causal dependencies
- consistency with, Sequence Number Ordering-Lamport timestamps
- consistent snapshots, Ordering and Causality
- happens-before relationship, The “happens-before” relationship and concurrency
- in serializable transactions, Decisions based on an outdated premise-Detecting writes that affect prior reads
- mismatch with clocks, Timestamps for ordering events
- ordering events to capture, Ordering events to capture causality
- violations of, Consistent Prefix Reads, Multi-Leader Replication Topologies, Timestamps for ordering events, Ordering and Causality
- with synchronized clocks, Synchronized clocks for global snapshots
- CEP (see complex event processing)
- certificate transparency, Tools for auditable data systems
- chain replication, Synchronous Versus Asynchronous Replication
- change data capture, Logical (row-based) log replication, Change Data Capture
- changelogs, State, Streams, and Immutability
- Chaos Monkey, Reliability, Network Faults in Practice
- checkpointing
- chronicle data model, Event Sourcing
- circuit-switched networks, Synchronous Versus Asynchronous Networks
- circular buffers, Disk space usage
- circular replication topologies, Multi-Leader Replication Topologies
- clickstream data, analysis of, Example: analysis of user activity events
- clients
- clocks, Unreliable Clocks-Limiting the impact of garbage collection
- atomic (caesium) clocks, Clock readings have a confidence interval, Synchronized clocks for global snapshots
- confidence interval, Clock readings have a confidence interval-Synchronized clocks for global snapshots
- for global snapshots, Synchronized clocks for global snapshots
- logical (see logical clocks)
- skew, Relying on Synchronized Clocks-Clock readings have a confidence interval, Implementing Linearizable Systems
- slewing, Monotonic clocks
- synchronization and accuracy, Clock Synchronization and Accuracy
- synchronization using GPS, Unreliable Clocks, Clock Synchronization and Accuracy, Clock readings have a confidence interval, Synchronized clocks for global snapshots
- time-of-day versus monotonic clocks, Monotonic Versus Time-of-Day Clocks
- timestamping events, Whose clock are you using, anyway?
- cloud computing, Distributed Data, Cloud Computing and Supercomputing
- Cloudera Impala (see Impala)
- clustered indexes, Storing values within the index
- CODASYL model, The network model
- code generation
- collaborative editing
- column families (Bigtable), Data locality for queries, Column Compression
- column-oriented storage, Column-Oriented Storage-Writing to Column-Oriented Storage
- column compression, Column Compression
- distinction between column families and, Column Compression
- in batch processors, The move toward declarative query languages
- Parquet, Column-Oriented Storage, Archival storage, Philosophy of batch process outputs
- sort order in, Sort Order in Column Storage-Several different sort orders
- vectorized processing, Memory bandwidth and vectorized processing, The move toward declarative query languages
- writing to, Writing to Column-Oriented Storage
- comma-separated values (see CSV)
- command query responsibility segregation (CQRS), Deriving several views from the same event log
- commands (event sourcing), Commands and events
- commits (transactions), Transactions
- commutative operations, Conflict resolution and replication
- compaction
- CompactProtocol encoding (Thrift), Thrift and Protocol Buffers
- compare-and-set operations, Compare-and-set, What Makes a System Linearizable?
- compatibility, Encoding and Evolution, Modes of Dataflow
- compensating transactions, From single-node to distributed atomic commit, Advantages of immutable events, Loosely interpreted constraints
- complex event processing (CEP), Complex event processing
- complexity
- composing data systems (see unbundling databases)
- compute-intensive applications, Reliable, Scalable, and Maintainable Applications, Cloud Computing and Supercomputing
- concatenated indexes, Multi-column indexes
- Concord (stream processor), Stream analytics
- concurrency
- actor programming model, Distributed actor frameworks, Message passing and RPC (see also message-passing)
- bugs from weak transaction isolation, Weak Isolation Levels
- conflict resolution, Handling Write Conflicts, Custom conflict resolution logic
- detecting concurrent writes, Detecting Concurrent Writes-Version vectors
- dual writes, problems with, Keeping Systems in Sync
- happens-before relationship, The “happens-before” relationship and concurrency
- in replicated systems, Problems with Replication Lag-Version vectors, Linearizability-Linearizability and network delays
- lost updates, Preventing Lost Updates
- multi-version concurrency control (MVCC), Implementing snapshot isolation
- optimistic concurrency control, Pessimistic versus optimistic concurrency control
- ordering of operations, What Makes a System Linearizable?, The causal order is not a total order
- reducing, through event logs, Implementing linearizable storage using total order broadcast, Concurrency control, Dataflow: Interplay between state changes and application code
- time and relativity, The “happens-before” relationship and concurrency
- transaction isolation, Isolation
- write skew (transaction isolation), Write Skew and Phantoms-Materializing conflicts
- conflict-free replicated datatypes (CRDTs), Custom conflict resolution logic
- conflicts
- conflict detection, Synchronous versus asynchronous conflict detection
- causal dependencies, The “happens-before” relationship and concurrency, Capturing causal dependencies
- in consensus algorithms, Epoch numbering and quorums
- in leaderless replication, Detecting Concurrent Writes
- in log-based systems, Implementing linearizable storage using total order broadcast, Uniqueness constraints require consensus
- in nonlinearizable systems, Capturing causal dependencies
- in serializable snapshot isolation (SSI), Detecting writes that affect prior reads
- in two-phase commit, A system of promises, Limitations of distributed transactions
- conflict resolution
- determining what is a conflict, What is a conflict?, Uniqueness in log-based messaging
- in multi-leader replication, Handling Write Conflicts-What is a conflict?
- lost updates, Preventing Lost Updates-Conflict resolution and replication
- materializing, Materializing conflicts
- relation to operation ordering, Ordering Guarantees
- write skew (transaction isolation), Write Skew and Phantoms-Materializing conflicts
- congestion (networks)
- consensus, Consistency and Consensus, Fault-Tolerant Consensus-Summary, Glossary
- algorithms, Consensus algorithms and total order broadcast-Epoch numbering and quorums
- cost of, Limitations of consensus
- distributed transactions, Distributed Transactions and Consensus-Summary
- impossibility of, Distributed Transactions and Consensus
- membership and coordination services, Membership and Coordination Services-Membership services
- relation to compare-and-set, Linearizability and quorums, Implementing linearizable storage using total order broadcast, Implementing total order broadcast using linearizable storage, Summary
- relation to replication, Synchronous Versus Asynchronous Replication, Using total order broadcast
- relation to uniqueness constraints, Uniqueness constraints require consensus
- consistency, Consistency, Timeliness and Integrity
- across different databases, Leader failure: Failover, Keeping Systems in Sync, Deriving several views from the same event log, Derived data versus distributed transactions
- causal, Ordering and Causality-Timestamp ordering is not sufficient, Ordering events to capture causality
- consistent prefix reads, Consistent Prefix Reads
- consistent snapshots, Setting Up New Followers, Snapshot Isolation and Repeatable Read-Repeatable read and naming confusion, Synchronized clocks for global snapshots, Initial snapshot, Creating an index
- crash recovery, Making B-trees reliable
- enforcing constraints (see constraints)
- eventual, Problems with Replication Lag, Consistency Guarantees (see also eventual consistency)
- in ACID transactions, Consistency, Maintaining integrity in the face of software bugs
- in CAP theorem, The CAP theorem
- linearizability, Linearizability-Linearizability and network delays
- meanings of, Consistency
- monotonic reads, Monotonic Reads
- of secondary indexes, The need for multi-object transactions, Indexes and snapshot isolation, Atomic Commit and Two-Phase Commit (2PC), Reasoning about dataflows, Creating an index
- ordering guarantees, Ordering Guarantees-Implementing total order broadcast using linearizable storage
- read-after-write, Reading Your Own Writes
- sequential, Implementing linearizable storage using total order broadcast
- strong (see linearizability)
- timeliness and integrity, Timeliness and Integrity
- using quorums, Limitations of Quorum Consistency, Linearizability and quorums
- consistent hashing, Partitioning by Hash of Key
- consistent prefix reads, Consistent Prefix Reads
- constraints (databases), Consistency, Characterizing write skew
- asynchronously checked, Loosely interpreted constraints
- coordination avoidance, Coordination-avoiding data systems
- ensuring idempotence, Operation identifiers
- in log-based systems, Enforcing Constraints-Multi-partition request processing
- in two-phase commit, From single-node to distributed atomic commit, A system of promises
- relation to consensus, Summary, Uniqueness constraints require consensus
- relation to event ordering, Timestamp ordering is not sufficient
- requiring linearizability, Constraints and uniqueness guarantees
- Consul (service discovery), Service discovery
- consumers (message streams), Message brokers, Transmitting Event Streams
- backpressure, Messaging Systems
- consumer offsets in logs, Consumer offsets
- failures, Acknowledgments and redelivery, Consumer offsets
- fan-out, Describing Load, Multiple consumers, Logs compared to traditional messaging
- load balancing, Multiple consumers, Logs compared to traditional messaging
- not keeping up with producers, Messaging Systems, Disk space usage, Making unbundling work
- context switches, Describing Performance, Process Pauses
- convergence (conflict resolution), Converging toward a consistent state-Custom conflict resolution logic, Consistency Guarantees
- coordination
- coordinator (in 2PC), Introduction to two-phase commit
- copy-on-write (B-trees), B-tree optimizations, Indexes and snapshot isolation
- CORBA (Common Object Request Broker Architecture), The problems with remote procedure calls (RPCs)
- correctness, Thinking About Data Systems
- auditability, Trust, but Verify-Tools for auditable data systems
- Byzantine fault tolerance, Byzantine Faults, Tools for auditable data systems
- dealing with partial failures, Faults and Partial Failures
- in log-based systems, Enforcing Constraints-Multi-partition request processing
- of algorithm within system model, Correctness of an algorithm
- of compensating transactions, From single-node to distributed atomic commit
- of consensus, Epoch numbering and quorums
- of derived data, The lambda architecture, Designing for auditability
- of immutable data, Advantages of immutable events
- of personal data, Responsibility and accountability, Privacy and use of data
- of time, Multi-Leader Replication Topologies, Clock Synchronization and Accuracy-Synchronized clocks for global snapshots
- of transactions, Consistency, Aiming for Correctness, Maintaining integrity in the face of software bugs
- timeliness and integrity, Timeliness and Integrity-Coordination-avoiding data systems
- corruption of data
- detecting, The end-to-end argument, Don’t just blindly trust what they promise-Tools for auditable data systems
- due to pathological memory access, Trust, but Verify
- due to radiation, Byzantine Faults
- due to split brain, Leader failure: Failover, The leader and the lock
- due to weak transaction isolation, Weak Isolation Levels
- formalization in consensus, Consensus algorithms and total order broadcast
- integrity as absence of, Timeliness and Integrity
- network packets, Weak forms of lying
- on disks, Durability
- preventing using write-ahead logs, Making B-trees reliable
- recovering from, Philosophy of batch process outputs, Advantages of immutable events
- Couchbase (database)
- CouchDB (database)
- covering indexes, Storing values within the index
- CPUs
- CRDTs (see conflict-free replicated datatypes)
- CREATE INDEX statement (SQL), Other Indexing Structures, Creating an index
- credit rating agencies, Responsibility and accountability
- Crunch (batch processing), Beyond MapReduce, High-Level APIs and Languages
- cryptography
- CSS (Cascading Style Sheets), Declarative Queries on the Web
- CSV (comma-separated values), Data Structures That Power Your Database, JSON, XML, and Binary Variants, A uniform interface
- Curator (ZooKeeper recipes), Locking and leader election, Allocating work to nodes
- curl (Unix tool), Current directions for RPC, Separation of logic and wiring
- cursor stability, Atomic write operations
- Cypher (query language), The Cypher Query Language
D
- data corruption (see corruption of data)
- data cubes, Aggregation: Data Cubes and Materialized Views
- data formats (see encoding)
- data integration, Data Integration-Unifying batch and stream processing, Summary
- data lakes, Diversity of storage
- data locality (see locality)
- data models, Data Models and Query Languages-Summary
- data protection regulations, Legislation and self-regulation
- data systems, Reliable, Scalable, and Maintainable Applications
- about, Thinking About Data Systems
- concerns when designing, Thinking About Data Systems
- future of, The Future of Data Systems-Summary
- heterogeneous, keeping in sync, Keeping Systems in Sync
- maintainability, Maintainability-Evolvability: Making Change Easy
- possible faults in, Transactions
- reliability, Reliability-How Important Is Reliability?
- scalability, Scalability-Approaches for Coping with Load
- unreliable clocks, Unreliable Clocks-Limiting the impact of garbage collection
- data warehousing, Data Warehousing-Stars and Snowflakes: Schemas for Analytics, Glossary
- data-intensive applications, Reliable, Scalable, and Maintainable Applications
- database triggers (see triggers)
- database-internal distributed transactions, Distributed Transactions in Practice, Limitations of distributed transactions, Atomic commit revisited
- databases
- datacenters
- dataflow, Modes of Dataflow-Distributed actor frameworks, Designing Applications Around Dataflow-Stream processors and services
- dataflow engines, Dataflow engines-Discussion of materialization
- Datalog (query language), The Foundation: Datalog
- datatypes
- Datomic (database)
- deadlocks
- Debezium (change data capture), Implementing change data capture
- declarative languages, Query Languages for Data, Glossary
- delays
- deleting data, Limitations of immutability
- denormalization (data representation), Many-to-One and Many-to-Many Relationships, Glossary
- derived data, Derived Data, Stream Processing, Glossary
- from change data capture, Implementing change data capture
- in event sourcing, Deriving current state from the event log
- maintaining derived state through logs, Databases and Streams-API support for change streams, State, Streams, and Immutability-Concurrency control
- observing, by subscribing to streams, End-to-end event streams
- outputs of batch and stream processing, Batch and Stream Processing
- through application code, Application code as a derivation function
- versus distributed transactions, Derived data versus distributed transactions
- deterministic operations, Pros and cons of stored procedures, Faults and Partial Failures, Glossary
- accidental nondeterminism, Fault tolerance
- and fault tolerance, Fault tolerance, Fault tolerance
- and idempotence, Idempotence, Reasoning about dataflows
- computing derived data, Maintaining derived state, Correctness of dataflow systems, Designing for auditability
- in state machine replication, Using total order broadcast, Databases and Streams, Deriving current state from the event log
- joins, Time-dependence of joins
- DevOps, The Unix Philosophy
- differential dataflow, What’s missing?
- dimension tables, Stars and Snowflakes: Schemas for Analytics
- dimensional modeling (see star schemas)
- directed acyclic graphs (DAGs), Graphs and Iterative Processing
- dirty reads (transaction isolation), No dirty reads
- dirty writes (transaction isolation), No dirty writes
- discrimination, Bias and discrimination
- disks (see hard disks)
- distributed actor frameworks, Distributed actor frameworks
- distributed filesystems, MapReduce and Distributed Filesystems
- distributed systems, The Trouble with Distributed Systems-Summary, Glossary
- Byzantine faults, Byzantine Faults-Weak forms of lying
- cloud versus supercomputing, Cloud Computing and Supercomputing
- detecting network faults, Detecting Faults
- faults and partial failures, Faults and Partial Failures-Cloud Computing and Supercomputing
- formalization of consensus, Fault-Tolerant Consensus
- impossibility results, The CAP theorem, Distributed Transactions and Consensus
- issues with failover, Leader failure: Failover
- limitations of distributed transactions, Limitations of distributed transactions
- multi-datacenter, Multi-datacenter operation, The Cost of Linearizability
- network problems, Unreliable Networks-Can we not simply make network delays predictable?
- quorums, relying on, The Truth Is Defined by the Majority
- reasons for using, Distributed Data, Replication
- synchronized clocks, relying on, Relying on Synchronized Clocks-Synchronized clocks for global snapshots
- system models, System Model and Reality-Mapping system models to the real world
- use of clocks and time, Unreliable Clocks
- distributed transactions (see transactions)
- Django (web framework), Handling errors and aborts
- DNS (Domain Name System), Request Routing, Service discovery
- Docker (container manager), Separation of application code and state
- document data model, The Object-Relational Mismatch-Convergence of document and relational databases
- document-partitioned indexes, Partitioning Secondary Indexes by Document, Summary, Building search indexes
- domain-driven design (DDD), Event Sourcing
- DRBD (Distributed Replicated Block Device), Leaders and Followers
- drift (clocks), Clock Synchronization and Accuracy
- Drill (query engine), The divergence between OLTP databases and data warehouses
- Druid (database), Deriving several views from the same event log
- Dryad (dataflow engine), Dataflow engines
- dual writes, problems with, Keeping Systems in Sync, Dataflow: Interplay between state changes and application code
- duplicates, suppression of, Duplicate suppression
- durability (transactions), Durability, Glossary
- duration (time), Unreliable Clocks
- dynamic partitioning, Dynamic partitioning
- dynamically typed languages
- Dynamo-style databases (see leaderless replication)
E
- edges (in graphs), Graph-Like Data Models, Reduce-Side Joins and Grouping
- edit distance (full-text search), Full-text search and fuzzy indexes
- effectively-once semantics, Fault Tolerance, Exactly-once execution of an operation
- elastic systems, Approaches for Coping with Load
- Elasticsearch (search server)
- ElephantDB (database), Key-value stores as batch process output
- Elm (programming language), Designing Applications Around Dataflow, End-to-end event streams
- encodings (data formats), Encoding and Evolution-The Merits of Schemas
- Avro, Avro-Code generation and dynamically typed languages
- binary variants of JSON and XML, Binary encoding
- compatibility, Encoding and Evolution
- defined, Formats for Encoding Data
- JSON, XML, and CSV, JSON, XML, and Binary Variants
- language-specific formats, Language-Specific Formats
- merits of schemas, The Merits of Schemas
- representations of data, Formats for Encoding Data
- Thrift and Protocol Buffers, Thrift and Protocol Buffers-Datatypes and schema evolution
- end-to-end argument, Cloud Computing and Supercomputing, The end-to-end argument-Applying end-to-end thinking in data systems
- enrichment (stream), Stream-table join (stream enrichment)
- Enterprise JavaBeans (EJB), The problems with remote procedure calls (RPCs)
- entities (see vertices)
- epoch (consensus algorithms), Epoch numbering and quorums
- epoch (Unix timestamps), Time-of-day clocks
- equi-joins, Reduce-Side Joins and Grouping
- erasure coding (error correction), MapReduce and Distributed Filesystems
- Erlang OTP (actor framework), Distributed actor frameworks
- error handling
- error-correcting codes, Cloud Computing and Supercomputing, MapReduce and Distributed Filesystems
- Esper (CEP engine), Complex event processing
- etcd (coordination service), Membership and Coordination Services-Membership services
- Ethereum (blockchain), Tools for auditable data systems
- Ethernet (networks), Cloud Computing and Supercomputing, Unreliable Networks, Can we not simply make network delays predictable?
- Etherpad (collaborative editor), Collaborative editing
- ethics, Doing the Right Thing-Legislation and self-regulation
- code of ethics and professional practice, Doing the Right Thing
- legislation and self-regulation, Legislation and self-regulation
- predictive analytics, Predictive Analytics-Feedback loops
- privacy and tracking, Privacy and Tracking-Legislation and self-regulation
- respect, dignity, and agency, Legislation and self-regulation, Summary
- unintended consequences, Doing the Right Thing, Feedback loops
- ETL (extract-transform-load), Data Warehousing, Example: analysis of user activity events, Keeping Systems in Sync, Glossary
- event sourcing, Event Sourcing-Commands and events
- commands and events, Commands and events
- comparison to change data capture, Event Sourcing
- comparison to lambda architecture, The lambda architecture
- deriving current state from event log, Deriving current state from the event log
- immutability and auditability, State, Streams, and Immutability, Designing for auditability
- large, reliable data systems, Operation identifiers, Correctness of dataflow systems
- Event Store (database), Event Sourcing
- event streams (see streams)
- events, Transmitting Event Streams
- deciding on total order of, The limits of total ordering
- deriving views from event log, Deriving several views from the same event log
- difference to commands, Commands and events
- event time versus processing time, Event time versus processing time, Microbatching and checkpointing, Unifying batch and stream processing
- immutable, advantages of, Advantages of immutable events, Designing for auditability
- ordering to capture causality, Ordering events to capture causality
- reads as, Reads are events too
- stragglers, Knowing when you’re ready, The lambda architecture
- timestamp of, in stream processing, Whose clock are you using, anyway?
- EventSource (browser API), Pushing state changes to clients
- eventual consistency, Replication, Problems with Replication Lag, Safety and liveness, Consistency Guarantees
- evolvability, Evolvability: Making Change Easy, Encoding and Evolution
- calling services, Data encoding and evolution for RPC
- graph-structured data, Property Graphs
- of databases, Schema flexibility in the document model, Dataflow Through Databases-Archival storage, Deriving several views from the same event log, Reprocessing data for application evolution
- of message-passing, Distributed actor frameworks
- reprocessing data, Reprocessing data for application evolution, Unifying batch and stream processing
- schema evolution in Avro, The writer’s schema and the reader’s schema
- schema evolution in Thrift and Protocol Buffers, Field tags and schema evolution
- schema-on-read, Schema flexibility in the document model, Encoding and Evolution, The Merits of Schemas
- exactly-once semantics, Exactly-once message processing, Fault Tolerance, Exactly-once execution of an operation
- exclusive mode (locks), Implementation of two-phase locking
- eXtended Architecture transactions (see XA transactions)
- extract-transform-load (see ETL)
F
- Facebook
- fact tables, Stars and Snowflakes: Schemas for Analytics
- failover, Leader failure: Failover, Glossary
- failures
- fan-out (messaging systems), Describing Load, Multiple consumers
- fault tolerance, Reliability-How Important Is Reliability?, Glossary
- abstractions for, Consistency and Consensus
- formalization in consensus, Fault-Tolerant Consensus-Limitations of consensus
- human fault tolerance, Philosophy of batch process outputs
- in batch processing, Bringing related data together in the same place, Philosophy of batch process outputs, Fault tolerance, Fault tolerance
- in log-based systems, Applying end-to-end thinking in data systems, Timeliness and Integrity-Correctness of dataflow systems
- in stream processing, Fault Tolerance-Rebuilding state after a failure
- of distributed transactions, XA transactions-Limitations of distributed transactions
- transaction atomicity, Atomicity, Atomic Commit and Two-Phase Commit (2PC)-Exactly-once message processing
- faults, Reliability
- Byzantine faults, Byzantine Faults-Weak forms of lying
- failures versus, Reliability
- handled by transactions, Transactions
- handling in supercomputers and cloud computing, Cloud Computing and Supercomputing
- hardware, Hardware Faults
- in batch processing versus distributed databases, Designing for frequent faults
- in distributed systems, Faults and Partial Failures-Cloud Computing and Supercomputing
- introducing deliberately, Reliability, Network Faults in Practice
- network faults, Network Faults in Practice-Detecting Faults
- software errors, Software Errors
- tolerating (see fault tolerance)
- federated databases, The meta-database of everything
- fence (CPU instruction), Linearizability and network delays
- fencing (preventing split brain), Leader failure: Failover, The leader and the lock-Fencing tokens
- Fibre Channel (networks), MapReduce and Distributed Filesystems
- field tags (Thrift and Protocol Buffers), Thrift and Protocol Buffers-Field tags and schema evolution
- file descriptors (Unix), A uniform interface
- financial data, Advantages of immutable events
- Firebase (database), API support for change streams
- Flink (processing framework), Dataflow engines-Discussion of materialization
- dataflow APIs, High-Level APIs and Languages
- fault tolerance, Fault tolerance, Microbatching and checkpointing, Rebuilding state after a failure
- Gelly API (graph processing), The Pregel processing model
- integration of batch and stream processing, Batch and Stream Processing, Unifying batch and stream processing
- machine learning, Specialization for different domains
- query optimizer, The move toward declarative query languages
- stream processing, Stream analytics
- flow control, Network congestion and queueing, Messaging Systems, Glossary
- FLP result (on consensus), Distributed Transactions and Consensus
- FlumeJava (dataflow library), MapReduce workflows, High-Level APIs and Languages
- followers, Leaders and Followers, Glossary
- (see also leader-based replication)
- foreign keys, Comparison to document databases, Reduce-Side Joins and Grouping
- forward compatibility, Encoding and Evolution
- forward decay (algorithm), Describing Performance
- Fossil (version control system), Limitations of immutability
- FoundationDB (database)
- fractal trees, B-tree optimizations
- full table scans, Reduce-Side Joins and Grouping
- full-text search, Glossary
- functional reactive programming (FRP), Designing Applications Around Dataflow
- functional requirements, Summary
- futures (asynchronous operations), Current directions for RPC
- fuzzy search (see similarity search)
G
- garbage collection
- genome analysis, Summary, Specialization for different domains
- geographically distributed datacenters, Distributed Data, Reading Your Own Writes, Unreliable Networks, The limits of total ordering
- geospatial indexes, Multi-column indexes
- Giraph (graph processing), The Pregel processing model
- Git (version control system), Custom conflict resolution logic, The causal order is not a total order, Limitations of immutability
- GitHub, postmortems, Leader failure: Failover, Mapping system models to the real world
- global indexes (see term-partitioned indexes)
- GlusterFS (distributed filesystem), MapReduce and Distributed Filesystems
- GNU Coreutils (Linux), Sorting versus in-memory aggregation
- GoldenGate (change data capture), Trigger-based replication, Multi-datacenter operation, Implementing change data capture
- Google
- Bigtable (database)
- Chubby (lock service), Membership and Coordination Services
- Cloud Dataflow (stream processor), Stream analytics, Atomic commit revisited, Unifying batch and stream processing
- Cloud Pub/Sub (messaging), Message brokers compared to databases, Using logs for message storage
- Docs (collaborative editor), Collaborative editing
- Dremel (query engine), The divergence between OLTP databases and data warehouses, Column-Oriented Storage
- FlumeJava (dataflow library), MapReduce workflows, High-Level APIs and Languages
- GFS (distributed filesystem), MapReduce and Distributed Filesystems
- gRPC (RPC framework), Current directions for RPC
- MapReduce (batch processing), Batch Processing
- Pregel (graph processing), The Pregel processing model
- Spanner (see Spanner)
- TrueTime (clock API), Clock readings have a confidence interval
- gossip protocol, Request Routing
- government use of data, Data as assets and power
- GPS (Global Positioning System)
- GraphChi (graph processing), Parallel execution
- graphs, Glossary
- Gremlin (graph query language), Graph-Like Data Models
- grep (Unix tool), Simple Log Analysis
- GROUP BY clause (SQL), GROUP BY
- grouping records in MapReduce, GROUP BY
H
- Hadoop (data infrastructure)
- happens-before relationship, Ordering and Causality
- hard disks
- hardware faults, Hardware Faults
- hash indexes, Hash Indexes
- hash partitioning, Partitioning by Hash of Key, Summary
- HAWQ (database), Specialization for different domains
- HBase (database)
- bug due to lack of fencing, The leader and the lock
- bulk loading, Key-value stores as batch process output
- column-family data model, Data locality for queries, Column Compression
- dynamic partitioning, Dynamic partitioning
- key-range partitioning, Partitioning by Key Range
- log-structured storage, Making an LSM-tree out of SSTables
- request routing, Request Routing
- size-tiered compaction, Performance optimizations
- use of HDFS, Diversity of processing models
- use of ZooKeeper, Membership and Coordination Services
- HDFS (Hadoop Distributed File System), MapReduce and Distributed Filesystems
- HdrHistogram (numerical library), Describing Performance
- head (Unix tool), Simple Log Analysis
- head vertex (property graphs), Property Graphs
- head-of-line blocking, Describing Performance
- heap files (databases), Storing values within the index
- Helix (cluster manager), Request Routing
- heterogeneous distributed transactions, Distributed Transactions in Practice, Limitations of distributed transactions
- heuristic decisions (in 2PC), Recovering from coordinator failure
- Hibernate (object-relational mapper), The Object-Relational Mismatch
- hierarchical model, Are Document Databases Repeating History?
- high availability (see fault tolerance)
- high-frequency trading, Clock Synchronization and Accuracy, Limiting the impact of garbage collection
- high-performance computing (HPC), Cloud Computing and Supercomputing
- hinted handoff, Sloppy Quorums and Hinted Handoff
- histograms, Describing Performance
- Hive (query engine), Beyond MapReduce, High-Level APIs and Languages
- Hollerith machines, Batch Processing
- hopping windows (stream processing), Types of windows
- horizontal scaling (see scaling out)
- HornetQ (messaging), Message brokers, Message brokers compared to databases
- hot spots, Partitioning of Key-Value Data
- hot standbys (see leader-based replication)
- HTTP, use in APIs (see services)
- human errors, Human Errors, Network Faults in Practice, Philosophy of batch process outputs
- HyperDex (database), Multi-column indexes
- HyperLogLog (algorithm), Stream analytics
I
- I/O operations, waiting for, Process Pauses
- IBM
- idempotence, The problems with remote procedure calls (RPCs), Idempotence, Glossary
- immutability
- advantages of, Advantages of immutable events, Designing for auditability
- deriving state from event log, State, Streams, and Immutability-Limitations of immutability
- for crash recovery, Hash Indexes
- in B-trees, B-tree optimizations, Indexes and snapshot isolation
- in event sourcing, Event Sourcing
- inputs to Unix commands, Transparency and experimentation
- limitations of, Limitations of immutability
- Impala (query engine)
- impedance mismatch, The Object-Relational Mismatch
- imperative languages, Query Languages for Data
- in doubt (transaction status), Coordinator failure
- in-memory databases, Keeping everything in memory
- incidents
- cascading failures, Software Errors
- crashes due to leap seconds, Clock Synchronization and Accuracy
- data corruption and financial losses due to concurrency bugs, Weak Isolation Levels
- data corruption on hard disks, Durability
- data loss due to last-write-wins, Converging toward a consistent state, Timestamps for ordering events
- data on disks unreadable, Mapping system models to the real world
- deleted items reappearing, Custom conflict resolution logic
- disclosure of sensitive data due to primary key reuse, Leader failure: Failover
- errors in transaction serializability, Maintaining integrity in the face of software bugs
- gigabit network interface with 1 Kb/s throughput, Summary
- network faults, Network Faults in Practice
- network interface dropping only inbound packets, Network Faults in Practice
- network partitions and whole-datacenter failures, Faults and Partial Failures
- poor handling of network faults, Network Faults in Practice
- sending message to ex-partner, Ordering events to capture causality
- sharks biting undersea cables, Network Faults in Practice
- split brain due to 1-minute packet delay, Leader failure: Failover, Network Faults in Practice
- vibrations in server rack, Describing Performance
- violation of uniqueness constraint, Maintaining integrity in the face of software bugs
- indexes, Data Structures That Power Your Database, Glossary
- and snapshot isolation, Indexes and snapshot isolation
- as derived data, Derived Data, Composing Data Storage Technologies-What’s missing?
- B-trees, B-Trees-B-tree optimizations
- building in batch processes, Building search indexes
- clustered, Storing values within the index
- comparison of B-trees and LSM-trees, Comparing B-Trees and LSM-Trees-Downsides of LSM-trees
- concatenated, Multi-column indexes
- covering (with included columns), Storing values within the index
- creating, Creating an index
- full-text search, Full-text search and fuzzy indexes
- geospatial, Multi-column indexes
- hash, Hash Indexes
- index-range locking, Index-range locks
- multi-column, Multi-column indexes
- partitioning and secondary indexes, Partitioning and Secondary Indexes-Partitioning Secondary Indexes by Term, Summary
- secondary, Other Indexing Structures
- SSTables and LSM-trees, SSTables and LSM-Trees-Performance optimizations
- updating when data changes, Keeping Systems in Sync, Maintaining materialized views
- Industrial Revolution, Remembering the Industrial Revolution
- InfiniBand (networks), Can we not simply make network delays predictable?
- InfiniteGraph (database), Graph-Like Data Models
- InnoDB (storage engine)
- inside-out databases, Designing Applications Around Dataflow
- (see also unbundling databases)
- integrating different data systems (see data integration)
- integrity, Timeliness and Integrity
- Interface Definition Language (IDL), Thrift and Protocol Buffers, Avro
- intermediate state, materialization of, Materialization of Intermediate State-Discussion of materialization
- internet services, systems for implementing, Cloud Computing and Supercomputing
- invariants, Consistency
- inversion of control, Separation of logic and wiring
- IP (Internet Protocol)
- ISDN (Integrated Services Digital Network), Synchronous Versus Asynchronous Networks
- isolation (in transactions), Isolation, Single-Object and Multi-Object Operations, Glossary
- iterative processing, Graphs and Iterative Processing-Parallel execution
J
- Java Database Connectivity (JDBC)
- Java Enterprise Edition (EE), The problems with remote procedure calls (RPCs), Introduction to two-phase commit, XA transactions
- Java Message Service (JMS), Message brokers compared to databases
- Java Transaction API (JTA), Introduction to two-phase commit, XA transactions
- Java Virtual Machine (JVM)
- JavaScript
- Jena (RDF framework), The RDF data model
- Jepsen (fault tolerance testing), Aiming for Correctness
- jitter (network delay), Network congestion and queueing
- joins, Glossary
- JOTM (transaction coordinator), Introduction to two-phase commit
- JSON
- Juttle (query language), Designing Applications Around Dataflow
K
- k-nearest neighbors, Specialization for different domains
- Kafka (messaging), Message brokers, Using logs for message storage
- Kafka Connect (database integration), API support for change streams, Deriving several views from the same event log
- Kafka Streams (stream processor), Stream analytics, Maintaining materialized views
- leader-based replication, Leaders and Followers
- log compaction, Log compaction, Maintaining materialized views
- message offsets, Using logs for message storage, Idempotence
- request routing, Request Routing
- transaction support, Atomic commit revisited
- usage example, Thinking About Data Systems
- Ketama (partitioning library), Partitioning proportionally to nodes
- key-value stores, Data Structures That Power Your Database
- Kryo (Java), Language-Specific Formats
- Kubernetes (cluster manager), Designing for frequent faults, Separation of application code and state
L
- lambda architecture, The lambda architecture
- Lamport timestamps, Lamport timestamps
- Large Hadron Collider (LHC), Summary
- last write wins (LWW), Converging toward a consistent state, Implementing Linearizable Systems
- late binding, Separation of logic and wiring
- latency
- leader-based replication, Leaders and Followers-Trigger-based replication
- (see also replication)
- failover, Leader failure: Failover, The leader and the lock
- handling node outages, Handling Node Outages
- implementation of replication logs
- linearizability of operations, Implementing Linearizable Systems
- locking and leader election, Locking and leader election
- log sequence number, Setting Up New Followers, Consumer offsets
- read-scaling architecture, Problems with Replication Lag
- relation to consensus, Single-leader replication and consensus
- setting up new followers, Setting Up New Followers
- synchronous versus asynchronous, Synchronous Versus Asynchronous Replication
- leaderless replication, Leaderless Replication-Version vectors
- leap seconds, Software Errors, Clock Synchronization and Accuracy
- leases, Process Pauses
- ledgers, Advantages of immutable events
- legacy systems, maintenance of, Maintainability
- less (Unix tool), Transparency and experimentation
- LevelDB (storage engine), Making an LSM-tree out of SSTables
- leveled compaction, Performance optimizations
- Levenshtein automata, Full-text search and fuzzy indexes
- limping (partial failure), Summary
- linearizability, Linearizability-Linearizability and network delays, Glossary
- cost of, The Cost of Linearizability-Linearizability and network delays
- definition, What Makes a System Linearizable?
- implementing with total order broadcast, Implementing linearizable storage using total order broadcast
- in ZooKeeper, Membership and Coordination Services
- of derived data systems, Derived data versus distributed transactions, Timeliness and Integrity
- of different replication methods, Implementing Linearizable Systems-Linearizability and quorums
- relying on, Relying on Linearizability-Cross-channel timing dependencies
- stronger than causal consistency, Linearizability is stronger than causal consistency
- using to implement total order broadcast, Implementing total order broadcast using linearizable storage
- versus serializability, What Makes a System Linearizable?
- LinkedIn
- Azkaban (workflow scheduler), MapReduce workflows
- Databus (change data capture), Trigger-based replication, Implementing change data capture
- Espresso (database), The Object-Relational Mismatch, But what is the writer’s schema?, Different values written at different times, Leaders and Followers, Request Routing
- Helix (cluster manager) (see Helix)
- profile (example), The Object-Relational Mismatch
- reference to company entity (example), Many-to-One and Many-to-Many Relationships
- Rest.li (RPC framework), Current directions for RPC
- Voldemort (database) (see Voldemort)
- Linux, leap second bug, Software Errors, Clock Synchronization and Accuracy
- liveness properties, Safety and liveness
- LMDB (storage engine), B-tree optimizations, Indexes and snapshot isolation
- load
- load balancing (messaging), Multiple consumers
- local indexes (see document-partitioned indexes)
- locality (data access), The Object-Relational Mismatch, Data locality for queries, Glossary
- in batch processing, Distributed execution of MapReduce, Example: analysis of user activity events, Dataflow engines
- in stateful clients, Clients with offline operation, Stateful, offline-capable clients
- in stream processing, Stream-table join (stream enrichment), Rebuilding state after a failure, Stream processors and services, Uniqueness in log-based messaging
- location transparency, The problems with remote procedure calls (RPCs)
- locks, Glossary
- deadlock, Implementation of two-phase locking
- distributed locking, The leader and the lock-Fencing tokens, Locking and leader election
- for transaction isolation
- in snapshot isolation, Implementing snapshot isolation
- in two-phase locking (2PL), Two-Phase Locking (2PL)-Index-range locks
- making operations atomic, Atomic write operations
- performance, Performance of two-phase locking
- preventing dirty writes, Implementing read committed
- preventing phantoms with index-range locks, Index-range locks, Detecting writes that affect prior reads
- read locks (shared mode), Implementing read committed, Implementation of two-phase locking
- shared mode and exclusive mode, Implementation of two-phase locking
- in two-phase commit (2PC)
- materializing conflicts with, Materializing conflicts
- preventing lost updates by explicit locking, Explicit locking
- log sequence number, Setting Up New Followers, Consumer offsets
- logic programming languages, Designing Applications Around Dataflow
- logical clocks, Timestamps for ordering events, Sequence Number Ordering, Ordering events to capture causality
- logical logs, Logical (row-based) log replication
- logs (data structure), Data Structures That Power Your Database, Glossary
- advantages of immutability, Advantages of immutable events
- compaction, Hash Indexes, Performance optimizations, Log compaction, State, Streams, and Immutability
- creating using total order broadcast, Using total order broadcast
- implementing uniqueness constraints, Uniqueness in log-based messaging
- log-based messaging, Partitioned Logs-Replaying old messages
- log-structured storage, Data Structures That Power Your Database-Performance optimizations
- log-structured merge tree (see LSM-trees)
- replication, Leaders and Followers, Implementation of Replication Logs-Trigger-based replication
- scalability limits, The limits of total ordering
- loose coupling, Separation of logic and wiring, Materialization of Intermediate State, Making unbundling work
- lost updates (see updates)
- LSM-trees (indexes), Making an LSM-tree out of SSTables-Performance optimizations
- Lucene (storage engine), Making an LSM-tree out of SSTables
- Luigi (workflow scheduler), MapReduce workflows
- LWW (see last write wins)
M
- machine learning
- MADlib (machine learning toolkit), Specialization for different domains
- magic scaling sauce, Approaches for Coping with Load
- Mahout (machine learning toolkit), Specialization for different domains
- maintainability, Maintainability-Evolvability: Making Change Easy, The Future of Data Systems
- many-to-many relationships
- many-to-one and many-to-many relationships, Many-to-One and Many-to-Many Relationships
- many-to-one relationships, Many-to-One and Many-to-Many Relationships
- MapReduce (batch processing), Batch Processing, MapReduce Job Execution
- accessing external services within job, Example: analysis of user activity events, Key-value stores as batch process output
- comparison to distributed databases
- comparison to stream processing, Processing Streams
- comparison to Unix, Philosophy of batch process outputs
- disadvantages and limitations of, Beyond MapReduce
- fault tolerance, Bringing related data together in the same place, Philosophy of batch process outputs, Fault tolerance
- higher-level tools, MapReduce workflows, High-Level APIs and Languages
- implementation in Hadoop, Distributed execution of MapReduce-MapReduce workflows
- implementation in MongoDB, MapReduce Querying
- machine learning, Specialization for different domains
- map-side processing, Map-Side Joins-MapReduce workflows with map-side joins
- mapper and reducer functions, MapReduce Job Execution
- materialization of intermediate state, Materialization of Intermediate State-Discussion of materialization
- output of batch workflows, The Output of Batch Workflows-Key-value stores as batch process output
- reduce-side processing, Reduce-Side Joins and Grouping-Handling skew
- workflows, MapReduce workflows
- marshalling (see encoding)
- massively parallel processing (MPP), Parallel Query Execution
- master-master replication (see multi-leader replication)
- master-slave replication (see leader-based replication)
- materialization, Glossary
- Maven (Java build tool), The move toward declarative query languages
- Maxwell (change data capture), Implementing change data capture
- mean, Describing Performance
- media monitoring, Search on streams
- median, Describing Performance
- meeting room booking (example), More examples of write skew, Predicate locks, Enforcing Constraints
- membership services, Membership services
- Memcached (caching server), Thinking About Data Systems, Keeping everything in memory
- memory
- memory barrier (CPU instruction), Linearizability and network delays
- MemSQL (database)
- memtable (in LSM-trees), Constructing and maintaining SSTables
- Mercurial (version control system), Limitations of immutability
- merge joins, MapReduce map-side, Map-side merge joins
- mergeable persistent data structures, Custom conflict resolution logic
- merging sorted files, SSTables and LSM-Trees, Distributed execution of MapReduce, Sort-merge joins
- Merkle trees, Tools for auditable data systems
- Mesos (cluster manager), Designing for frequent faults, Separation of application code and state
- message brokers (see messaging systems)
- message-passing, Message-Passing Dataflow-Distributed actor frameworks
- MessagePack (encoding format), Binary encoding
- messages
- messaging systems, Stream Processing-Replaying old messages
- Meteor (web framework), API support for change streams
- microbatching, Microbatching and checkpointing, Batch and Stream Processing
- microservices, Dataflow Through Services: REST and RPC
- Microsoft
- migrating (rewriting) data, Schema flexibility in the document model, Different values written at different times, Deriving several views from the same event log, Reprocessing data for application evolution
- modulus operator (%), How not to do it: hash mod N
- MongoDB (database)
- aggregation pipeline, MapReduce Querying
- atomic operations, Atomic write operations
- BSON, Data locality for queries
- document data model, The Object-Relational Mismatch
- hash partitioning (sharding), Partitioning by Hash of Key
- key-range partitioning, Partitioning by Key Range
- lack of join support, Many-to-One and Many-to-Many Relationships, Convergence of document and relational databases
- leader-based replication, Leaders and Followers
- MapReduce support, MapReduce Querying, Distributed execution of MapReduce
- oplog parsing, Implementing change data capture, API support for change streams
- partition splitting, Dynamic partitioning
- request routing, Request Routing
- secondary indexes, Partitioning Secondary Indexes by Document
- Mongoriver (change data capture), Implementing change data capture
- monitoring, Human Errors, Operability: Making Life Easy for Operations
- monotonic clocks, Monotonic clocks
- monotonic reads, Monotonic Reads
- MPP (see massively parallel processing)
- MSMQ (messaging), XA transactions
- multi-column indexes, Multi-column indexes
- multi-leader replication, Multi-Leader Replication-Multi-Leader Replication Topologies
- multi-object transactions, Single-Object and Multi-Object Operations
- Multi-Paxos (total order broadcast), Consensus algorithms and total order broadcast
- multi-table index cluster tables (Oracle), Data locality for queries
- multi-tenancy, Network congestion and queueing
- multi-version concurrency control (MVCC), Implementing snapshot isolation, Summary
- mutual exclusion, Pessimistic versus optimistic concurrency control
- MySQL (database)
- binlog coordinates, Setting Up New Followers
- binlog parsing for change data capture, Implementing change data capture
- circular replication topology, Multi-Leader Replication Topologies
- consistent snapshots, Setting Up New Followers
- distributed transaction support, XA transactions
- InnoDB storage engine (see InnoDB)
- JSON support, The Object-Relational Mismatch, Convergence of document and relational databases
- leader-based replication, Leaders and Followers
- performance of XA transactions, Distributed Transactions in Practice
- row-based replication, Logical (row-based) log replication
- schema changes in, Schema flexibility in the document model
- snapshot isolation support, Repeatable read and naming confusion
- statement-based replication, Statement-based replication
- Tungsten Replicator (multi-leader replication), Multi-datacenter operation
N
- nanomsg (messaging library), Direct messaging from producers to consumers
- Narayana (transaction coordinator), Introduction to two-phase commit
- NATS (messaging), Message brokers
- near-real-time (nearline) processing, Batch Processing
- (see also stream processing)
- Neo4j (database)
- Nephele (dataflow engine), Dataflow engines
- netcat (Unix tool), Separation of logic and wiring
- Netflix Chaos Monkey, Reliability, Network Faults in Practice
- Network Attached Storage (NAS), Distributed Data, MapReduce and Distributed Filesystems
- network model, The network model
- Network Time Protocol (see NTP)
- networks
- next-key locking, Index-range locks
- nodes (in graphs) (see vertices)
- nodes (processes), Glossary
- noisy neighbors, Network congestion and queueing
- nonblocking atomic commit, Three-phase commit
- nondeterministic operations
- nonfunctional requirements, Summary
- nonrepeatable reads, Snapshot Isolation and Repeatable Read
- normalization (data representation), Many-to-One and Many-to-Many Relationships, Glossary
- NoSQL, The Birth of NoSQL, Unbundling Databases
- Notation3 (N3), Triple-Stores and SPARQL
- npm (package manager), The move toward declarative query languages
- NTP (Network Time Protocol), Unreliable Clocks
- numbers, in XML and JSON encodings, JSON, XML, and Binary Variants
O
- object-relational mapping (ORM) frameworks, The Object-Relational Mismatch
- object-relational mismatch, The Object-Relational Mismatch
- observer pattern, Separation of application code and state
- offline systems, Batch Processing
- offline-first applications, Stateful, offline-capable clients
- offsets
- OLAP (online analytic processing), Transaction Processing or Analytics?, Glossary
- OLTP (online transaction processing), Transaction Processing or Analytics?, Glossary
- one-to-many relationships, The Object-Relational Mismatch
- online systems, Batch Processing
- Oozie (workflow scheduler), MapReduce workflows
- OpenAPI (service definition format), Web services
- OpenStack
- operability, Operability: Making Life Easy for Operations
- operating systems versus databases, Unbundling Databases
- operation identifiers, Operation identifiers, Multi-partition request processing
- operational transformation, Custom conflict resolution logic
- operators, Dataflow engines
- optimistic concurrency control, Pessimistic versus optimistic concurrency control
- Oracle (database)
- distributed transaction support, XA transactions
- GoldenGate (change data capture), Trigger-based replication, Multi-datacenter operation, Implementing change data capture
- lack of serializability, Isolation
- leader-based replication, Leaders and Followers
- multi-table index cluster tables, Data locality for queries
- not preventing write skew, Characterizing write skew
- partitioned indexes, Partitioning Secondary Indexes by Term
- PL/SQL language, Pros and cons of stored procedures
- preventing lost updates, Automatically detecting lost updates
- read committed isolation, Implementing read committed
- Real Application Clusters (RAC), Locking and leader election
- recursive query support, Graph Queries in SQL
- snapshot isolation support, Snapshot Isolation and Repeatable Read, Repeatable read and naming confusion
- TimesTen (in-memory database), Keeping everything in memory
- WAL-based replication, Write-ahead log (WAL) shipping
- XML support, The Object-Relational Mismatch
- ordering, Ordering Guarantees-Implementing total order broadcast using linearizable storage
- Orleans (actor framework), Distributed actor frameworks
- outliers (response time), Describing Performance
- Oz (programming language), Designing Applications Around Dataflow
P
- package managers, The move toward declarative query languages, Separation of application code and state
- packet switching, Can we not simply make network delays predictable?
- packets
- PageRank (algorithm), Graph-Like Data Models, Graphs and Iterative Processing
- paging (see virtual memory)
- ParAccel (database), The divergence between OLTP databases and data warehouses
- parallel databases (see massively parallel processing)
- parallel execution
- Parquet (data format), Column-Oriented Storage, Archival storage
- partial failures, Faults and Partial Failures, Summary
- partial order, The causal order is not a total order
- partitioning, Partitioning-Summary, Glossary
- Paxos (consensus algorithm), Consensus algorithms and total order broadcast
- percentiles, Describing Performance, Glossary
- Percona XtraBackup (MySQL tool), Setting Up New Followers
- performance
- perpetual inconsistency, Timeliness and Integrity
- pessimistic concurrency control, Pessimistic versus optimistic concurrency control
- phantoms (transaction isolation), Phantoms causing write skew
- physical clocks (see clocks)
- pickle (Python), Language-Specific Formats
- Pig (dataflow language), Beyond MapReduce, High-Level APIs and Languages
- Pinball (workflow scheduler), MapReduce workflows
- pipelined execution, Discussion of materialization
- point in time, Unreliable Clocks
- polyglot persistence, The Birth of NoSQL
- polystores, The meta-database of everything
- PostgreSQL (database)
- BDR (multi-leader replication), Multi-datacenter operation
- Bottled Water (change data capture), Implementing change data capture
- Bucardo (trigger-based replication), Trigger-based replication, Custom conflict resolution logic
- distributed transaction support, XA transactions
- foreign data wrappers, The meta-database of everything
- full text search support, Combining Specialized Tools by Deriving Data
- leader-based replication, Leaders and Followers
- log sequence number, Setting Up New Followers
- MVCC implementation, Implementing snapshot isolation, Indexes and snapshot isolation
- PL/pgSQL language, Pros and cons of stored procedures
- PostGIS geospatial indexes, Multi-column indexes
- preventing lost updates, Automatically detecting lost updates
- preventing write skew, Characterizing write skew, Serializable Snapshot Isolation (SSI)
- read committed isolation, Implementing read committed
- recursive query support, Graph Queries in SQL
- representing graphs, Property Graphs
- serializable snapshot isolation (SSI), Serializable Snapshot Isolation (SSI)
- snapshot isolation support, Snapshot Isolation and Repeatable Read, Repeatable read and naming confusion
- WAL-based replication, Write-ahead log (WAL) shipping
- XML and JSON support, The Object-Relational Mismatch, Convergence of document and relational databases
- pre-splitting, Dynamic partitioning
- Precision Time Protocol (PTP), Clock Synchronization and Accuracy
- predicate locks, Predicate locks
- predictive analytics, Predictive Analytics-Feedback loops
- preemption
- Pregel processing model, The Pregel processing model
- primary keys, Other Indexing Structures, Glossary
- primary-secondary replication (see leader-based replication)
- privacy, Privacy and Tracking-Legislation and self-regulation
- probabilistic algorithms, Describing Performance, Stream analytics
- process pauses, Process Pauses-Limiting the impact of garbage collection
- processing time (of events), Reasoning About Time
- producers (message streams), Transmitting Event Streams
- programming languages
- Prolog (language), The Foundation: Datalog
- promises (asynchronous operations), Current directions for RPC
- property graphs, Property Graphs
- Protocol Buffers (data format), Thrift and Protocol Buffers-Datatypes and schema evolution
- provenance of data, Designing for auditability
- publish/subscribe model, Messaging Systems
- publishers (message streams), Transmitting Event Streams
- punch card tabulating machines, Batch Processing
- pure functions, MapReduce Querying
- putting computation near data, Distributed execution of MapReduce
Q
- Qpid (messaging), Message brokers compared to databases
- quality of service (QoS), Can we not simply make network delays predictable?
- Quantcast File System (distributed filesystem), MapReduce and Distributed Filesystems
- query languages, Query Languages for Data-MapReduce Querying
- query optimizers, The relational model, The move toward declarative query languages
- queueing delays (networks), Network congestion and queueing
- queues (messaging), Message brokers
- quorums, Quorums for reading and writing-Limitations of Quorum Consistency, Glossary
R
- R-trees (indexes), Multi-column indexes
- RabbitMQ (messaging), Message brokers, Message brokers compared to databases
- race conditions, Isolation
- Raft (consensus algorithm), Consensus algorithms and total order broadcast
- RAID (Redundant Array of Independent Disks), Hardware Faults, MapReduce and Distributed Filesystems
- railways, schema migration on, Reprocessing data for application evolution
- RAMCloud (in-memory storage), Keeping everything in memory
- ranking algorithms, Graphs and Iterative Processing
- RDF (Resource Description Framework), The semantic web
- RDMA (Remote Direct Memory Access), Cloud Computing and Supercomputing
- read committed isolation level, Read Committed-Implementing read committed
- read path (derived data), Observing Derived State
- read repair (leaderless replication), Read repair and anti-entropy
- read replicas (see leader-based replication)
- read skew (transaction isolation), Snapshot Isolation and Repeatable Read, Summary
- read-after-write consistency, Reading Your Own Writes, Timeliness and Integrity
- read-modify-write cycle, Preventing Lost Updates
- read-scaling architecture, Problems with Replication Lag
- reads as events, Reads are events too
- real-time
- rebalancing partitions, Rebalancing Partitions-Operations: Automatic or Manual Rebalancing, Glossary
- recency guarantee, Linearizability
- recommendation engines
- records, MapReduce Job Execution
- recursive common table expressions (SQL), Graph Queries in SQL
- redelivery (messaging), Acknowledgments and redelivery
- Redis (database)
- redundancy
- Reed–Solomon codes (error correction), MapReduce and Distributed Filesystems
- refactoring, Evolvability: Making Change Easy
- regions (partitioning), Partitioning
- register (data structure), What Makes a System Linearizable?
- relational data model, Relational Model Versus Document Model-Convergence of document and relational databases
- relational databases
- eventual consistency, Problems with Replication Lag
- history, Relational Model Versus Document Model
- leader-based replication, Leaders and Followers
- logical logs, Logical (row-based) log replication
- philosophy compared to Unix, Unbundling Databases, The meta-database of everything
- schema changes, Schema flexibility in the document model, Encoding and Evolution, Different values written at different times
- statement-based replication, Statement-based replication
- use of B-tree indexes, B-Trees
- relationships (see edges)
- reliability, Reliability-How Important Is Reliability?, The Future of Data Systems
- Remote Method Invocation (Java RMI), The problems with remote procedure calls (RPCs)
- remote procedure calls (RPCs), The problems with remote procedure calls (RPCs)-Data encoding and evolution for RPC
- repeatable reads (transaction isolation), Repeatable read and naming confusion
- replicas, Leaders and Followers
- replication, Replication-Summary, Glossary
- and durability, Durability
- chain replication, Synchronous Versus Asynchronous Replication
- conflict resolution and, Conflict resolution and replication
- consistency properties, Problems with Replication Lag-Solutions for Replication Lag
- in distributed filesystems, MapReduce and Distributed Filesystems
- leaderless, Leaderless Replication-Version vectors
- monitoring staleness, Monitoring staleness
- multi-leader, Multi-Leader Replication-Multi-Leader Replication Topologies
- partitioning and, Distributed Data, Partitioning and Replication
- reasons for using, Distributed Data, Replication
- single-leader, Leaders and Followers-Trigger-based replication
- state machine replication, Using total order broadcast, Databases and Streams
- using erasure coding, MapReduce and Distributed Filesystems
- with heterogeneous data systems, Keeping Systems in Sync
- replication logs (see logs)
- reprocessing data, Reprocessing data for application evolution, Unifying batch and stream processing
- request routing, Request Routing-Parallel Query Execution
- resilient systems, Reliability (see also fault tolerance)
- response time
- responsibility and accountability, Responsibility and accountability
- REST (Representational State Transfer), Web services
- RethinkDB (database)
- Riak (database)
- Bitcask storage engine, Hash Indexes
- CRDTs, Custom conflict resolution logic, Merging concurrently written values
- dotted version vectors, Version vectors
- gossip protocol, Request Routing
- hash partitioning, Partitioning by Hash of Key, Fixed number of partitions
- last-write-wins conflict resolution, Last write wins (discarding concurrent writes)
- leaderless replication, Leaderless Replication
- LevelDB storage engine, Making an LSM-tree out of SSTables
- linearizability, lack of, Linearizability and quorums
- multi-datacenter support, Multi-datacenter operation
- preventing lost updates across replicas, Conflict resolution and replication
- rebalancing, Operations: Automatic or Manual Rebalancing
- search feature, Partitioning Secondary Indexes by Term
- secondary indexes, Partitioning Secondary Indexes by Document
- siblings (concurrently written values), Merging concurrently written values
- sloppy quorums, Sloppy Quorums and Hinted Handoff
- ring buffers, Disk space usage
- Ripple (cryptocurrency), Tools for auditable data systems
- rockets, Human Errors, Are Document Databases Repeating History?, Byzantine Faults
- RocksDB (storage engine), Making an LSM-tree out of SSTables
- rollbacks (transactions), Transactions
- rolling upgrades, Hardware Faults, Encoding and Evolution
- routing (see request routing)
- row-oriented storage, Column-Oriented Storage
- rowhammer (memory corruption), Trust, but Verify
- RPCs (see remote procedure calls)
- RubyGems (package manager), The move toward declarative query languages
- rules (Datalog), The Foundation: Datalog
S
- safety and liveness properties, Safety and liveness
- sagas (see compensating transactions)
- Samza (stream processor), Stream analytics, Maintaining materialized views
- sandboxes, Human Errors
- SAP HANA (database), The divergence between OLTP databases and data warehouses
- scalability, Scalability-Approaches for Coping with Load, The Future of Data Systems
- scaling out, Approaches for Coping with Load, Distributed Data (see also shared-nothing architecture)
- scaling up, Approaches for Coping with Load, Distributed Data
- scatter/gather approach, querying partitioned databases, Partitioning Secondary Indexes by Document
- SCD (slowly changing dimension), Time-dependence of joins
- schema-on-read, Schema flexibility in the document model
- schema-on-write, Schema flexibility in the document model
- schemaless databases (see schema-on-read)
- schemas, Glossary
- Avro, Avro-Code generation and dynamically typed languages
- dynamically generated, Dynamically generated schemas
- evolution of, Reprocessing data for application evolution
- flexibility in document model, Schema flexibility in the document model
- for analytics, Stars and Snowflakes: Schemas for Analytics
- for JSON and XML, JSON, XML, and Binary Variants
- merits of, The Merits of Schemas
- schema migration on railways, Reprocessing data for application evolution
- Thrift and Protocol Buffers, Thrift and Protocol Buffers-Datatypes and schema evolution
- traditional approach to design, fallacy in, Deriving several views from the same event log
- searches
- secondaries (see leader-based replication)
- secondary indexes, Other Indexing Structures, Glossary
- secondary sorts, Sort-merge joins
- sed (Unix tool), Simple Log Analysis
- self-describing files, Code generation and dynamically typed languages
- self-joins, Summary
- self-validating systems, A culture of verification
- semantic web, The semantic web
- semi-synchronous replication, Synchronous Versus Asynchronous Replication
- sequence number ordering, Sequence Number Ordering-Timestamp ordering is not sufficient
- sequential consistency, Implementing linearizable storage using total order broadcast
- serializability, Isolation, Weak Isolation Levels, Serializability-Performance of serializable snapshot isolation, Glossary
- Serializable (Java), Language-Specific Formats
- serialization, Formats for Encoding Data
- service discovery, Current directions for RPC, Request Routing, Service discovery
- service level agreements (SLAs), Describing Performance
- service-oriented architecture (SOA), Dataflow Through Services: REST and RPC
- services, Dataflow Through Services: REST and RPC-Data encoding and evolution for RPC
- session windows (stream processing), Types of windows
- sessionization, GROUP BY
- sharding (see partitioning)
- shared mode (locks), Implementation of two-phase locking
- shared-disk architecture, Distributed Data, MapReduce and Distributed Filesystems
- shared-memory architecture, Distributed Data
- shared-nothing architecture, Approaches for Coping with Load, Distributed Data, Glossary
- sharks
- shredding (in relational model), Which data model leads to simpler application code?
- siblings (concurrent values), Merging concurrently written values, Conflict resolution and replication
- similarity search
- single-leader replication (see leader-based replication)
- single-threaded execution, Atomic write operations, Actual Serial Execution
- size-tiered compaction, Performance optimizations
- skew, Glossary
- slaves (see leader-based replication)
- sliding windows (stream processing), Types of windows
- sloppy quorums, Sloppy Quorums and Hinted Handoff
- slowly changing dimension (data warehouses), Time-dependence of joins
- smearing (leap second adjustments), Clock Synchronization and Accuracy
- snapshots (databases)
- snowflake schemas, Stars and Snowflakes: Schemas for Analytics
- SOAP, Web services
- software bugs, Software Errors
- solid state drives (SSDs)
- Solr (search server)
- sort (Unix tool), Simple Log Analysis, Sorting versus in-memory aggregation, The Unix Philosophy
- sort-merge joins (MapReduce), Sort-merge joins
- Sorted String Tables (see SSTables)
- sorting
- source of truth (see systems of record)
- Spanner (database)
- Spark (processing framework), Dataflow engines-Discussion of materialization
- SPARQL (query language), The SPARQL query language
- spatial algorithms, Specialization for different domains
- split brain, Leader failure: Failover, Glossary
- spreadsheets, dataflow programming capabilities, Designing Applications Around Dataflow
- SQL (Structured Query Language), Simplicity: Managing Complexity, Relational Model Versus Document Model, Query Languages for Data
- advantages and limitations of, Diversity of processing models
- distributed query execution, MapReduce Querying
- graph queries in, Graph Queries in SQL
- isolation levels standard, issues with, Repeatable read and naming confusion
- query execution on Hadoop, Diversity of processing models
- résumé (example), The Object-Relational Mismatch
- SQL injection vulnerability, Byzantine Faults
- SQL on Hadoop, The divergence between OLTP databases and data warehouses
- statement-based replication, Statement-based replication
- stored procedures, Pros and cons of stored procedures
- SQL Server (database)
- data warehousing support, The divergence between OLTP databases and data warehouses
- distributed transaction support, XA transactions
- leader-based replication, Leaders and Followers
- preventing lost updates, Automatically detecting lost updates
- preventing write skew, Characterizing write skew, Implementation of two-phase locking
- read committed isolation, Implementing read committed
- recursive query support, Graph Queries in SQL
- serializable isolation, Implementation of two-phase locking
- snapshot isolation support, Snapshot Isolation and Repeatable Read
- T-SQL language, Pros and cons of stored procedures
- XML support, The Object-Relational Mismatch
- SQLstream (stream analytics), Complex event processing
- SSDs (see solid state drives)
- SSTables (storage format), SSTables and LSM-Trees-Performance optimizations
- staleness (old data), Reading Your Own Writes
- standbys (see leader-based replication)
- star replication topologies, Multi-Leader Replication Topologies
- star schemas, Stars and Snowflakes: Schemas for Analytics
- Star Wars analogy (event time versus processing time), Event time versus processing time
- state
- derived from log of immutable events, State, Streams, and Immutability
- deriving current state from the event log, Deriving current state from the event log
- interplay between state changes and application code, Dataflow: Interplay between state changes and application code
- maintaining derived state, Maintaining derived state
- maintenance by stream processor in stream-stream joins, Stream-stream join (window join)
- observing derived state, Observing Derived State-Multi-partition data processing
- rebuilding after stream processor failure, Rebuilding state after a failure
- separation of application code and, Separation of application code and state
- state machine replication, Using total order broadcast, Databases and Streams
- statement-based replication, Statement-based replication
- statically typed languages
- statistical and numerical algorithms, Specialization for different domains
- StatsD (metrics aggregator), Direct messaging from producers to consumers
- stdin, stdout, A uniform interface, Separation of logic and wiring
- Stellar (cryptocurrency), Tools for auditable data systems
- stock market feeds, Direct messaging from producers to consumers
- STONITH (Shoot The Other Node In The Head), Leader failure: Failover
- stop-the-world (see garbage collection)
- storage
- Storage Area Network (SAN), Distributed Data, MapReduce and Distributed Filesystems
- storage engines, Storage and Retrieval-Summary
- stored procedures, Trigger-based replication, Encapsulating transactions in stored procedures-Pros and cons of stored procedures, Glossary
- Storm (stream processor), Stream analytics
- straggler events, Knowing when you’re ready, The lambda architecture
- stream processing, Processing Streams-Summary, Glossary
- accessing external services within job, Stream-table join (stream enrichment), Microbatching and checkpointing, Idempotence, Exactly-once execution of an operation
- combining with batch processing
- comparison to batch processing, Processing Streams
- complex event processing (CEP), Complex event processing
- fault tolerance, Fault Tolerance-Rebuilding state after a failure
- for data integration, Batch and Stream Processing-Unifying batch and stream processing
- maintaining derived state, Maintaining derived state
- maintenance of materialized views, Maintaining materialized views
- messaging systems (see messaging systems)
- reasoning about time, Reasoning About Time-Types of windows
- relation to databases (see streams)
- relation to services, Stream processors and services
- search on streams, Search on streams
- single-threaded execution, Logs compared to traditional messaging, Concurrency control
- stream analytics, Stream analytics
- stream joins, Stream Joins-Time-dependence of joins
- streams, Stream Processing-Replaying old messages
- end-to-end, pushing events to clients, End-to-end event streams
- messaging systems (see messaging systems)
- processing (see stream processing)
- relation to databases, Databases and Streams-Limitations of immutability (see also changelogs)
- API support for change streams, API support for change streams
- change data capture, Change Data Capture-API support for change streams
- derivative of state by time, State, Streams, and Immutability
- event sourcing, Event Sourcing-Commands and events
- keeping systems in sync, Keeping Systems in Sync
- philosophy of immutable events, State, Streams, and Immutability-Limitations of immutability
- topics, Transmitting Event Streams
- strict serializability, What Makes a System Linearizable?
- strong consistency (see linearizability)
- strong one-copy serializability, What Makes a System Linearizable?
- subjects, predicates, and objects (in triple-stores), Triple-Stores and SPARQL
- subscribers (message streams), Transmitting Event Streams
- supercomputers, Cloud Computing and Supercomputing
- surveillance, Surveillance
- Swagger (service definition format), Web services
- swapping to disk (see virtual memory)
- synchronous networks, Synchronous Versus Asynchronous Networks, Glossary
- synchronous replication, Synchronous Versus Asynchronous Replication, Glossary
- system models, Knowledge, Truth, and Lies, System Model and Reality-Mapping system models to the real world
- systems of record, Derived Data, Glossary
- systems thinking, Feedback loops
T
- t-digest (algorithm), Describing Performance
- table-table joins, Table-table join (materialized view maintenance)
- Tableau (data visualization software), Diversity of processing models
- tail (Unix tool), Using logs for message storage
- tail vertex (property graphs), Property Graphs
- Tajo (query engine), The divergence between OLTP databases and data warehouses
- Tandem NonStop SQL (database), Partitioning
- TCP (Transmission Control Protocol), Cloud Computing and Supercomputing
- comparison to circuit switching, Can we not simply make network delays predictable?
- comparison to UDP, Network congestion and queueing
- connection failures, Detecting Faults
- flow control, Network congestion and queueing, Messaging Systems
- packet checksums, Weak forms of lying, The end-to-end argument, Trust, but Verify
- reliability and duplicate suppression, Duplicate suppression
- retransmission timeouts, Network congestion and queueing
- use for transaction sessions, Single-Object and Multi-Object Operations
- telemetry (see monitoring)
- Teradata (database), The divergence between OLTP databases and data warehouses, Partitioning
- term-partitioned indexes, Partitioning Secondary Indexes by Term, Summary
- termination (consensus), Fault-Tolerant Consensus
- Terrapin (database), Key-value stores as batch process output
- Tez (dataflow engine), Dataflow engines-Discussion of materialization
- thrashing (out of memory), Process Pauses
- threads (concurrency)
- three-phase commit, Three-phase commit
- Thrift (data format), Thrift and Protocol Buffers-Datatypes and schema evolution
- throughput, Describing Performance, Batch Processing
- TIBCO, Message brokers
- time
- time-of-day clocks, Time-of-day clocks
- timeliness, Timeliness and Integrity
- timeouts, Unreliable Networks, Glossary
- timestamps, Sequence Number Ordering
- assigning to events in stream processing, Whose clock are you using, anyway?
- for read-after-write consistency, Reading Your Own Writes
- for transaction ordering, Synchronized clocks for global snapshots
- insufficiency for enforcing constraints, Timestamp ordering is not sufficient
- key range partitioning by, Partitioning by Key Range
- Lamport, Lamport timestamps
- logical, Ordering events to capture causality
- ordering events, Timestamps for ordering events, Noncausal sequence number generators
- Titan (database), Graph-Like Data Models
- tombstones, Hash Indexes, Merging concurrently written values, Log compaction
- topics (messaging), Message brokers, Transmitting Event Streams
- total order, The causal order is not a total order, Glossary
- total order broadcast, Total Order Broadcast-Implementing total order broadcast using linearizable storage, The limits of total ordering, Uniqueness in log-based messaging
- tracking behavioral data, Privacy and Tracking
- transaction coordinator (see coordinator)
- transaction manager (see coordinator)
- transaction processing, Relational Model Versus Document Model, Transaction Processing or Analytics?-Stars and Snowflakes: Schemas for Analytics
- transactions, Transactions-Summary, Glossary
- ACID properties of, The Meaning of ACID
- compensating (see compensating transactions)
- concept of, The Slippery Concept of a Transaction
- distributed transactions, Distributed Transactions and Consensus-Limitations of distributed transactions
- avoiding, Derived data versus distributed transactions, Making unbundling work, Enforcing Constraints-Coordination-avoiding data systems
- failure amplification, Limitations of distributed transactions, Maintaining derived state
- in doubt/uncertain status, Coordinator failure, Holding locks while in doubt
- two-phase commit, Atomic Commit and Two-Phase Commit (2PC)-Three-phase commit
- use of, Distributed Transactions in Practice-Exactly-once message processing
- XA transactions, XA transactions-Limitations of distributed transactions
- OLTP versus analytics queries, The Output of Batch Workflows
- purpose of, Transactions
- serializability, Serializability-Performance of serializable snapshot isolation
- single-object and multi-object, Single-Object and Multi-Object Operations-Handling errors and aborts
- snapshot isolation (see snapshots)
- weak isolation levels, Weak Isolation Levels-Materializing conflicts
- transitive closure (graph algorithm), Graphs and Iterative Processing
- trie (data structure), Full-text search and fuzzy indexes
- triggers (databases), Trigger-based replication, Transmitting Event Streams
- triple-stores, Triple-Stores and SPARQL-The SPARQL query language
- tumbling windows (stream processing), Types of windows
- tuple spaces (programming model), Dataflow: Interplay between state changes and application code
- Turtle (RDF data format), Triple-Stores and SPARQL
- Twitter
- two-phase commit (2PC), Distributed Transactions and Consensus, Introduction to two-phase commit-Coordinator failure, Glossary
- two-phase locking (2PL), Two-Phase Locking (2PL)-Index-range locks, What Makes a System Linearizable?, Glossary
- type checking, dynamic versus static, Schema flexibility in the document model
U
- UDP (User Datagram Protocol)
- unbounded datasets, Stream Processing, Glossary
- unbounded delays, Glossary
- unbundling databases, Unbundling Databases-Multi-partition data processing
- uncertain (transaction status) (see in doubt)
- uniform consensus, Fault-Tolerant Consensus
- uniform interfaces, A uniform interface
- union type (in Avro), Schema evolution rules
- uniq (Unix tool), Simple Log Analysis
- uniqueness constraints
- Unix philosophy, The Unix Philosophy-Transparency and experimentation
- UPDATE statement (SQL), Schema flexibility in the document model
- updates
V
- validity (consensus), Fault-Tolerant Consensus
- vBuckets (partitioning), Partitioning
- vector clocks, Version vectors (see also version vectors)
- vectorized processing, Memory bandwidth and vectorized processing, The move toward declarative query languages
- verification, Trust, but Verify-Tools for auditable data systems
- version control systems, reliance on immutable data, Limitations of immutability
- version vectors, Multi-Leader Replication Topologies, Version vectors
- Vertica (database), The divergence between OLTP databases and data warehouses
- vertical scaling (see scaling up)
- vertices (in graphs), Graph-Like Data Models
- Viewstamped Replication (consensus algorithm), Consensus algorithms and total order broadcast
- virtual machines, Distributed Data
- virtual memory
- VisiCalc (spreadsheets), Designing Applications Around Dataflow
- vnodes (partitioning), Partitioning
- Voice over IP (VoIP), Network congestion and queueing
- Voldemort (database)
- VoltDB (database)
W
- WAL (write-ahead log), Making B-trees reliable
- web services (see services)
- Web Services Description Language (WSDL), Web services
- webhooks, Direct messaging from producers to consumers
- webMethods (messaging), Message brokers
- WebSocket (protocol), Pushing state changes to clients
- windows (stream processing), Stream analytics, Reasoning About Time-Types of windows
- winners (conflict resolution), Converging toward a consistent state
- WITH RECURSIVE syntax (SQL), Graph Queries in SQL
- workflows (MapReduce), MapReduce workflows
- working set, Sorting versus in-memory aggregation
- write amplification, Advantages of LSM-trees
- write path (derived data), Observing Derived State
- write skew (transaction isolation), Write Skew and Phantoms-Materializing conflicts
- write-ahead log (WAL), Making B-trees reliable, Write-ahead log (WAL) shipping
- writes (database)
- WS-* framework, Web services
- WS-AtomicTransaction (2PC), Introduction to two-phase commit
Z
- Zab (consensus algorithm), Consensus algorithms and total order broadcast
- ZeroMQ (messaging library), Direct messaging from producers to consumers
- ZooKeeper (coordination service), Membership and Coordination Services-Membership services
- generating fencing tokens, Fencing tokens, Using total order broadcast, Membership and Coordination Services
- linearizable operations, Implementing Linearizable Systems, Implementing linearizable storage using total order broadcast
- locks and leader election, Locking and leader election
- service discovery, Service discovery
- use for partition assignment, Request Routing, Allocating work to nodes
- use of Zab algorithm, Using total order broadcast, Distributed Transactions and Consensus, Consensus algorithms and total order broadcast