Index
A
- abusive client behavior, Dealing with Abusive Client Behavior
- access control, Enforcement of Policies and Procedures
- ACID datastore semantics, Managing Critical State: Distributed Consensus for Reliability, Choosing a Strategy for Superior Data Integrity
- acknowledgments, Acknowledgments
- adaptive throttling, Client-Side Throttling
- Ads Database, Automate Yourself Out of a Job: Automate ALL the Things!
- AdSense, Other service metrics
- aggregate availability equation, Measuring Service Risk, Availability Table
- aggregation, Rule Evaluation, Aggregation
- agility vs. stability, System Stability Versus Agility
  - (see also software simplicity)
- Alertmanager service, Alerting
- alerts
- anacron, Reliability Perspective
- Apache Mesos, Managing Machines
- App Engine, Case Study
- archives vs. backups, Backups Versus Archives
- asynchronous distributed consensus, How Distributed Consensus Works
- atomic broadcast systems, Reliable Distributed Queuing and Messaging
- attribution policy, Using Code Examples
- automation
  - applying to cluster turnups, Soothing the Pain: Applying Automation to Cluster Turnups-Service-Oriented Cluster-Turnup
  - vs. autonomous systems, The Evolution of Automation at Google
  - benefits of, The Value of Automation-The Value for Google SRE
  - best practices for change management, Change Management
  - Borg example, Borg: Birth of the Warehouse-Scale Computer
  - cross-industry lessons, Automating Away Repetitive Work and Operational Overhead
  - database example, Automate Yourself Out of a Job: Automate ALL the Things!
  - Diskerase example, Recommendations
  - focus on reliability, Reliability Is the Fundamental Feature
  - Google's approach to, The Value for Google SRE
  - hierarchy of automation classes, A Hierarchy of Automation Classes
  - recommendations for enacting, Recommendations
  - specialized application of, The Inclination to Specialize
  - use cases for, The Use Cases for Automation-A Hierarchy of Automation Classes
- automation tools, Testing Scalable Tools
- autonomous systems, The Evolution of Automation at Google
- Auxon case study, Auxon Case Study: Project Background and Problem Space-Our Solution: Intent-Based Capacity Planning, Introduction to Auxon
- availability, Indicators, Choosing a Strategy for Superior Data Integrity
  - (see also service availability)
- availability table, Availability Table
B
- B4 network, Hardware
- backend servers, Our Software Infrastructure, Load Balancing in the Datacenter
- backends, fake, Production Probes
- backups (see data integrity)
- Bandwidth Enforcer (BwE), Networking
- barrier tools, Testing Scalable Tools, Testing Disaster, Distributed Coordination and Locking Services
- batch processing pipelines, First Layer: Soft Deletion
- batching, Eliminate Batch Load, Batching, Drawbacks of Periodic Pipelines in Distributed Environments
- Bazel, Building
- best practices
  - capacity planning, Capacity Planning
  - for change management, Change Management
  - error budgets, Error Budgets
  - failures, Fail Sanely
  - feedback, Introducing a Postmortem Culture
  - for incident management, In Summary
  - monitoring, Monitoring
  - overloads and failure, Overloads and Failure
  - postmortems, Google’s Postmortem Philosophy-Collaborate and Share Knowledge, Postmortems
  - reward systems, Introducing a Postmortem Culture
  - role of release engineers in, The Role of a Release Engineer
  - rollouts, Progressive Rollouts
  - service level objectives, Define SLOs Like a User
  - team building, SRE Teams
- bibliography, Bibliography
- Big Data, Origin of the Pipeline Design Pattern
- Bigtable, Storage, Target level of availability, Bigtable SRE: A Tale of Over-Alerting
- bimodal latency, Bimodal latency
- black-box monitoring, Definitions, Black-Box Versus White-Box, Black-Box Monitoring
- blameless cultures, Google’s Postmortem Philosophy
- Blaze build tool, Building
- Blobstore, Storage, Choosing a Strategy for Superior Data Integrity
- Borg, Hardware-Managing Machines, Borg: Birth of the Warehouse-Scale Computer, Drawbacks of Periodic Pipelines in Distributed Environments
- Borg Naming Service (BNS), Managing Machines
- Borgmon, The Rise of Borgmon-Ten Years On…
- break-glass mechanisms, Expect Testing Fail
- build environments, Creating a Test and Build Environment
- business continuity, Ensuring Business Continuity
- Byzantine failures, How Distributed Consensus Works, Number of Replicas
C
- campuses, Hardware
- canarying, Motivation for Error Budgets, What we learned, Canary test, Gradual and Staged Rollouts
- CAP theorem, Managing Critical State: Distributed Consensus for Reliability
- CAPA (corrective and preventative action), Postmortem Culture
- capacity planning
  - approaches to, Practices
  - best practices for, Capacity Planning
  - Diskerase example, Recommendations
  - distributed consensus systems and, Capacity and Load Balancing
  - drawbacks of “queries per second”, The Pitfalls of “Queries per Second”
  - drawbacks of traditional plans, Brittle by nature
  - further reading on, Practices
  - intent-based (see intent-based capacity planning)
  - mandatory steps for, Demand Forecasting and Capacity Planning
  - preventing server overload with, Preventing Server Overload
  - product launches and, Capacity Planning
  - traditional approach to, Traditional Capacity Planning
- cascading failures
- change management, Change Management
- change-induced emergencies, Change-Induced Emergency-What we learned
- changelists (CLs), Our Development Environment
- Chaos Monkey, Testing Disaster
- checkpoint state, Testing Disaster
- cherry picking tactic, Hermetic Builds
- Chubby lock service, Lock Service, System Architecture Patterns for Distributed Consensus
- client tasks, Load Balancing in the Datacenter
- client-side throttling, Client-Side Throttling
- clients, Our Software Infrastructure
- clock drift, Managing Critical State: Distributed Consensus for Reliability
- Clos network fabric, Hardware
- cloud environment
- clusters
- code samples, Using Code Examples
- cognitive flow state, Cognitive Flow State
- cold caching, Slow Startup and Cold Caching
- colocation facilities (colos), Recommendations
- Colossus, Storage
- command posts, A Recognized Command Post
- communication and collaboration
- company-wide resilience testing, Practices
- compensation structure, Compensation
- computational optimization, Our Solution: Intent-Based Capacity Planning
- configuration management, Configuration Management, Change-Induced Emergency, Integration, Process Updates
- configuration tests, Configuration test
- consensus algorithms
  - Egalitarian Paxos, Stable Leaders
  - Fast Paxos, Reasoning About Performance: Fast Paxos, The Use of Paxos
  - improving performance of, Distributed Consensus Performance
  - Multi-Paxos, Disk Access
  - Paxos, How Distributed Consensus Works, Disk Access
  - Raft, Multi-Paxos: Detailed Message Flow, Stable Leaders
  - Zab, Stable Leaders
  - (see also distributed consensus systems)
- consistency
- consistent hashing, Load Balancing at the Virtual IP Address
- constraints, Laborious and imprecise
- Consul, System Architecture Patterns for Distributed Consensus
- consumer services, identifying risk tolerance of, Identifying the Risk Tolerance of Consumer Services-Other service metrics
- continuous build and deployment
  - Blaze build tool, Building
  - branching, Branching
  - build targets, Building
  - configuration management, Configuration Management
  - deployment, Deployment
  - packaging, Packaging
  - Rapid release system, Continuous Build and Deployment, Rapid
  - testing, Testing
  - typical release process, Rapid
- contributors, Acknowledgments
- coroutines, Origin of the Pipeline Design Pattern
- corporate network security, Practices
- correctness guarantees, Workflow Correctness Guarantees
- correlation vs. causation, Theory
- costs
- CPU consumption, The Pitfalls of “Queries per Second”, CPU, Overload Behavior and Load Tests
- crash-fail vs. crash-recover algorithms, How Distributed Consensus Works
- cron
  - at large scale, Running Large Cron
  - building at Google, Building Cron at Google-Running Large Cron
  - idempotency, Cron Jobs and Idempotency
  - large-scale deployment of, Cron at Large Scale
  - leader and followers, The leader
  - overview of, Summary
  - Paxos algorithm and, The Use of Paxos-Storing the State
  - purpose of, Distributed Periodic Scheduling with Cron
  - reliability applications of, Reliability Perspective
  - resolving partial failures, Resolving partial failures
  - storing state, Storing the State
  - tracking cron job state, Tracking the State of Cron Jobs
  - uses for, Cron
- cross-industry lessons
- current state, exposing, Examine
D
- D storage layer, Storage
- dashboards
- data analysis, with Outalator, Analysis
- data integrity
  - backups vs. archives, Backups Versus Archives
  - case studies in, Case Studies-Addressing the root cause
  - conditions leading to failure, Types of Failures That Lead to Data Loss
  - defined, Data Integrity: What You Read Is What You Wrote
  - expanded definition of, Data Integrity’s Strict Requirements
  - failure modes, The 24 Combinations of Data Integrity Failure Modes
  - from users’ perspective, Data Integrity Is the Means; Data Availability Is the Goal
  - overview of, Conclusion
  - selecting strategy for, Choosing a Strategy for Superior Data Integrity, Challenges faced by cloud developers
  - SRE approach to, How Google SRE Faces the Challenges of Data Integrity-Knowing That Data Recovery Will Work
  - SRE objectives for, Google SRE Objectives in Maintaining Data Integrity and Availability-Retention
  - SRE principles applied to, General Principles of SRE as Applied to Data Integrity-Defense in Depth
  - strict requirements, Data Integrity’s Strict Requirements
  - technical challenges of, Requirements of the Cloud Environment in Perspective
- data processing pipelines
  - business continuity and, Ensuring Business Continuity
  - challenges of uneven work distribution, Trouble Caused By Uneven Work Distribution
  - challenges to periodic pattern, Challenges with the Periodic Pipeline Pattern
  - drawbacks of periodic, Drawbacks of Periodic Pipelines in Distributed Environments-Moiré Load Pattern
  - effect of big data on, Initial Effect of Big Data on the Simple Pipeline Pattern
  - monitoring problems, Monitoring Problems in Periodic Pipelines-Moiré Load Pattern
  - origin of, Origin of the Pipeline Design Pattern
  - overview of, Summary and Concluding Remarks
  - pipeline depth, Initial Effect of Big Data on the Simple Pipeline Pattern
  - simple vs. multiphase pipelines, Initial Effect of Big Data on the Simple Pipeline Pattern
  - Workflow system, Introduction to Google Workflow, Workflow Correctness Guarantees
- data recovery, Knowing That Data Recovery Will Work
- datacenters
- datastores
- Decider, Automate Yourself Out of a Job: Automate ALL the Things!
- decision-making skills, Structured and Rational Decision Making
- defense in depth, for data integrity, The 24 Combinations of Data Integrity Failure Modes, Sunday, February 27, 2011, late in the evening, Defense in Depth
- demand forecasting, Demand Forecasting and Capacity Planning
- dependency hierarchies, Setting Reasonable Expectations for Monitoring, Dependencies among resources
- deployment, Deployment
  - (see also continuous build and deployment)
- development environment, Our Development Environment
- development/ops split, The Sysadmin Approach to Service Management
- DevOps, Google’s Approach to Service Management: Site Reliability Engineering
- Direct Server Response (DSR), Load Balancing at the Virtual IP Address
- disaster recovery tools, Testing Disaster
- disaster role playing, Disaster Role Playing
- disaster testing, Preparedness and Disaster Testing-Defense in Depth and Breadth
- disk access, Disk Access
- Diskerase process, Recommendations
- distractibility, Distractibility
- distributed consensus systems
  - benefits of, Managing Critical State: Distributed Consensus for Reliability
  - coordination, use in, Distributed Coordination and Locking Services
  - deploying, Deploying Distributed Consensus-Based Systems-Quorum composition
  - locking, use in, Managing Critical State: Distributed Consensus for Reliability
  - monitoring, Monitoring Distributed Consensus Systems
  - need for, Managing Critical State: Distributed Consensus for Reliability
  - overview of, Conclusion
  - patterns for, System Architecture Patterns for Distributed Consensus-Reliable Distributed Queuing and Messaging
  - performance of, Distributed Consensus Performance-Disk Access
  - principles, How Distributed Consensus Works
  - quorum composition, Quorum composition
  - quorum leasing technique, Quorum Leases
  - (see also consensus algorithms)
- distributed periodic scheduling (see cron)
- DNS (Domain Name System)
- DoubleClick for Publishers (DFP), Case Study: Migrating DFP to F1
- drains, Planned Changes, Drains, or Turndowns
- DTSS communication files, Origin of the Pipeline Design Pattern
- dueling proposers situation, Multi-Paxos: Detailed Message Flow
- durability, Indicators
E
- early detection for data integrity, Third Layer: Early Detection
  - (see also data integrity)
- Early Engagement Model, Evolving the Simple PRR Model: Early Engagement-Disengaging from a service
- “embarrassingly parallel” algorithms, Trouble Caused By Uneven Work Distribution
- embedded engineers, Embedding an SRE to Recover from Operational Overload-Conclusion
- emergency preparedness, Sunday, February 27, 2011, late in the evening
- emergency response
  - change-induced emergencies, Change-Induced Emergency-What we learned
  - essential elements of, Emergency Response
  - Five Whys, Ask “what,” “where,” and “why”, Example Postmortem
  - guidelines for, Emergency Response
  - initial response, What to Do When Systems Break
  - lessons learned, Keep a History of Outages
  - overview of, Conclusion
  - process-induced emergencies, Process-Induced Emergency
  - solution availability, All Problems Have Solutions
  - test-induced emergencies, Test-Induced Emergency
- encapsulation, Load Balancing at the Virtual IP Address
- endpoints, in debugging, Examine
- engagements (see SRE engagement model)
- error budgets
- error rates, Indicators, The Four Golden Signals
- Escalator, Escalator
- ETL pipelines, Origin of the Pipeline Design Pattern
- eventual consistency, Managing Critical State: Distributed Consensus for Reliability
- executor load average, Utilization Signals
F
- failures, best practices for, Fail Sanely
  - (see also cascading failures)
- fake backends, Production Probes
- false-positive alerts, Tagging
- feature flag frameworks, Feature Flag Frameworks
- file descriptors, File descriptors
- Five Whys, Ask “what,” “where,” and “why”, Example Postmortem
- flow control, A Simple Approach to Unhealthy Tasks: Flow Control
- FLP impossibility result, How Distributed Consensus Works
- Flume, Challenges with the Periodic Pipeline Pattern
- fragmentation, Load Balancing at the Virtual IP Address
G
- gated operations, Enforcement of Policies and Procedures
- Generic Routing Encapsulation (GRE), Load Balancing at the Virtual IP Address
- GFE (Google Frontend), Life of a Request, Load Balancing in the Datacenter
- GFS (Google File System), Detecting Inconsistencies with Prodtest, Highly Available Processing Using Leader Election, Extended Infrastructure-Tracking the State of Cron Jobs, Overarching Layer: Replication
- global overload, Per-Customer Limits
- Global Software Load Balancer (GSLB), Networking
- Gmail, Gmail: Predictable, Scriptable Responses from Humans, Gmail—February, 2011: Restore from GTape
- Google Apps for Work, Target level of availability
- Google Compute Engine, Indicators
- Google production environment
- Google Workflow system
- graceful degradation, Load Shedding and Graceful Degradation
- GTape, Gmail—February, 2011: Restore from GTape
H
- Hadoop Distributed File System (HDFS), Storage
- handoffs, Clear, Live Handoff
- “hanging chunk” problem, Trouble Caused By Uneven Work Distribution
- hardware
- health checks, Stop Health Check Failures/Deaths
- healthcare.gov, Practices
- hermetic builds, Hermetic Builds
- hierarchical quorums, Quorum composition
- high-velocity approach, Principles, High Velocity
- hotspotting, Picking the Right Subset
I
- idempotent operations, Resolving Inconsistencies Idempotently, Cron Jobs and Idempotency
- incident management
  - best practices for, In Summary
  - effective, Managing Incidents
  - formal protocols for, Feeling Safe
  - incident management process, What we learned, Elements of Incident Management Process
  - incident response, Practices
  - managed incident example, A Managed Incident
  - roles, Recursive Separation of Responsibilities
  - template for, Example Incident State Document
  - unmanaged incident example, Unmanaged Incidents
  - when to declare an incident, When to Declare an Incident
- infrastructure services
- integration proposals, Enforcement of Policies and Procedures
- integration tests, Integration tests, Integration
- intent-based capacity planning
  - Auxon implementation, Introduction to Auxon
  - basic premise of, Our Solution: Intent-Based Capacity Planning
  - benefits of, Our Solution: Intent-Based Capacity Planning
  - defined, Intent-Based Capacity Planning
  - deploying approximation, Approximation
  - driving adoption of, Raising Awareness and Driving Adoption-Designing at the right level
  - precursors to intent, Precursors to Intent
  - requirements and implementation, Requirements and Implementation: Successes and Lessons Learned
  - selecting intent level, Intent-Based Capacity Planning
  - team dynamics, Team Dynamics
- interrupts
  - cognitive flow state and, Cognitive Flow State
  - dealing with, Dealing with Interrupts
  - dealing with high volumes, General suggestions
  - determining approach to handling, Factors in Determining How Interrupts Are Handled
  - distractibility and, Distractibility
  - managing operational load, Managing Operational Load
  - on-call engineers and, On-call
  - ongoing responsibilities, Ongoing responsibilities
  - polarizing time, Polarizing time
  - reducing, Reducing Interrupts
  - ticket assignments, Tickets
- IRC (Internet Relay Chat), A Recognized Command Post
L
- labelsets, Labels and Vectors
- lame duck state, A Robust Approach to Unhealthy Tasks: Lame Duck State
- latency
- launch coordination
- lazy deletion, The 24 Combinations of Data Integrity Failure Modes
- leader election, Managing Critical State: Distributed Consensus for Reliability, Highly Available Processing Using Leader Election
- lease systems, Reliable Distributed Queuing and Messaging
- Least-Loaded Round Robin policy, Least-Loaded Round Robin
- level of service, Service Level Objectives
  - (see also service level objectives (SLOs))
- living incident documents, Live Incident State Document
- load balancing
  - datacenter
    - datacenter services and tasks, Load Balancing in the Datacenter
    - flow control, A Simple Approach to Unhealthy Tasks: Flow Control
    - Google's application of, Load Balancing in the Datacenter
    - handling overload, Handling Overload
    - ideal CPU usage, The Ideal Case, The Pitfalls of “Queries per Second”
    - lame duck state, A Robust Approach to Unhealthy Tasks: Lame Duck State
    - limiting the connections pool, Limiting the Connections Pool with Subsetting-A Subset Selection Algorithm: Deterministic Subsetting
    - packet encapsulation, Load Balancing at the Virtual IP Address
    - policies for, Load Balancing Policies-Weighted Round Robin
    - SRE software engineering dynamics, Team Dynamics
  - distributed consensus systems and, Capacity and Load Balancing
  - frontend
  - policy
- load shedding, Load Shedding and Graceful Degradation
- load tests, Overload Behavior and Load Tests
- lock services, Lock Service, Distributed Coordination and Locking Services
- logging, Examine
- Lustre, Storage
M
- machines
- majority quorums, Number of Replicas
- MapReduce, Challenges with the Periodic Pipeline Pattern
- mean time
- memory exhaustion, Memory
- Mencius algorithm, Stable Leaders
- meta-software, The Use Cases for Automation
- Midas Package Manager (MPM), Packaging
- model-view-controller pattern, Workflow as Model-View-Controller Pattern
- modularity, Modularity
- Moiré load pattern in pipelines, Moiré Load Pattern
- monitoring distributed systems
  - avoiding complexity in, As Simple as Possible, No Simpler
  - benefits of monitoring, Why Monitor?, Practical Alerting from Time-Series Data
  - best practices for, Monitoring
  - black-box vs. white-box, Black-Box Versus White-Box, Black-Box Monitoring
  - case studies, Bigtable SRE: A Tale of Over-Alerting-Gmail: Predictable, Scriptable Responses from Humans
  - challenges of, Monitoring for the Long Term, Practical Alerting from Time-Series Data
  - change-induced emergencies, Response
  - creating rules for, Tying These Principles Together
  - four golden signals of, The Four Golden Signals
  - guidelines for, Monitoring
  - instrumentation and performance, Worrying About Your Tail (or, Instrumentation and Performance)
  - monitoring philosophy, Tying These Principles Together
  - resolution, Choosing an Appropriate Resolution for Measurements
  - setting expectations for, Setting Reasonable Expectations for Monitoring
  - short- vs. long-term availability, The Long Run
  - software for, Monitoring and Alerting
  - symptoms vs. causes, Symptoms Versus Causes
  - terminology, Definitions
  - valid monitoring outputs, Monitoring
  - (see also Borgmon; time-series monitoring)
- Multi-Paxos protocol, Multi-Paxos: Detailed Message Flow, Disk Access
  - (see also consensus algorithms)
- multi-site teams, Balance in Quantity
- multidimensional matrices, Labels and Vectors
- multiphase pipelines, Initial Effect of Big Data on the Simple Pipeline Pattern
- MySQL
N
- N + 2 configuration, Job and Data Organization, Intent-Based Capacity Planning-Introduction to Auxon, Preventing Server Overload, Capacity Planning
- negative results, Negative Results Are Magic
- Network Address Translation, Load Balancing at the Virtual IP Address
- network latency, Distributed Consensus Performance and Network Latency
- network load balancer, Load Balancing at the Virtual IP Address
- network partitions, Managing Critical State: Distributed Consensus for Reliability
- Network Quality of Service (QoS), What we learned, Criticality
- network security, Practices
- networking, Networking
- NORAD Tracks Santa website, Reliable Product Launches at Scale
- number of “nines”, Indicators, Availability Table
O
- older releases, rebuilding, Hermetic Builds
- on-call
  - balanced on-call, Balanced On-Call
  - benefits of, Conclusions
  - best practices for, You’ve Hired Your Next SRE(s), Now What?, Five Practices for Aspiring On-Callers-Shadow On-Call Early and Often
  - compensation structure, Compensation
  - continuing education, On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
  - education practices, You’ve Hired Your Next SRE(s), Now What?, Learning Paths That Are Cumulative and Orderly
  - formal incident-management protocols, Feeling Safe
  - inappropriate operational loads, Avoiding Inappropriate Operational Load
  - initial learning experiences, Initial Learning Experiences: The Case for Structure Over Chaos
  - learning checklists, Documentation as Apprenticeship
  - overview of, Being On-Call, Closing Thoughts
  - resources for, Feeling Safe
  - rotation schedules, Life of an On-Call Engineer
  - shadow on-call, Shadow On-Call Early and Often
  - stress-reduction techniques, Feeling Safe
  - target event volume, Ensuring a Durable Focus on Engineering
  - targeted project work, Targeted Project Work, Not Menial Work
  - team building, You’ve Hired Your Next SRE(s), Now What?
  - time requirements, Balance in Quality
  - training for, Learning Paths That Are Cumulative and Orderly-A Hunger for Failure: Reading and Sharing Postmortems
  - training materials, Creating Stellar Reverse Engineers and Improvisational Thinkers
  - typical activities, Life of an On-Call Engineer
- one-phase pipelines, Initial Effect of Big Data on the Simple Pipeline Pattern
- open commenting/annotation system, Collaborate and Share Knowledge
- operational load
- operational overload, Operational Overload
- operational underload, A Treacherous Enemy: Operational Underload
- operational work (see toil)
- out-of-band checks and balances, Choosing a Strategy for Superior Data Integrity, Out-of-band data validation
- out-of-band communications systems, What went well
- outage tracking
- Outalator
- overhead, Toil Defined
- overload handling
  - approaches to, Handling Overload
  - best practices for, Overloads and Failure
  - client-side throttling, Client-Side Throttling
  - load from connections, Load from Connections
  - overload errors, Handling Overload Errors
  - overview of, Conclusions
  - per-client retry budget, Deciding to Retry
  - per-customer limits, Per-Customer Limits
  - per-request retry budget, Deciding to Retry
  - product launches and, Overload Behavior and Load Tests
  - request criticality, Criticality
  - retrying requests, Deciding to Retry
  - utilization signals, Utilization Signals
  - (see also cascading failures)
P
- package managers, Packaging
- packet encapsulation, Load Balancing at the Virtual IP Address
- Paxos consensus algorithm
- performance
- performance tests, System tests
- periodic pipelines, Challenges with the Periodic Pipeline Pattern
- periodic scheduling (see cron)
- persistent storage, Disk Access
- Photon, Number of Replicas
- pipelining, Batching
- planned changes, Planned Changes, Drains, or Turndowns
- policies and procedures, enforcing, Enforcement of Policies and Procedures
- post hoc analysis, Setting Reasonable Expectations for Monitoring
- postmortems
  - benefits of, Postmortem Culture: Learning from Failure
  - best practices for, Google’s Postmortem Philosophy-Introducing a Postmortem Culture, Postmortems
  - collaboration and sharing in, Collaborate and Share Knowledge
  - concept of, Postmortem Culture: Learning from Failure
  - cross-industry lessons, Postmortem Culture
  - example postmortem, Example Postmortem-Timeline
  - formal review and publication of, Collaborate and Share Knowledge
  - Google's philosophy for, Google’s Postmortem Philosophy
  - guidelines for, Ensuring a Durable Focus on Engineering
  - introducing postmortem cultures, Introducing a Postmortem Culture
  - on-call engineering and, A Hunger for Failure: Reading and Sharing Postmortems
  - ongoing improvements to, Conclusion and Ongoing Improvements
  - rewarding participation in, Introducing a Postmortem Culture
  - triggers for, Google’s Postmortem Philosophy
- privacy, Choosing a Strategy for Superior Data Integrity
- proactive testing, Encourage Proactive Testing
- problem reports, Problem Report
- process death, Process Death
- process health checks, Stop Health Check Failures/Deaths
- process updates, Process Updates
- process-induced emergencies, Process-Induced Emergency
- Prodtest (Production Test), Detecting Inconsistencies with Prodtest
- product launches
  - best practices for, Progressive Rollouts
  - defined, Reliable Product Launches at Scale
  - development of Launch Coordination Engineering (LCE), Development of LCE-Infrastructure churn
  - driving convergence and simplification, Driving Convergence and Simplification
  - launch coordination checklists, The Launch Checklist-Example action items, Launch Coordination Checklist
  - launch coordination engineering, Launch Coordination Engineering
  - NORAD Tracks Santa example, Reliable Product Launches at Scale
  - overview of, Conclusion
  - processes for, Setting Up a Launch Process
  - rate of, Reliable Product Launches at Scale
  - techniques for reliable, Selected Techniques for Reliable Launches-Overload Behavior and Load Tests
- production environment (see Google production environment)
- production inconsistencies
- production meetings, Communications: Production Meetings-Attendance
- production probes, Production Probes
- Production Readiness Review process (see SRE engagement model)
- production tests, Production Tests
- protocol buffers (protobufs), Our Software Infrastructure, Integration
- Protocol Data Units, Load Balancing at the Virtual IP Address
- provisioning, guidelines for, Provisioning
- PRR (Production Readiness Review) model, The PRR Model, Production Readiness Reviews: Simple PRR Model-Continuous Improvement
- push frequency, Motivation for Error Budgets
- push managers, Ongoing responsibilities
- Python’s safe_load, Integration
R
- Raft consensus protocol, Multi-Paxos: Detailed Message Flow, Stable Leaders
  - (see also consensus algorithms)
- RAID, Overarching Layer: Replication
- Rapid automated release system, Continuous Build and Deployment, Rapid
- read workload, scaling, Scaling Read-Heavy Workloads
- real backups, Backups Versus Archives
- real-time collaboration, Collaborate and Share Knowledge
- recoverability, Challenges of Maintaining Data Integrity Deep and Wide
- recovery, Knowing That Data Recovery Will Work
- recovery systems, Delivering a Recovery System, Rather Than a Backup System
- recursion (see recursion)
- recursive DNS servers, Load Balancing Using DNS
- recursive separation of responsibilities, Recursive Separation of Responsibilities
- redundancy, Challenges of Maintaining Data Integrity Deep and Wide, Overarching Layer: Replication
- Reed-Solomon erasure codes, Overarching Layer: Replication
- regression tests, System tests
- release engineering
- reliability testing
  - amount required, Testing for Reliability
  - benefits of, Conclusion
  - break-glass mechanisms, Expect Testing Fail
  - canary tests, Canary test
  - configuration tests, Configuration test
  - coordination of, The Need for Speed
  - creating test and build environments, Creating a Test and Build Environment
  - error budgets, Pursuing Maximum Change Velocity Without Violating a Service’s SLO, Motivation for Error Budgets-Forming Your Error Budget, Error Budgets
  - expecting test failure, Expect Testing Fail
  - fake backend versions, Production Probes
  - goals of, Testing for Reliability
  - importance of, Preface
  - integration tests, Integration tests, Integration
  - MTTR and, Testing for Reliability
  - performance tests, System tests
  - proactive, Encourage Proactive Testing
  - production probes, Production Probes
  - production tests, Production Tests
  - regression tests, System tests
  - reliability goals, Embracing Risk
  - sanity testing, System tests
  - segregated environments and, Pushing to Production
  - smoke tests, System tests
  - speed of, The Need for Speed
  - statistical tests, Testing Disaster
  - stress tests, Stress test
  - system tests, System tests
  - testing at scale, Testing at Scale-Production Probes
  - timing of, Production Tests
  - unit tests, Unit tests
- reliable replicated datastores, Reliable Replicated Datastores and Configuration Stores
- Remote Procedure Call (RPC), Our Software Infrastructure, Examine, Criticality
- replicas
- replicated logs, Number of Replicas
- replicated state machine (RSM), Reliable Replicated State Machines
- replication, Challenges of Maintaining Data Integrity Deep and Wide, Overarching Layer: Replication
- request latency, Indicators, The Four Golden Signals
- request profile changes, Request profile changes
- request success rate, Measuring Service Risk
- resilience testing, Practices
- resources
- restores, 1T Versus 1E: Not “Just” a Bigger Backup
- retention, Retention
- retries, RPC
- reverse engineering, Reverse Engineers: Figuring Out How Things Work
- reverse proxies, What went well
- revision history, First Layer: Soft Deletion
- risk management
- rollback procedures, What we learned
- rollouts, New Rollouts, Rollout Planning, Progressive Rollouts
- root cause
- Round Robin policy, Simple Round Robin
- round-trip time (RTT), Distributed Consensus Performance and Network Latency
- rows, Hardware
- rule evaluation, in monitoring systems, Rule Evaluation
S
- sanity testing, System tests
- saturation, The Four Golden Signals
- scale
- security
- self-service model, Self-Service Model
- separation of responsibilities, Recursive Separation of Responsibilities
- servers
- service availability
- service health checks, Stop Health Check Failures/Deaths
- service latency
- service level agreements (SLAs), Agreements
- service level indicators (SLIs)
- service level objectives (SLOs)
  - agreements in practice, Agreements in Practice
  - best practices for, Define SLOs Like a User
  - choosing, Service Level Objectives-Objectives
  - control measures, Control Measures
  - defined, Objectives
  - defining objectives, Objectives in Practice
  - selecting relevant indicators, What Do You and Your Users Care About?
  - statistical fallacies and, Aggregation
  - target selection, Choosing Targets
  - user expectations and, Objectives, SLOs Set Expectations
- service management
- service reliability hierarchy
- service unavailability, Service Unavailability
- Service-Oriented Architecture (SOA), Service-Oriented Cluster-Turnup
- Shakespeare search service, example
- sharded deployments, Capacity and Load Balancing
- SHEDDABLE_PLUS criticality value, Criticality
- simplicity, Simplicity-A Simple Conclusion
- Sisyphus automation framework, Deployment
- Site Reliability Engineering (SRE)
  - activities included in, Practices
  - approach to learning, Preface
  - basic components of, Preface
  - benefits of, Google’s Approach to Service Management: Site Reliability Engineering
  - challenges of, Google’s Approach to Service Management: Site Reliability Engineering
  - defined, Foreword, Google’s Approach to Service Management: Site Reliability Engineering
  - early engineers, Preface
  - Google’s approach to management, Google’s Approach to Service Management: Site Reliability Engineering, Communication and Collaboration in SRE
  - growth of at Google, Conclusion
  - hiring, Google’s Approach to Service Management: Site Reliability Engineering, You’ve Hired Your Next SRE(s), Now What?
  - origins of, Preface
  - sysadmin approach to management, The Sysadmin Approach to Service Management, Consistency
  - team composition and skills, Google’s Approach to Service Management: Site Reliability Engineering, Introduction, Conclusion
  - tenets of, Tenets of SRE-Efficiency and Performance
  - typical activities of, What Qualifies as Engineering?
  - widespread applications of, Preface
- slow startup, Slow Startup and Cold Caching
- smoke tests, System tests
- SNMP (Simple Network Management Protocol), Collection of Exported Data
- soft deletion, First Layer: Soft Deletion
- software bloat, The “Negative Lines of Code” Metric
- software engineering in SRE
  - activities included in, Practices
  - Auxon case study, Auxon Case Study: Project Background and Problem Space-Our Solution: Intent-Based Capacity Planning
  - benefits of, Conclusions
  - encouraging, Raising Awareness and Driving Adoption
  - fostering, Fostering Software Engineering in SRE
  - Google's focus on, Software Engineering in SRE
  - importance of, Why Is Software Engineering Within SRE Important?
  - intent-based capacity planning, Our Solution: Intent-Based Capacity Planning-Team Dynamics
  - staffing and development time, Successfully Building a Software Engineering Culture in SRE: Staffing and Development Time
  - team dynamics, Team Dynamics
- software fault tolerance, Motivation for Error Budgets
- software simplicity
- Spanner, Storage, Cost, Ensuring Business Continuity
- SRE engagement model
- SRE tools
- SRE Way, The End of the Beginning
- stability vs. agility, System Stability Versus Agility
  - (see also software simplicity)
- stable leaders, Stable Leaders
- statistical tests, Testing Disaster
- storage stack, Storage
- stress tests, Stress test
- strong leader process, Multi-Paxos: Detailed Message Flow
- Stubby, Our Software Infrastructure
- subsetting
- synchronous consensus, How Distributed Consensus Works
- sysadmins (systems administrators), The Sysadmin Approach to Service Management, Consistency
- system software
- system tests, System tests
- system throughput, Indicators
- systems administrators (sysadmins), The Sysadmin Approach to Service Management, Consistency
- systems engineering, Management
T
- tagging, Tagging
- “task overloaded” errors, Handling Overload Errors
- tasks
- TCP/IP communication protocol, Distributed Consensus Performance and Network Latency
- team building
  - benefits of Google's approach to, Google’s Approach to Service Management: Site Reliability Engineering, Conclusion
  - best practices for, SRE Teams
  - development focus, Google’s Approach to Service Management: Site Reliability Engineering
  - dynamics of SRE software engineering, Team Dynamics
  - eliminating complexity, The Virtue of Boring
  - engineering focus, Google’s Approach to Service Management: Site Reliability Engineering, Ensuring a Durable Focus on Engineering, What Qualifies as Engineering?, Introduction-Balance in Quantity, Conclusion
  - multi-site teams, Balance in Quantity
  - self-sufficiency, Self-Service Model
  - skills needed, Google’s Approach to Service Management: Site Reliability Engineering
  - staffing and development time, Successfully Building a Software Engineering Culture in SRE: Staffing and Development Time
  - team composition, Google’s Approach to Service Management: Site Reliability Engineering
- terminology (Google-specific)
  - campuses, Hardware
  - clients, Our Software Infrastructure
  - clusters, Hardware
  - datacenters, Hardware
  - frontend/backend, Our Software Infrastructure
  - jobs, Managing Machines
  - machines, Hardware
  - protocol buffers (protobufs), Our Software Infrastructure
  - racks, Hardware
  - rows, Hardware
  - servers, Hardware, Our Software Infrastructure
  - tasks, Managing Machines
- test environments, Creating a Test and Build Environment
  - (see also reliability testing)
- test-induced emergencies, Test-Induced Emergency
- testing (see reliability testing)
- text logs, Examine
- thread starvation, Threads
- throttling
- “thundering herd” problems, “Thundering Herd” Problems, Dealing with Abusive Client Behavior
- time-based availability equation, Measuring Service Risk, Availability Table
- Time-Series Database (TSDB), Storage in the Time-Series Arena
- time-series monitoring
  - alerting, Alerting
  - black-box monitoring, Black-Box Monitoring
  - Borgmon monitoring system, The Rise of Borgmon
  - collection of exported data, Collection of Exported Data
  - instrumentation of applications, Instrumentation of Applications
  - maintaining Borgmon configuration, Maintaining the Configuration
  - monitoring topology, Sharding the Monitoring Topology
  - practical approach to, Practical Alerting from Time-Series Data
  - rule evaluation, Rule Evaluation
  - scaling, Ten Years On…
  - time-series data storage, Storage in the Time-Series Arena-Labels and Vectors
  - tools for, The Rise of Borgmon
- time-to-live (TTL), Load Balancing Using DNS
- timestamps, Reliable Replicated Datastores and Configuration Stores
- toil
- traffic analysis, Life of a Request-Job and Data Organization, The Four Golden Signals
- training practices, You’ve Hired Your Next SRE(s), Now What?, Learning Paths That Are Cumulative and Orderly
- triage process, Triage
- Trivial File Transfer Protocol (TFTP), What we learned
- troubleshooting
  - App Engine case study, Case Study
  - approaches to, Effective Troubleshooting
  - common pitfalls, Theory
  - curing issues, Cure
  - diagnosing issues, Diagnose-Specific diagnoses
  - examining system components, Examine
  - logging, Examine
  - model of, Theory
  - pitfalls, Theory
  - problem reports, Problem Report
  - process diagram, Theory
  - simplifying, Making Troubleshooting Easier
  - systematic approach to, Conclusion
  - testing and treating issues, Test and Treat-Negative Results Are Magic
  - triage, Triage
- turndown automation, What went well, Planned Changes, Drains, or Turndowns
- typographical conventions, Conventions Used in This Book
U
- unit tests, Unit tests
- UNIX pipe, Origin of the Pipeline Design Pattern
- unplanned downtime, Measuring Service Risk
- uptime, Choosing a Strategy for Superior Data Integrity
- user requests
  - criticality values assigned to, Criticality
  - job and data organization, Job and Data Organization
  - monitoring failures, The Four Golden Signals
  - request latency, Indicators
  - request latency monitoring, The Four Golden Signals
  - retrying, Deciding to Retry
  - servicing of, Life of a Request
  - success rate metrics, Measuring Service Risk
  - traffic analysis, Job and Data Organization, The Four Golden Signals
  - utilization signals, Utilization Signals