Index
A
- abusive client behavior, Dealing with Abusive Client Behavior
- access control, Enforcement of Policies and Procedures
- ACID datastore semantics, Managing Critical State: Distributed Consensus for Reliability, Choosing a Strategy for Superior Data Integrity
- acknowledgments, Acknowledgments
- adaptive throttling, Client-Side Throttling
- Ads Database, Automate Yourself Out of a Job: Automate ALL the Things!
- AdSense, Other service metrics
- aggregate availability equation, Measuring Service Risk, Availability Table
- aggregation, Rule Evaluation, Aggregation
- agility vs. stability, System Stability Versus Agility
  - (see also software simplicity)
- Alertmanager service, Alerting
- alerts
- anacron, Reliability Perspective
- Apache Mesos, Managing Machines
- App Engine, Case Study
- archives vs. backups, Backups Versus Archives
- asynchronous distributed consensus, How Distributed Consensus Works
- atomic broadcast systems, Reliable Distributed Queuing and Messaging
- attribution policy, Using Code Examples
- automation
  - applying to cluster turnups, Soothing the Pain: Applying Automation to Cluster Turnups-Service-Oriented Cluster-Turnup
  - vs. autonomous systems, The Evolution of Automation at Google
  - benefits of, The Value of Automation-The Value for Google SRE
  - best practices for change management, Change Management
  - Borg example, Borg: Birth of the Warehouse-Scale Computer
  - cross-industry lessons, Automating Away Repetitive Work and Operational Overhead
  - database example, Automate Yourself Out of a Job: Automate ALL the Things!
  - Diskerase example, Recommendations
  - focus on reliability, Reliability Is the Fundamental Feature
  - Google's approach to, The Value for Google SRE
  - hierarchy of automation classes, A Hierarchy of Automation Classes
  - recommendations for enacting, Recommendations
  - specialized application of, The Inclination to Specialize
  - use cases for, The Use Cases for Automation-A Hierarchy of Automation Classes
- automation tools, Testing Scalable Tools
- autonomous systems, The Evolution of Automation at Google
- Auxon case study, Auxon Case Study: Project Background and Problem Space-Our Solution: Intent-Based Capacity Planning, Introduction to Auxon
- availability, Indicators, Choosing a Strategy for Superior Data Integrity
  - (see also service availability)
- availability table, Availability Table
B
- B4 network, Hardware
- backend servers, Our Software Infrastructure, Load Balancing in the Datacenter
- backends, fake, Production Probes
- backups (see data integrity)
- Bandwidth Enforcer (BwE), Networking
- barrier tools, Testing Scalable Tools, Testing Disaster, Distributed Coordination and Locking Services
- batch processing pipelines, First Layer: Soft Deletion
- batching, Eliminate Batch Load, Batching, Drawbacks of Periodic Pipelines in Distributed Environments
- Bazel, Building
- best practices
  - capacity planning, Capacity Planning
  - for change management, Change Management
  - error budgets, Error Budgets
  - failures, Fail Sanely
  - feedback, Introducing a Postmortem Culture
  - for incident management, In Summary
  - monitoring, Monitoring
  - overloads and failure, Overloads and Failure
  - postmortems, Google’s Postmortem Philosophy-Collaborate and Share Knowledge, Postmortems
  - reward systems, Introducing a Postmortem Culture
  - role of release engineers in, The Role of a Release Engineer
  - rollouts, Progressive Rollouts
  - service level objectives, Define SLOs Like a User
  - team building, SRE Teams
- bibliography, Bibliography
- Big Data, Origin of the Pipeline Design Pattern
- Bigtable, Storage, Target level of availability, Bigtable SRE: A Tale of Over-Alerting
- bimodal latency, Bimodal latency
- black-box monitoring, Definitions, Black-Box Versus White-Box, Black-Box Monitoring
- blameless cultures, Google’s Postmortem Philosophy
- Blaze build tool, Building
- Blobstore, Storage, Choosing a Strategy for Superior Data Integrity
- Borg, Hardware-Managing Machines, Borg: Birth of the Warehouse-Scale Computer, Drawbacks of Periodic Pipelines in Distributed Environments
- Borg Naming Service (BNS), Managing Machines
- Borgmon, The Rise of Borgmon-Ten Years On…
- break-glass mechanisms, Expect Testing Fail
- build environments, Creating a Test and Build Environment
- business continuity, Ensuring Business Continuity
- Byzantine failures, How Distributed Consensus Works, Number of Replicas
C
- campuses, Hardware
- canarying, Motivation for Error Budgets, What we learned, Canary test, Gradual and Staged Rollouts
- CAP theorem, Managing Critical State: Distributed Consensus for Reliability
- CAPA (corrective and preventative action), Postmortem Culture
- capacity planning
  - approaches to, Practices
  - best practices for, Capacity Planning
  - Diskerase example, Recommendations
  - distributed consensus systems and, Capacity and Load Balancing
  - drawbacks of “queries per second”, The Pitfalls of “Queries per Second”
  - drawbacks of traditional plans, Brittle by nature
  - further reading on, Practices
  - intent-based (see intent-based capacity planning)
  - mandatory steps for, Demand Forecasting and Capacity Planning
  - preventing server overload with, Preventing Server Overload
  - product launches and, Capacity Planning
  - traditional approach to, Traditional Capacity Planning
- cascading failures
- change management, Change Management
- change-induced emergencies, Change-Induced Emergency-What we learned
- changelists (CLs), Our Development Environment
- Chaos Monkey, Testing Disaster
- checkpoint state, Testing Disaster
- cherry picking tactic, Hermetic Builds
- Chubby lock service, Lock Service, System Architecture Patterns for Distributed Consensus
- client tasks, Load Balancing in the Datacenter
- client-side throttling, Client-Side Throttling
- clients, Our Software Infrastructure
- clock drift, Managing Critical State: Distributed Consensus for Reliability
- Clos network fabric, Hardware
- cloud environment
- clusters
- code samples, Using Code Examples
- cognitive flow state, Cognitive Flow State
- cold caching, Slow Startup and Cold Caching
- colocation facilities (colos), Recommendations
- Colossus, Storage
- command posts, A Recognized Command Post
- communication and collaboration
- company-wide resilience testing, Practices
- compensation structure, Compensation
- computational optimization, Our Solution: Intent-Based Capacity Planning
- configuration management, Configuration Management, Change-Induced Emergency, Integration, Process Updates
- configuration tests, Configuration test
- consensus algorithms
  - Egalitarian Paxos, Stable Leaders
  - Fast Paxos, Reasoning About Performance: Fast Paxos, The Use of Paxos
  - improving performance of, Distributed Consensus Performance
  - Multi-Paxos, Disk Access
  - Paxos, How Distributed Consensus Works, Disk Access
  - Raft, Multi-Paxos: Detailed Message Flow, Stable Leaders
  - Zab, Stable Leaders
  - (see also distributed consensus systems)
- consistency
- consistent hashing, Load Balancing at the Virtual IP Address
- constraints, Laborious and imprecise
- Consul, System Architecture Patterns for Distributed Consensus
- consumer services, identifying risk tolerance of, Identifying the Risk Tolerance of Consumer Services-Other service metrics
- continuous build and deployment
  - Blaze build tool, Building
  - branching, Branching
  - build targets, Building
  - configuration management, Configuration Management
  - deployment, Deployment
  - packaging, Packaging
  - Rapid release system, Continuous Build and Deployment, Rapid
  - testing, Testing
  - typical release process, Rapid
- contributors, Acknowledgments
- coroutines, Origin of the Pipeline Design Pattern
- corporate network security, Practices
- correctness guarantees, Workflow Correctness Guarantees
- correlation vs. causation, Theory
- costs
- CPU consumption, The Pitfalls of “Queries per Second”, CPU, Overload Behavior and Load Tests
- crash-fail vs. crash-recover algorithms, How Distributed Consensus Works
- cron
  - at large scale, Running Large Cron
  - building at Google, Building Cron at Google-Running Large Cron
  - idempotency, Cron Jobs and Idempotency
  - large-scale deployment of, Cron at Large Scale
  - leader and followers, The leader
  - overview of, Summary
  - Paxos algorithm and, The Use of Paxos-Storing the State
  - purpose of, Distributed Periodic Scheduling with Cron
  - reliability applications of, Reliability Perspective
  - resolving partial failures, Resolving partial failures
  - storing state, Storing the State
  - tracking cron job state, Tracking the State of Cron Jobs
  - uses for, Cron
- cross-industry lessons
- current state, exposing, Examine
D
- D storage layer, Storage
- dashboards
- data analysis, with Outalator, Analysis
- data integrity
  - backups vs. archives, Backups Versus Archives
  - case studies in, Case Studies-Addressing the root cause
  - conditions leading to failure, Types of Failures That Lead to Data Loss
  - defined, Data Integrity: What You Read Is What You Wrote
  - expanded definition of, Data Integrity’s Strict Requirements
  - failure modes, The 24 Combinations of Data Integrity Failure Modes
  - from users’ perspective, Data Integrity Is the Means; Data Availability Is the Goal
  - overview of, Conclusion
  - selecting strategy for, Choosing a Strategy for Superior Data Integrity, Challenges faced by cloud developers
  - SRE approach to, How Google SRE Faces the Challenges of Data Integrity-Knowing That Data Recovery Will Work
  - SRE objectives for, Google SRE Objectives in Maintaining Data Integrity and Availability-Retention
  - SRE principles applied to, General Principles of SRE as Applied to Data Integrity-Defense in Depth
  - strict requirements, Data Integrity’s Strict Requirements
  - technical challenges of, Requirements of the Cloud Environment in Perspective
- data processing pipelines
  - business continuity and, Ensuring Business Continuity
  - challenges of uneven work distribution, Trouble Caused By Uneven Work Distribution
  - challenges to periodic pattern, Challenges with the Periodic Pipeline Pattern
  - drawbacks of periodic, Drawbacks of Periodic Pipelines in Distributed Environments-Moiré Load Pattern
  - effect of big data on, Initial Effect of Big Data on the Simple Pipeline Pattern
  - monitoring problems, Monitoring Problems in Periodic Pipelines-Moiré Load Pattern
  - origin of, Origin of the Pipeline Design Pattern
  - overview of, Summary and Concluding Remarks
  - pipeline depth, Initial Effect of Big Data on the Simple Pipeline Pattern
  - simple vs. multiphase pipelines, Initial Effect of Big Data on the Simple Pipeline Pattern
  - Workflow system, Introduction to Google Workflow, Workflow Correctness Guarantees
- data recovery, Knowing That Data Recovery Will Work
- datacenters
- datastores
- Decider, Automate Yourself Out of a Job: Automate ALL the Things!
- decision-making skills, Structured and Rational Decision Making
- defense in depth, for data integrity, The 24 Combinations of Data Integrity Failure Modes, Sunday, February 27, 2011, late in the evening, Defense in Depth
- demand forecasting, Demand Forecasting and Capacity Planning
- dependency hierarchies, Setting Reasonable Expectations for Monitoring, Dependencies among resources
- deployment, Deployment
  - (see also continuous build and deployment)
- development environment, Our Development Environment
- development/ops split, The Sysadmin Approach to Service Management
- DevOps, Google’s Approach to Service Management: Site Reliability Engineering
- Direct Server Response (DSR), Load Balancing at the Virtual IP Address
- disaster recovery tools, Testing Disaster
- disaster role playing, Disaster Role Playing
- disaster testing, Preparedness and Disaster Testing-Defense in Depth and Breadth
- disk access, Disk Access
- Diskerase process, Recommendations
- distractibility, Distractibility
- distributed consensus systems
  - benefits of, Managing Critical State: Distributed Consensus for Reliability
  - coordination, use in, Distributed Coordination and Locking Services
  - deploying, Deploying Distributed Consensus-Based Systems-Quorum composition
  - locking, use in, Managing Critical State: Distributed Consensus for Reliability
  - monitoring, Monitoring Distributed Consensus Systems
  - need for, Managing Critical State: Distributed Consensus for Reliability
  - overview of, Conclusion
  - patterns for, System Architecture Patterns for Distributed Consensus-Reliable Distributed Queuing and Messaging
  - performance of, Distributed Consensus Performance-Disk Access
  - principles, How Distributed Consensus Works
  - quorum composition, Quorum composition
  - quorum leasing technique, Quorum Leases
  - (see also consensus algorithms)
- distributed periodic scheduling (see cron)
- DNS (Domain Name System)
- DoubleClick for Publishers (DFP), Case Study: Migrating DFP to F1
- drains, Planned Changes, Drains, or Turndowns
- DTSS communication files, Origin of the Pipeline Design Pattern
- dueling proposers situation, Multi-Paxos: Detailed Message Flow
- durability, Indicators
E
- early detection for data integrity, Third Layer: Early Detection
  - (see also data integrity)
- Early Engagement Model, Evolving the Simple PRR Model: Early Engagement-Disengaging from a service
- “embarrassingly parallel” algorithms, Trouble Caused By Uneven Work Distribution
- embedded engineers, Embedding an SRE to Recover from Operational Overload-Conclusion
- emergency preparedness, Sunday, February 27, 2011, late in the evening
- emergency response
  - change-induced emergencies, Change-Induced Emergency-What we learned
  - essential elements of, Emergency Response
  - Five Whys, Ask “what,” “where,” and “why”, Example Postmortem
  - guidelines for, Emergency Response
  - initial response, What to Do When Systems Break
  - lessons learned, Keep a History of Outages
  - overview of, Conclusion
  - process-induced emergencies, Process-Induced Emergency
  - solution availability, All Problems Have Solutions
  - test-induced emergencies, Test-Induced Emergency
- encapsulation, Load Balancing at the Virtual IP Address
- endpoints, in debugging, Examine
- engagements (see SRE engagement model)
- error budgets
- error rates, Indicators, The Four Golden Signals
- Escalator, Escalator
- ETL pipelines, Origin of the Pipeline Design Pattern
- eventual consistency, Managing Critical State: Distributed Consensus for Reliability
- executor load average, Utilization Signals
F
- failures, best practices for, Fail Sanely
  - (see also cascading failures)
- fake backends, Production Probes
- false-positive alerts, Tagging
- feature flag frameworks, Feature Flag Frameworks
- file descriptors, File descriptors
- Five Whys, Ask “what,” “where,” and “why”, Example Postmortem
- flow control, A Simple Approach to Unhealthy Tasks: Flow Control
- FLP impossibility result, How Distributed Consensus Works
- Flume, Challenges with the Periodic Pipeline Pattern
- fragmentation, Load Balancing at the Virtual IP Address
G
- gated operations, Enforcement of Policies and Procedures
- Generic Routing Encapsulation (GRE), Load Balancing at the Virtual IP Address
- GFE (Google Frontend), Life of a Request, Load Balancing in the Datacenter
- GFS (Google File System), Detecting Inconsistencies with Prodtest, Highly Available Processing Using Leader Election, Extended Infrastructure-Tracking the State of Cron Jobs, Overarching Layer: Replication
- global overload, Per-Customer Limits
- Global Software Load Balancer (GSLB), Networking
- Gmail, Gmail: Predictable, Scriptable Responses from Humans, Gmail—February, 2011: Restore from GTape
- Google Apps for Work, Target level of availability
- Google Compute Engine, Indicators
- Google production environment
- Google Workflow system
- graceful degradation, Load Shedding and Graceful Degradation
- GTape, Gmail—February, 2011: Restore from GTape
H
- Hadoop Distributed File System (HDFS), Storage
- handoffs, Clear, Live Handoff
- “hanging chunk” problem, Trouble Caused By Uneven Work Distribution
- hardware
- health checks, Stop Health Check Failures/Deaths
- healthcare.gov, Practices
- hermetic builds, Hermetic Builds
- hierarchical quorums, Quorum composition
- high-velocity approach, Principles, High Velocity
- hotspotting, Picking the Right Subset
I
- idempotent operations, Resolving Inconsistencies Idempotently, Cron Jobs and Idempotency
- incident management
  - best practices for, In Summary
  - effective, Managing Incidents
  - formal protocols for, Feeling Safe
  - incident management process, What we learned, Elements of Incident Management Process
  - incident response, Practices
  - managed incident example, A Managed Incident
  - roles, Recursive Separation of Responsibilities
  - template for, Example Incident State Document
  - unmanaged incident example, Unmanaged Incidents
  - when to declare an incident, When to Declare an Incident
- infrastructure services
- integration proposals, Enforcement of Policies and Procedures
- integration tests, Integration tests, Integration
- intent-based capacity planning
  - Auxon implementation, Introduction to Auxon
  - basic premise of, Our Solution: Intent-Based Capacity Planning
  - benefits of, Our Solution: Intent-Based Capacity Planning
  - defined, Intent-Based Capacity Planning
  - deploying approximation, Approximation
  - driving adoption of, Raising Awareness and Driving Adoption-Designing at the right level
  - precursors to intent, Precursors to Intent
  - requirements and implementation, Requirements and Implementation: Successes and Lessons Learned
  - selecting intent level, Intent-Based Capacity Planning
  - team dynamics, Team Dynamics
- interrupts
  - cognitive flow state and, Cognitive Flow State
  - dealing with, Dealing with Interrupts
  - dealing with high volumes, General suggestions
  - determining approach to handling, Factors in Determining How Interrupts Are Handled
  - distractibility and, Distractibility
  - managing operational load, Managing Operational Load
  - on-call engineers and, On-call
  - ongoing responsibilities, Ongoing responsibilities
  - polarizing time, Polarizing time
  - reducing, Reducing Interrupts
  - ticket assignments, Tickets
- IRC (Internet Relay Chat), A Recognized Command Post
L
- labelsets, Labels and Vectors
- lame duck state, A Robust Approach to Unhealthy Tasks: Lame Duck State
- latency
- launch coordination
- lazy deletion, The 24 Combinations of Data Integrity Failure Modes
- leader election, Managing Critical State: Distributed Consensus for Reliability, Highly Available Processing Using Leader Election
- lease systems, Reliable Distributed Queuing and Messaging
- Least-Loaded Round Robin policy, Least-Loaded Round Robin
- level of service, Service Level Objectives
  - (see also service level objectives (SLOs))
- living incident documents, Live Incident State Document
- load balancing
  - datacenter
    - datacenter services and tasks, Load Balancing in the Datacenter
    - flow control, A Simple Approach to Unhealthy Tasks: Flow Control
    - Google's application of, Load Balancing in the Datacenter
    - handling overload, Handling Overload
    - ideal CPU usage, The Ideal Case, The Pitfalls of “Queries per Second”
    - lame duck state, A Robust Approach to Unhealthy Tasks: Lame Duck State
    - limiting the connections pool, Limiting the Connections Pool with Subsetting-A Subset Selection Algorithm: Deterministic Subsetting
    - packet encapsulation, Load Balancing at the Virtual IP Address
    - policies for, Load Balancing Policies-Weighted Round Robin
    - SRE software engineering dynamics, Team Dynamics
  - distributed consensus systems and, Capacity and Load Balancing
  - frontend
  - policy
- load shedding, Load Shedding and Graceful Degradation
- load tests, Overload Behavior and Load Tests
- lock services, Lock Service, Distributed Coordination and Locking Services
- logging, Examine
- Lustre, Storage
M
- machines
- majority quorums, Number of Replicas
- MapReduce, Challenges with the Periodic Pipeline Pattern
- mean time
- memory exhaustion, Memory
- Mencius algorithm, Stable Leaders
- meta-software, The Use Cases for Automation
- Midas Package Manager (MPM), Packaging
- model-view-controller pattern, Workflow as Model-View-Controller Pattern
- modularity, Modularity
- Moiré load pattern in pipelines, Moiré Load Pattern
- monitoring distributed systems
  - avoiding complexity in, As Simple as Possible, No Simpler
  - benefits of monitoring, Why Monitor?, Practical Alerting from Time-Series Data
  - best practices for, Monitoring
  - black-box vs. white-box, Black-Box Versus White-Box, Black-Box Monitoring
  - case studies, Bigtable SRE: A Tale of Over-Alerting-Gmail: Predictable, Scriptable Responses from Humans
  - challenges of, Monitoring for the Long Term, Practical Alerting from Time-Series Data
  - change-induced emergencies, Response
  - creating rules for, Tying These Principles Together
  - four golden signals of, The Four Golden Signals
  - guidelines for, Monitoring
  - instrumentation and performance, Worrying About Your Tail (or, Instrumentation and Performance)
  - monitoring philosophy, Tying These Principles Together
  - resolution, Choosing an Appropriate Resolution for Measurements
  - setting expectations for, Setting Reasonable Expectations for Monitoring
  - short- vs. long-term availability, The Long Run
  - software for, Monitoring and Alerting
  - symptoms vs. causes, Symptoms Versus Causes
  - terminology, Definitions
  - valid monitoring outputs, Monitoring
  - (see also Borgmon; time-series monitoring)
- Multi-Paxos protocol, Multi-Paxos: Detailed Message Flow, Disk Access
  - (see also consensus algorithms)
- multi-site teams, Balance in Quantity
- multidimensional matrices, Labels and Vectors
- multiphase pipelines, Initial Effect of Big Data on the Simple Pipeline Pattern
- MySQL
N
- N + 2 configuration, Job and Data Organization, Intent-Based Capacity Planning-Introduction to Auxon, Preventing Server Overload, Capacity Planning
- negative results, Negative Results Are Magic
- Network Address Translation, Load Balancing at the Virtual IP Address
- network latency, Distributed Consensus Performance and Network Latency
- network load balancer, Load Balancing at the Virtual IP Address
- network partitions, Managing Critical State: Distributed Consensus for Reliability
- Network Quality of Service (QoS), What we learned, Criticality
- network security, Practices
- networking, Networking
- NORAD Tracks Santa website, Reliable Product Launches at Scale
- number of “nines”, Indicators, Availability Table
O
- older releases, rebuilding, Hermetic Builds
- on-call
  - balanced on-call, Balanced On-Call
  - benefits of, Conclusions
  - best practices for, You’ve Hired Your Next SRE(s), Now What?, Five Practices for Aspiring On-Callers-Shadow On-Call Early and Often
  - compensation structure, Compensation
  - continuing education, On-Call and Beyond: Rites of Passage, and Practicing Continuing Education
  - education practices, You’ve Hired Your Next SRE(s), Now What?, Learning Paths That Are Cumulative and Orderly
  - formal incident-management protocols, Feeling Safe
  - inappropriate operational loads, Avoiding Inappropriate Operational Load
  - initial learning experiences, Initial Learning Experiences: The Case for Structure Over Chaos
  - learning checklists, Documentation as Apprenticeship
  - overview of, Being On-Call, Closing Thoughts
  - resources for, Feeling Safe
  - rotation schedules, Life of an On-Call Engineer
  - shadow on-call, Shadow On-Call Early and Often
  - stress-reduction techniques, Feeling Safe
  - target event volume, Ensuring a Durable Focus on Engineering
  - targeted project work, Targeted Project Work, Not Menial Work
  - team building, You’ve Hired Your Next SRE(s), Now What?
  - time requirements, Balance in Quality
  - training for, Learning Paths That Are Cumulative and Orderly-A Hunger for Failure: Reading and Sharing Postmortems
  - training materials, Creating Stellar Reverse Engineers and Improvisational Thinkers
  - typical activities, Life of an On-Call Engineer
- one-phase pipelines, Initial Effect of Big Data on the Simple Pipeline Pattern
- open commenting/annotation system, Collaborate and Share Knowledge
- operational load
- operational overload, Operational Overload
- operational underload, A Treacherous Enemy: Operational Underload
- operational work (see toil)
- out-of-band checks and balances, Choosing a Strategy for Superior Data Integrity, Out-of-band data validation
- out-of-band communications systems, What went well
- outage tracking
- Outalator
- overhead, Toil Defined
- overload handling
  - approaches to, Handling Overload
  - best practices for, Overloads and Failure
  - client-side throttling, Client-Side Throttling
  - load from connections, Load from Connections
  - overload errors, Handling Overload Errors
  - overview of, Conclusions
  - per-client retry budget, Deciding to Retry
  - per-customer limits, Per-Customer Limits
  - per-request retry budget, Deciding to Retry
  - product launches and, Overload Behavior and Load Tests
  - request criticality, Criticality
  - retrying requests, Deciding to Retry
  - utilization signals, Utilization Signals
  - (see also cascading failures)
P
- package managers, Packaging
- packet encapsulation, Load Balancing at the Virtual IP Address
- Paxos consensus algorithm
- performance
- performance tests, System tests
- periodic pipelines, Challenges with the Periodic Pipeline Pattern
- periodic scheduling (see cron)
- persistent storage, Disk Access
- Photon, Number of Replicas
- pipelining, Batching
- planned changes, Planned Changes, Drains, or Turndowns
- policies and procedures, enforcing, Enforcement of Policies and Procedures
- post hoc analysis, Setting Reasonable Expectations for Monitoring
- postmortems
  - benefits of, Postmortem Culture: Learning from Failure
  - best practices for, Google’s Postmortem Philosophy-Introducing a Postmortem Culture, Postmortems
  - collaboration and sharing in, Collaborate and Share Knowledge
  - concept of, Postmortem Culture: Learning from Failure
  - cross-industry lessons, Postmortem Culture
  - example postmortem, Example Postmortem-Timeline
  - formal review and publication of, Collaborate and Share Knowledge
  - Google's philosophy for, Google’s Postmortem Philosophy
  - guidelines for, Ensuring a Durable Focus on Engineering
  - introducing postmortem cultures, Introducing a Postmortem Culture
  - on-call engineering and, A Hunger for Failure: Reading and Sharing Postmortems
  - ongoing improvements to, Conclusion and Ongoing Improvements
  - rewarding participation in, Introducing a Postmortem Culture
  - triggers for, Google’s Postmortem Philosophy
- privacy, Choosing a Strategy for Superior Data Integrity
- proactive testing, Encourage Proactive Testing
- problem reports, Problem Report
- process death, Process Death
- process health checks, Stop Health Check Failures/Deaths
- process updates, Process Updates
- process-induced emergencies, Process-Induced Emergency
- Prodtest (Production Test), Detecting Inconsistencies with Prodtest
- product launches
  - best practices for, Progressive Rollouts
  - defined, Reliable Product Launches at Scale
  - development of Launch Coordination Engineering (LCE), Development of LCE-Infrastructure churn
  - driving convergence and simplification, Driving Convergence and Simplification
  - launch coordination checklists, The Launch Checklist-Example action items, Launch Coordination Checklist
  - launch coordination engineering, Launch Coordination Engineering
  - NORAD Tracks Santa example, Reliable Product Launches at Scale
  - overview of, Conclusion
  - processes for, Setting Up a Launch Process
  - rate of, Reliable Product Launches at Scale
  - techniques for reliable, Selected Techniques for Reliable Launches-Overload Behavior and Load Tests
- production environment (see Google production environment)
- production inconsistencies
- production meetings, Communications: Production Meetings-Attendance
- production probes, Production Probes
- Production Readiness Review process (see SRE engagement model)
- production tests, Production Tests
- protocol buffers (protobufs), Our Software Infrastructure, Integration
- Protocol Data Units, Load Balancing at the Virtual IP Address
- provisioning, guidelines for, Provisioning
- PRR (Production Readiness Review) model, The PRR Model, Production Readiness Reviews: Simple PRR Model-Continuous Improvement
- push frequency, Motivation for Error Budgets
- push managers, Ongoing responsibilities
- Python’s safe_load, Integration
R
- Raft consensus protocol, Multi-Paxos: Detailed Message Flow, Stable Leaders
  - (see also consensus algorithms)
- RAID, Overarching Layer: Replication
- Rapid automated release system, Continuous Build and Deployment, Rapid
- read workload, scaling, Scaling Read-Heavy Workloads
- real backups, Backups Versus Archives
- real-time collaboration, Collaborate and Share Knowledge
- recoverability, Challenges of Maintaining Data Integrity Deep and Wide
- recovery, Knowing That Data Recovery Will Work
- recovery systems, Delivering a Recovery System, Rather Than a Backup System
- recursion (see recursion)
- recursive DNS servers, Load Balancing Using DNS
- recursive separation of responsibilities, Recursive Separation of Responsibilities
- redundancy, Challenges of Maintaining Data Integrity Deep and Wide, Overarching Layer: Replication
- Reed-Solomon erasure codes, Overarching Layer: Replication
- regression tests, System tests
- release engineering
- reliability testing
  - amount required, Testing for Reliability
  - benefits of, Conclusion
  - break-glass mechanisms, Expect Testing Fail
  - canary tests, Canary test
  - configuration tests, Configuration test
  - coordination of, The Need for Speed
  - creating test and build environments, Creating a Test and Build Environment
  - error budgets, Pursuing Maximum Change Velocity Without Violating a Service’s SLO, Motivation for Error Budgets-Forming Your Error Budget, Error Budgets
  - expecting test failure, Expect Testing Fail
  - fake backend versions, Production Probes
  - goals of, Testing for Reliability
  - importance of, Preface
  - integration tests, Integration tests, Integration
  - MTTR and, Testing for Reliability
  - performance tests, System tests
  - proactive, Encourage Proactive Testing
  - production probes, Production Probes
  - production tests, Production Tests
  - regression tests, System tests
  - reliability goals, Embracing Risk
  - sanity testing, System tests
  - segregated environments and, Pushing to Production
  - smoke tests, System tests
  - speed of, The Need for Speed
  - statistical tests, Testing Disaster
  - stress tests, Stress test
  - system tests, System tests
  - testing at scale, Testing at Scale-Production Probes
  - timing of, Production Tests
  - unit tests, Unit tests
- reliable replicated datastores, Reliable Replicated Datastores and Configuration Stores
- Remote Procedure Call (RPC), Our Software Infrastructure, Examine, Criticality
- replicas
- replicated logs, Number of Replicas
- replicated state machine (RSM), Reliable Replicated State Machines
- replication, Challenges of Maintaining Data Integrity Deep and Wide, Overarching Layer: Replication
- request latency, Indicators, The Four Golden Signals
- request profile changes, Request profile changes
- request success rate, Measuring Service Risk
- resilience testing, Practices
- resources
- restores, 1T Versus 1E: Not “Just” a Bigger Backup
- retention, Retention
- retries, RPC
- reverse engineering, Reverse Engineers: Figuring Out How Things Work
- reverse proxies, What went well
- revision history, First Layer: Soft Deletion
- risk management
- rollback procedures, What we learned
- rollouts, New Rollouts, Rollout Planning, Progressive Rollouts
- root cause
- Round Robin policy, Simple Round Robin
- round-trip time (RTT), Distributed Consensus Performance and Network Latency
- rows, Hardware
- rule evaluation, in monitoring systems, Rule Evaluation
S
- sanity testing, System tests
- saturation, The Four Golden Signals
- scale
- security
- self-service model, Self-Service Model
- separation of responsibilities, Recursive Separation of Responsibilities
- servers
- service availability
- service health checks, Stop Health Check Failures/Deaths
- service latency
- service level agreements (SLAs), Agreements
- service level indicators (SLIs)
- service level objectives (SLOs)
  - agreements in practice, Agreements in Practice
  - best practices for, Define SLOs Like a User
  - choosing, Service Level Objectives-Objectives
  - control measures, Control Measures
  - defined, Objectives
  - defining objectives, Objectives in Practice
  - selecting relevant indicators, What Do You and Your Users Care About?
  - statistical fallacies and, Aggregation
  - target selection, Choosing Targets
  - user expectations and, Objectives, SLOs Set Expectations
- service management
- service reliability hierarchy
- service unavailability, Service Unavailability
- Service-Oriented Architecture (SOA), Service-Oriented Cluster-Turnup
- Shakespeare search service, example
- sharded deployments, Capacity and Load Balancing
- SHEDDABLE_PLUS criticality value, Criticality
- simplicity, Simplicity-A Simple Conclusion
- Sisyphus automation framework, Deployment
- Site Reliability Engineering (SRE)
  - activities included in, Practices
  - approach to learning, Preface
  - basic components of, Preface
  - benefits of, Google’s Approach to Service Management: Site Reliability Engineering
  - challenges of, Google’s Approach to Service Management: Site Reliability Engineering
  - defined, Foreword, Google’s Approach to Service Management: Site Reliability Engineering
  - early engineers, Preface
  - Google’s approach to management, Google’s Approach to Service Management: Site Reliability Engineering, Communication and Collaboration in SRE
  - growth of at Google, Conclusion
  - hiring, Google’s Approach to Service Management: Site Reliability Engineering, You’ve Hired Your Next SRE(s), Now What?
  - origins of, Preface
  - sysadmin approach to management, The Sysadmin Approach to Service Management, Consistency
  - team composition and skills, Google’s Approach to Service Management: Site Reliability Engineering, Introduction, Conclusion
  - tenets of, Tenets of SRE-Efficiency and Performance
  - typical activities of, What Qualifies as Engineering?
  - widespread applications of, Preface
- slow startup, Slow Startup and Cold Caching
- smoke tests, System tests
- SNMP (Simple Network Management Protocol), Collection of Exported Data
- soft deletion, First Layer: Soft Deletion
- software bloat, The “Negative Lines of Code” Metric
- software engineering in SRE
  - activities included in, Practices
  - Auxon case study, Auxon Case Study: Project Background and Problem Space-Our Solution: Intent-Based Capacity Planning
  - benefits of, Conclusions
  - encouraging, Raising Awareness and Driving Adoption
  - fostering, Fostering Software Engineering in SRE
  - Google's focus on, Software Engineering in SRE
  - importance of, Why Is Software Engineering Within SRE Important?
  - intent-based capacity planning, Our Solution: Intent-Based Capacity Planning-Team Dynamics
  - staffing and development time, Successfully Building a Software Engineering Culture in SRE: Staffing and Development Time
  - team dynamics, Team Dynamics
- software fault tolerance, Motivation for Error Budgets
- software simplicity
- Spanner, Storage, Cost, Ensuring Business Continuity
- SRE engagement model
- SRE tools
- SRE Way, The End of the Beginning
- stability vs. agility, System Stability Versus Agility
  - (see also software simplicity)
- stable leaders, Stable Leaders
- statistical tests, Testing Disaster
- storage stack, Storage
- stress tests, Stress test
- strong leader process, Multi-Paxos: Detailed Message Flow
- Stubby, Our Software Infrastructure
- subsetting
- synchronous consensus, How Distributed Consensus Works
- sysadmins (systems administrators), The Sysadmin Approach to Service Management, Consistency
- system software
- system tests, System tests
- system throughput, Indicators
- systems administrators (sysadmins), The Sysadmin Approach to Service Management, Consistency
- systems engineering, Management
T
- tagging, Tagging
- “task overloaded” errors, Handling Overload Errors
- tasks
- TCP/IP communication protocol, Distributed Consensus Performance and Network Latency
- team building
  - benefits of Google's approach to, Google’s Approach to Service Management: Site Reliability Engineering, Conclusion
  - best practices for, SRE Teams
  - development focus, Google’s Approach to Service Management: Site Reliability Engineering
  - dynamics of SRE software engineering, Team Dynamics
  - eliminating complexity, The Virtue of Boring
  - engineering focus, Google’s Approach to Service Management: Site Reliability Engineering, Ensuring a Durable Focus on Engineering, What Qualifies as Engineering?, Introduction-Balance in Quantity, Conclusion
  - multi-site teams, Balance in Quantity
  - self-sufficiency, Self-Service Model
  - skills needed, Google’s Approach to Service Management: Site Reliability Engineering
  - staffing and development time, Successfully Building a Software Engineering Culture in SRE: Staffing and Development Time
  - team composition, Google’s Approach to Service Management: Site Reliability Engineering
- terminology (Google-specific)
  - campuses, Hardware
  - clients, Our Software Infrastructure
  - clusters, Hardware
  - datacenters, Hardware
  - frontend/backend, Our Software Infrastructure
  - jobs, Managing Machines
  - machines, Hardware
  - protocol buffers (protobufs), Our Software Infrastructure
  - racks, Hardware
  - rows, Hardware
  - servers, Hardware, Our Software Infrastructure
  - tasks, Managing Machines
- test environments, Creating a Test and Build Environment
  - (see also reliability testing)
- test-induced emergencies, Test-Induced Emergency
- testing (see reliability testing)
- text logs, Examine
- thread starvation, Threads
- throttling
- “thundering herd” problems, “Thundering Herd” Problems, Dealing with Abusive Client Behavior
- time-based availability equation, Measuring Service Risk, Availability Table
- Time-Series Database (TSDB), Storage in the Time-Series Arena
- time-series monitoring
  - alerting, Alerting
  - black-box monitoring, Black-Box Monitoring
  - Borgmon monitoring system, The Rise of Borgmon
  - collection of exported data, Collection of Exported Data
  - instrumentation of applications, Instrumentation of Applications
  - maintaining Borgmon configuration, Maintaining the Configuration
  - monitoring topology, Sharding the Monitoring Topology
  - practical approach to, Practical Alerting from Time-Series Data
  - rule evaluation, Rule Evaluation
  - scaling, Ten Years On…
  - time-series data storage, Storage in the Time-Series Arena-Labels and Vectors
  - tools for, The Rise of Borgmon
- time-to-live (TTL), Load Balancing Using DNS
- timestamps, Reliable Replicated Datastores and Configuration Stores
- toil
- traffic analysis, Life of a Request-Job and Data Organization, The Four Golden Signals
- training practices, You’ve Hired Your Next SRE(s), Now What?, Learning Paths That Are Cumulative and Orderly
- triage process, Triage
- Trivial File Transfer Protocol (TFTP), What we learned
- troubleshooting
  - App Engine case study, Case Study
  - approaches to, Effective Troubleshooting
  - common pitfalls, Theory
  - curing issues, Cure
  - diagnosing issues, Diagnose-Specific diagnoses
  - examining system components, Examine
  - logging, Examine
  - model of, Theory
  - pitfalls, Theory
  - problem reports, Problem Report
  - process diagram, Theory
  - simplifying, Making Troubleshooting Easier
  - systematic approach to, Conclusion
  - testing and treating issues, Test and Treat-Negative Results Are Magic
  - triage, Triage
- turndown automation, What went well, Planned Changes, Drains, or Turndowns
- typographical conventions, Conventions Used in This Book
U
- unit tests, Unit tests
- UNIX pipe, Origin of the Pipeline Design Pattern
- unplanned downtime, Measuring Service Risk
- uptime, Choosing a Strategy for Superior Data Integrity
- user requests
  - criticality values assigned to, Criticality
  - job and data organization, Job and Data Organization
  - monitoring failures, The Four Golden Signals
  - request latency, Indicators
  - request latency monitoring, The Four Golden Signals
  - retrying, Deciding to Retry
  - servicing of, Life of a Request
  - success rate metrics, Measuring Service Risk
  - traffic analysis, Job and Data Organization, The Four Golden Signals
  - utilization signals, Utilization Signals