Part IV. Management

Our final selection of topics covers working together in a team, and working as teams. No SRE is an island, and there are some distinctive ways in which we work.

Any organization that aspires to be serious about running an effective SRE arm needs to consider training. A well-thought-out and well-executed training program that teaches SREs how to think in a complicated and fast-changing environment can instill, within a new hire’s first few weeks or months, best practices that would otherwise take months or years to accumulate. We discuss strategies for doing just that in Chapter 28, Accelerating SREs to On-Call and Beyond.

As anyone in the operations world knows, responsibility for any significant service comes with a lot of interruptions: production getting into a bad state, people requesting updates to their favorite binary, a long queue of consultation requests… Managing interrupts under turbulent conditions is a necessary skill, as we’ll discuss in Chapter 29, Dealing with Interrupts.

If the turbulent conditions have persisted for long enough, an SRE team needs to start recovering from operational overload. We have just the flight plan for you in Chapter 30, Embedding an SRE to Recover from Operational Overload.

We write in Chapter 31, Communication and Collaboration in SRE, about the different roles within SRE; cross-team, cross-site, and cross-continent communication; running production meetings; and case studies of how SRE has collaborated well.

Finally, Chapter 32, The Evolving SRE Engagement Model, examines a cornerstone of the operation of SRE: the production readiness review (PRR), a crucial step in onboarding a new service. We discuss how to conduct PRRs, and how to move beyond this successful, but also limited, model.
