Site Reliability Engineering by Jennifer Petoff. Published by O'Reilly Media, Inc., 2016.

Part IV. Management

Our final selection of topics covers working together in a team, and working as teams. No SRE is an island, and there are some distinctive ways in which we work.

Any organization that aspires to be serious about running an effective SRE arm needs to consider training. A well-thought-out and well-executed training program that teaches SREs how to think in a complicated and fast-changing environment can instill, within a new hire’s first few weeks or months, best practices that would otherwise take months or years to accumulate. We discuss strategies for doing just that in Chapter 28, Accelerating SREs to On-Call and Beyond.

As anyone in the operations world knows, responsibility for any significant service comes with a lot of interruptions: production getting into a bad state, people requesting updates to their favorite binary, a long queue of consultation requests. Managing interrupts under turbulent conditions is a necessary skill, as we’ll discuss in Chapter 29, Dealing with Interrupts.

If the turbulent conditions have persisted for long enough, an SRE team needs to start recovering from operational overload. We have just the flight plan for you in Chapter 30, Embedding an SRE to Recover from Operational Overload.

We write in Chapter 31, Communication and Collaboration in SRE, about the different roles within SRE; cross-team, cross-site, and cross-continent communication; running production meetings; and case studies of how SRE has collaborated well.

Finally, Chapter 32, The Evolving SRE Engagement Model, examines a cornerstone of the operation of SRE: the production readiness review (PRR), a crucial step in onboarding a new service. We discuss how to conduct PRRs, and how to move beyond this successful, but also limited, model.

Further Reading from Google SRE

Building reliable systems requires a carefully calibrated mix of skills, ranging from software development to the arguably less-well-known systems analysis and engineering disciplines. We write about the latter disciplines in “The Systems Engineering Side of Site Reliability Engineering” [Hix15b].

Hiring SREs well is critical to having a high-functioning reliability organization, as explored in “Hiring Site Reliability Engineers” [Jon15]. Google’s hiring practices have been detailed in texts like Work Rules! [Boc15],¹ but hiring SREs has its own set of particularities. Even by Google’s overall standards, SRE candidates are difficult to find and even harder to interview effectively.

¹ Written by Laszlo Bock, Google’s Senior VP of People Operations.