Table of Contents for Site Reliability Engineering
Site Reliability Engineering
by Jennifer Petoff
Published by O'Reilly Media, Inc., 2016
Cover
Praise for Site Reliability Engineering
Site Reliability Engineering
Foreword
Preface
I. Introduction
1. Introduction
2. The Production Environment at Google, from the Viewpoint of an SRE
II. Principles
3. Embracing Risk
4. Service Level Objectives
5. Eliminating Toil
6. Monitoring Distributed Systems
7. The Evolution of Automation at Google
8. Release Engineering
9. Simplicity
III. Practices
10. Practical Alerting from Time-Series Data
11. Being On-Call
12. Effective Troubleshooting
13. Emergency Response
14. Managing Incidents
15. Postmortem Culture: Learning from Failure
16. Tracking Outages
17. Testing for Reliability
18. Software Engineering in SRE
19. Load Balancing at the Frontend
20. Load Balancing in the Datacenter
21. Handling Overload
22. Addressing Cascading Failures
23. Managing Critical State: Distributed Consensus for Reliability
24. Distributed Periodic Scheduling with Cron
25. Data Processing Pipelines
26. Data Integrity: What You Read Is What You Wrote
27. Reliable Product Launches at Scale
IV. Management
28. Accelerating SREs to On-Call and Beyond
29. Dealing with Interrupts
30. Embedding an SRE to Recover from Operational Overload
31. Communication and Collaboration in SRE
32. The Evolving SRE Engagement Model
V. Conclusions
33. Lessons Learned from Other Industries
34. Conclusion
A. Availability Table
B. A Collection of Best Practices for Production Services
C. Example Incident State Document
D. Example Postmortem
E. Launch Coordination Checklist
F. Example Production Meeting Minutes
Bibliography
Index
About the Authors
Colophon