Site Reliability Engineering by Jennifer Petoff. Published by O'Reilly Media, Inc., 2016

Appendix F. Example Production Meeting Minutes

Date: 2015-10-23

Attendees: agoogler, clarac, docbrown, jennifer, martym

Announcements:

  • Major outage (#465), blew through error budget

Previous Action Item Review

  • Certify Goat Teleporter for use with cattle (bug 1011101)

    • Nonlinearities in mass acceleration now predictable, should be able to target accurately in a few days.

Outage Review

  • New Sonnet (outage 465)

    • 1.21B queries lost to a cascading failure caused by the interaction of three factors: a latent bug (file descriptor leaked on searches with no results), the new sonnet missing from the corpus, and unprecedented & unexpected traffic volume

    • File descriptor leak bug fixed (bug 5554825) and deployed to prod

    • Looking into using flux capacitor for load balancing (bug 5554823) and using load shedding (bug 5554826) to prevent recurrence

    • Annihilated availability error budget; pushes to prod frozen for 1 month unless docbrown can obtain exception on grounds that event was bizarre & unforeseeable (but consensus is that exception is unlikely)
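
The load shedding idea tracked in bug 5554826 can be illustrated with a minimal sketch. It is not the team's actual design: the names LoadShedder and handle_search, the in-flight limit, and the 503-style response are all assumptions made for illustration. The sketch caps the number of requests in flight and rejects the excess cheaply, so that a traffic spike degrades service rather than cascading.

```python
import threading


class LoadShedder:
    """Hypothetical helper: reject work once too many requests are in flight."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        """Claim a slot; return False (shed the request) if none is free."""
        with self._lock:
            if self._in_flight >= self._max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Give the slot back once the request has finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle_search(shedder, search, query):
    """Serve a query, or fail fast with a 503-style result when overloaded.

    `search` is any callable that takes the query; it stands in for the
    real backend, which the minutes do not describe.
    """
    if not shedder.try_acquire():
        return 503, "overloaded, retry later"
    try:
        return 200, search(query)
    finally:
        # Always release the slot, even if the search raises.
        shedder.release()
```

Releasing the slot in a finally block matters here for the same reason the file descriptor leak did: a resource that is not returned on every code path eventually exhausts the pool.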

Paging Events

  • AnnotationConsistencyTooEventual: paged 5 times this week, likely due to cross-regional replication delay between Bigtables.

    • Investigation still ongoing, see bug 4821600

    • No fix expected soon, will raise acceptable consistency threshold to reduce unactionable alerts

Nonpaging Events

  • None

Monitoring Changes and/or Silences

  • AnnotationConsistencyTooEventual: acceptable delay threshold to be raised from 60s to 180s, see bug 4821600; TODO(martym).
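
The effect of moving the threshold from 60s to 180s can be sketched as follows. The delay samples are invented purely for illustration and are not measurements from bug 4821600; count_pages is a hypothetical helper, and a real alerting rule would usually also require the condition to hold for some duration before paging.

```python
# HYPOTHETICAL replication delays in seconds, made up for this example.
HYPOTHETICAL_DELAYS_S = [45, 70, 95, 120, 150, 175, 30]


def count_pages(delays_s, threshold_s):
    """Count the samples that would page at the given delay threshold."""
    return sum(1 for delay in delays_s if delay > threshold_s)


pages_at_60s = count_pages(HYPOTHETICAL_DELAYS_S, 60)
pages_at_180s = count_pages(HYPOTHETICAL_DELAYS_S, 180)
print(f"pages at 60s: {pages_at_60s}, pages at 180s: {pages_at_180s}")
```

The trade-off is that a higher threshold silences unactionable pages but also delays detection of a genuine replication stall, which is why the change is tied to the open investigation.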

Planned Production Changes

  • USA-1 cluster going offline for maintenance between 2015-10-29 and 2015-11-02.

    • No response required, traffic will automatically route to other clusters in region.

Resources

  • Borrowed resources to respond to sonnet++ incident, will spin down additional server instances and return resources next week

  • Utilization at 60% of CPU, 75% RAM, 44% disk (up from 40%, 70%, 40% last week)

Key Service Metrics

  • OK 99th percentile latency: 88 ms < 100 ms SLO target [trailing 30 days]

  • BAD availability: 86.95% < 99.99% SLO target [trailing 30 days]
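
The OK/BAD labels and the size of the availability miss follow from simple arithmetic, sketched below. The function names are assumptions for illustration; the inputs are the figures reported above. The error budget is the unavailability the SLO permits (1 minus the target), and the ratio of actual to permitted unavailability shows how far the budget was overspent.

```python
def latency_status(p99_ms: float, target_ms: float) -> str:
    """Latency meets the SLO when the measured value is below the target."""
    return "OK" if p99_ms < target_ms else "BAD"


def availability_status(measured: float, target: float) -> str:
    """Availability meets the SLO when it is at or above the target."""
    return "OK" if measured >= target else "BAD"


def error_budget_consumed(measured: float, target: float) -> float:
    """Return actual unavailability divided by allowed unavailability.

    A result of 1.0 means the budget is exactly spent; larger values
    mean it was overspent by that factor.
    """
    if not 0.0 <= measured <= 1.0:
        raise ValueError("measured availability must be between 0 and 1")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be strictly between 0 and 1")
    return (1.0 - measured) / (1.0 - target)


print(latency_status(88, 100))                # latency line above
print(availability_status(0.8695, 0.9999))    # availability line above
print(round(error_budget_consumed(0.8695, 0.9999)))
```

With 86.95% measured against a 99.99% target, the allowed unavailability is 0.01% and the actual unavailability is 13.05%, so the trailing 30-day budget was overspent by a factor of roughly 1,305.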

Discussion / Project Updates

  • Project Molière launching in two weeks.

New Action Items

  • TODO(martym): Raise AnnotationConsistencyTooEventual threshold.

  • TODO(docbrown): Return instance count to normal and return resources.