A curated list of awesome Site Reliability and Production Engineering resources.
site-reliability-engineering production availability monitoring post-mortem reliability-engineering capacity-planning service-level-agreement scalability reliability alerting on-call site-reliability postmortem incident-response sre awesome awesome-list devops observability

An updated and curated list of readings that illustrate best practices and patterns for building scalable, available, stable, performant, and intelligent large-scale systems. Concepts are explained in articles by prominent engineers and in credible references. Case studies are taken from battle-tested systems that serve millions to billions of users. Work out whether you have a scalability problem (fast for a single user but slow under heavy load) or a performance problem (slow for a single user), then review the design principles and see how tech companies solve each kind of problem. The intelligence section is for those who work with data and machine learning at big (data) and deep (learning) scale.
system-design backend scalability site-reliability-engineering sre interview architecture devops site-reliability design-patterns back-end back-end-development interview-questions design-systems awesome-list microservices distributed-systems design-system tech big-data

This is a collection of postmortem templates derived from various sources, such as the Site Reliability Engineering book, The Practice of Cloud System Administration book, and other online resources. The templates can be loaded automatically, so you do not have to copy and paste from the files or write out the structure by hand every time you author an incident report.
site-reliability-engineering site-reliability devops postmortem incident-reports post-mortem

Calculate how much downtime should be permitted in your Service Level Agreement or Objective.
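The arithmetic behind a downtime allowance like the one described above can be sketched as follows. This is a minimal illustration, not the listed calculator's actual code: the function name `allowed_downtime_seconds` and the default 30-day window are assumptions made for the example.

```python
def allowed_downtime_seconds(slo_percent: float, period_days: float = 30.0) -> float:
    """Return the downtime budget, in seconds, for an availability target.

    slo_percent: availability target as a percentage, e.g. 99.9.
    period_days: length of the measurement window (30 days is an assumed default).
    """
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    if period_days <= 0:
        raise ValueError("period_days must be positive")

    period_seconds = period_days * 24 * 60 * 60
    # The permitted downtime is the fraction of the window NOT covered by the target.
    return period_seconds * (1.0 - slo_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_seconds(target) / 60
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

Under these assumptions, a 99.9% target over 30 days leaves roughly 43 minutes of permitted downtime; tightening the target by one "nine" cuts the budget by a factor of ten.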
calculator devops availability site-reliability-engineering service-level-agreement slo service-level-objective service-level-indicator sla chaos-engineering postmortem site-reliability service-level

This is a list of common Disaster Recovery scenarios for software companies. It is nearly impossible to cover every scenario that can happen. However, this list includes some common ones that can help companies kick-start their own set of policies.
security devops site-reliability-engineering disaster-recovery disaster-management chaos-engineering site-reliability

This is a collection of Markdown ports of the templates included in "The Site Reliability Engineering Workbook" for the Service Level Objectives and Error Budget Policy documents. A full description of each section can be found in "The Site Reliability Engineering Workbook".
devops reliability-engineering templates site-reliability-engineering slo sli sla site-reliability error-budget