awesome-chaos-engineering - A curated list of awesome Chaos Engineering resources.

A curated list of awesome Chaos Engineering resources. "Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production." (Principles of Chaos Engineering website)

awesome-scalability - Scalable, Available, Stable, Performant, and Intelligent System Design Patterns

An updated and curated list of readings that illustrate best practices and patterns for building scalable, available, stable, performant, and intelligent large-scale systems. Concepts are explained in articles by prominent engineers and in credible references. Case studies are taken from battle-tested systems that serve millions to billions of users. Understand your problem first: is it a scalability problem (fast for a single user but slow under heavy load) or a performance problem (slow for a single user)? Review the design principles and see how tech companies have solved both kinds of problem. The intelligence section is intended for those who work with data and machine learning at big (data) and deep (learning) scale.

Litmus - Cloud-Native Chaos Engineering

  •    Go

Litmus is a toolset for cloud-native chaos engineering. It provides tools to orchestrate chaos on Kubernetes to help SREs find weaknesses in their deployments. SREs use Litmus to run chaos experiments, initially in a staging environment and eventually in production, to find bugs and vulnerabilities. Fixing these weaknesses increases the resilience of the system.




postmortem-templates - A collection of postmortem templates

This is a collection of postmortem templates derived from various sources, such as the Site Reliability Engineering book, The Practice of Cloud System Administration book, and other online resources. The templates can be loaded automatically, so you do not have to copy and paste from the files or write the structure by hand every time you author an incident report.

skinny - The Skinny Distributed Lock Service

  •    Go

Skinny comes with few code dependencies. A Skinny instance is started by running skinnyd, preferably with the --config option.

sre-handbook - A combined introduction to operating systems and computer networks

This handbook combines an introduction to operating systems with an introduction to computer networks. It is useful not only for site reliability engineers but also for most programmers. I use it as a memo, and it can serve as a quick recap when preparing for interviews.


common-disaster-recovery-scenarios - A list of common Disaster Recovery (DR) scenarios for software companies

This is a list of common Disaster Recovery scenarios for software companies. It is nearly impossible to cover every scenario that can happen. However, this list includes some common scenarios that can help companies kick-start their own set of policies.

sreworkbook-templates-md - A collection of templates ported from the SRE Workbook

This is a collection of Markdown templates ported from "The Site Reliability Engineering Workbook", covering the Service Level Objectives and Error Budget Policy documents. A full description of each section can be found in "The Site Reliability Engineering Workbook".

wheel-of-misfortune - A role-playing game for incident management training

  •    HTML

Wheel of Misfortune is a game that aims to build confidence in on-call engineers via simulated outage scenarios. With the game, you practice debugging problems under stress, understanding the incident management protocol, and communicating effectively with other engineers on your team and in your organization. It is a great way to train new hires, interns, and seasoned engineers to become well-rounded on-call engineers. The game is inspired by the Site Reliability Engineering book.

sre-book-in-audio - Google Site Reliability Engineering book converted to audio

Google's Site Reliability Engineering book converted to audio, for those who want to make use of their fragments of spare time. I do not own the content; the output is a simple crawl of the text passed through the Google TTS API.
