sreworkbook-templates-md - A collection templates ported from the SRE Workbook

  •        10

This is a collection of ported Markdown templates included in "The Site Reliability Engineering Workbook" regarding the Service Level Objectives and Error Budget Policy documents. Full description of each section can be found in "The Site Reliability Engineering Workbook".

https://github.com/dastergon/sreworkbook-templates-md

Tags
Implementation
License
Platform

   




Related Projects

school-of-sre - At LinkedIn, we are using this curriculum for onboarding our entry-level talents into the SRE role

  •    HTML

Site Reliability Engineers (SREs) sits at the intersection of software engineering and systems engineering. While there are potentially infinite permutations and combinations of how infrastructure and software components can be put together to achieve an objective, focusing on foundational skills allows SREs to work with complex systems and software, regardless of whether these systems are proprietary, 3rd party, open systems, run on cloud/on-prem infrastructure, etc. Particularly important is to gain a deep understanding of how these areas of systems and infrastructure relate to each other and interact with each other. The combination of software and systems engineering skills is rare and is generally built over time with exposure to a wide variety of infrastructure, systems, and software. SREs bring in engineering practices to keep the site up. Each distributed system is an agglomeration of many components. SREs validate business requirements, convert them to SLAs for each of the components that constitute the distributed system, monitor and measure adherence to SLAs, re-architect or scale out to mitigate or avoid SLA breaches, add these learnings as feedback to new systems or projects and thereby reduce operational toil. Hence SREs play a vital role right from the day 0 design of the system.

awesome-scalability - Scalable, Available, Stable, Performant, and Intelligent System Design Patterns

  •    

An updated and curated list of readings to illustrate best practices and patterns in building scalable, available, stable, performant, and intelligent large-scale systems. Concepts are explained in the articles of prominent engineers and credible references. Case studies are taken from battle-tested systems that serve millions to billions of users. Understand your problems: scalability problem (fast for a single user but slow under heavy load) or performance problem (slow for a single user) by reviewing some design principles and checking how scalability and performance problems are solved at tech companies. The section of intelligence are created for those who work with data and machine learning at big (data) and deep (learning) scale.

cloud-ops-sandbox - Cloud Operations Sandbox is an open source tool that helps practitioners to learn Service Reliability Engineering practices from Google and apply them on their cloud services using Cloud Operations suite of tools

  •    HTML

Cloud Operations Sandbox is an open-source tool that helps practitioners to learn Service Reliability Engineering practices from Google and apply them on their cloud services using Cloud Operations (formerly Stackdriver). It is based on Hipster Shop, a cloud-native microservices application. Google Cloud Operations Suite is a suite of tools that helps you gain full observability of your code and applications. You might want to take Cloud Operations to a "test drive" in order to answer the question, "will it work for my application needs"? The most effective way to learn is by testing the tool in "real-life" conditions, but without risking a production system. With Sandbox, we provide a tool that automatically provisions a new demo cluster, which receives traffic, simulating real users. Practitioners can experiment with various Cloud Operations tools to solve problems and accomplish standard SRE tasks in a sandboxed environment.

awesome-chaos-engineering - A curated list of awesome Chaos Engineering resources.

  •    

A curated list of awesome Chaos Engineering resources. Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. - Principles Of Chaos Engineering website.


Litmus - Cloud-Native Chaos Engineering

  •    Go

Litmus is a toolset to do cloud-native chaos engineering. Litmus provides tools to orchestrate chaos on Kubernetes to help SREs find weaknesses in their deployments. SREs use Litmus to run chaos experiments initially in the staging environment and eventually in production to find bugs, vulnerabilities. Fixing the weaknesses leads to increased resilience of the system.

sloth - 🦥 Easy and simple Prometheus SLO (service level objectives) generator

  •    Go

Meet the easiest way to generate SLOs for Prometheus. Sloth generates understandable, uniform and reliable Prometheus SLOs for any kind of service. Using a simple SLO spec that results in multiple metrics and multi window multi burn alerts.

que - A Ruby job queue that uses PostgreSQL's advisory locks for speed and reliability.

  •    Ruby

TL;DR: Que is a high-performance alternative to DelayedJob or QueueClassic that improves the reliability of your application by protecting your jobs with the same ACID guarantees as the rest of your data. Que's primary goal is reliability. You should be able to leave your application running indefinitely without worrying about jobs being lost due to a lack of transactional support, or left in limbo due to a crashing process. Que does everything it can to ensure that jobs you queue are performed exactly once (though the occasional repetition of a job can be impossible to avoid - see the docs on how to write a reliable job).

sentinel-golang - Sentinel Go version (Reliability & Resilience)

  •    Go

As distributed systems become increasingly popular, the reliability between services is becoming more important than ever before. Sentinel takes "flow" as breakthrough point, and works on multiple fields including flow control, circuit breaking and system adaptive protection, to guarantee reliability and resiliency of microservices. See the 中文文档 for document in Chinese.

Elementary - Data observability platform for modern data teams that is open and transparent

  •    Python

Elementary was built out of the need to effortlessly and immediately gain visibility into the data stack, starting with tracing the actual upstream & downstream dependencies in the data warehouse, without any implementation efforts, security risks or compromises on accuracy.

Web Application Reliability and Defense

  •    Java

The Web Application Reliability and Defense (WARD) framework is a two-part security solution composed of a vulnerability detection component, SecureUnit, and a vulnerability protection component, SecureFilter.

hemera - 🔬 Writing reliable & fault-tolerant microservices with https://nats.io

  •    Javascript

Hemera (/ˈhɛmərə/; Ancient Greek: Ἡμέρα [hɛːméra] "day") is a small wrapper around the NATS driver. NATS is a simple, fast and reliable solution for the internal communication of a distributed system. It chooses simplicity and reliability over guaranteed delivery. We want to provide a toolkit to develop micro services in an easy and powerful way. We provide a pattern matching RPC style. You don't have to worry about the transport. NATS is powerful.Hemera has not been designed for high performance on a single process. It has been designed to create lots of microservices doesn't matter where they live. It choose simplicity and reliability as primary goals. It act together with NATS as central nervous system of your distributed system. Transport independency was not considered to be a relevant factor. In addition we use pattern matching which is very powerful. The fact that Hemera needs a broker is an argument which should be taken into consideration when you compare hemera with other frameworks. The relevant difference between microservice frameworks like senecajs, molecurer is not the performance or modularity its about the complexity you need to manage. Hemera is expert in providing an interface to work with lots of services in the network, NATS is the expert to deliver the message at the right place. Hemera is still a subscriber of NATS with some magic in routing and extensions. We don't have to worry about all different aspects in a distributed system like routing, load-balancing, service-discovery, clustering, health-checks ...

Good Enough Reliability Tool

  •    

An open source tool based on easy-to-measure internal metrics to provide an empirical estimate of reliability and to provide feedback to developers on the thoroughness of their testing effort relative to prior successful comparable projects.

hn-android - Hacker News client with a focus on reliability and usability.

  •    Java

This is the official repo for HN, an unofficial Hacker News client for Android, built for reliability and usability. Download the app here: https://play.google.com/store/apps/details?id=com.manuelmaly.hn and read the introductory blog post. If you find any issues, please post them into the Issues section, send a Pull request, or tweet me at @manuelmaly.

Griddb - High performance, High scalability and High reliability database for big data

  •    C++

GridDB is an In-Memory NoSQL Database for highly scalable IoT applications . It has a KVS (Key-Value Store)-type data model that is suitable for sensor data stored in a timeseries. It is a database that can be easily scaled-out according to the number of sensors. High Reliability It is equipped with a structure to spread out the replication of key value data among fellow nodes so that in the event of a node failure, automatic failover can be carried out in a matter of seconds by using the replication function of other nodes.

Co-dfns - High-performance, Reliable, and Parallel APL

  •    APL

The Co-dfns project aims to provide a high-performance, high-reliability compiler for a parallel extension of the Dyalog dfns programming language. The dfns language is a functionally oriented, lexically scoped dialect of APL. The Co-dfns language extends the dfns language to include explicit task parallelism with implicit structures for synchronization and determinism. The language is designed to enable rigorous formal analysis of programs to aid in compiler optimization and programmer productivity, as well as in the general reliability of the code itself. Our mission is to deliver scalable APL programming to information and domain experts across many fields, expanding the scope and capabilities of what you can effectively accomplish with APL.

tla-rust - writing correct lock-free and distributed stateful systems in Rust, assisted by TLA+

  •    TLA

Stable stateful systems through modeling, linear types and simulation. I like to use things that wake me up at 4am as rarely as possible. Unfortunately, infrastructure vendors don't focus on reliability. Even if a company gives reliability lip service, it's unlikely that they use techniques like modeling or simulation to create a rock-solid core. Let's just build an open-source distributed store that takes correctness seriously at the local storage, sharding, and distributed transactional layers.

elasticell - Elastic Key-Value Storage With Strong Consistency and Reliability

  •    Go

Elasticell is a distributed NoSQL database with strong consistency and reliability. Compatible with Redis protocol Use Elasticell as Redis. You can replace Redis with Elasticell to power your application without changing a single line of code in most cases(unsupport-redis-commands).

httpx - httpx is a fast and multi-purpose HTTP toolkit allows to run multiple probers using retryablehttp library, it is designed to maintain the result reliability with increased threads

  •    Go

httpx is a fast and multi-purpose HTTP toolkit allow to run multiple probers using retryablehttp library, it is designed to maintain the result reliability with increased threads. This will display help for the tool. Here are all the switches it supports.






We have large collection of open source products. Follow the tags from Tag Cloud >>


Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.