postmortem-templates - A collection of postmortem templates


This is a collection of postmortem templates derived from various sources such as the Site Reliability Engineering book, The Practice of Cloud System Administration book, and other online resources. The templates can be loaded automatically, so you do not have to copy-paste from the files or write the structure by hand every time you author an incident report.

https://github.com/dastergon/postmortem-templates





Related Projects

awesome-scalability - Scalable, Available, Stable, Performant, and Intelligent System Design Patterns


An updated and curated list of readings that illustrate best practices and patterns for building scalable, available, stable, performant, and intelligent large-scale systems. Concepts are explained in articles by prominent engineers and in credible references. Case studies are taken from battle-tested systems that serve millions to billions of users. Understand your problem first: is it a scalability problem (fast for a single user but slow under heavy load) or a performance problem (slow for a single user)? Review the design principles and check how tech companies have solved scalability and performance problems. The intelligence section is intended for those who work with data and machine learning at big (data) and deep (learning) scale.

cloud-ops-sandbox - Cloud Operations Sandbox is an open source tool that helps practitioners to learn Service Reliability Engineering practices from Google and apply them on their cloud services using Cloud Operations suite of tools

  •    HTML

Cloud Operations Sandbox is an open-source tool that helps practitioners learn Service Reliability Engineering practices from Google and apply them to their cloud services using Cloud Operations (formerly Stackdriver). It is based on Hipster Shop, a cloud-native microservices application. Google Cloud Operations Suite is a suite of tools that helps you gain full observability of your code and applications. You might want to take Cloud Operations for a "test drive" to answer the question, "Will it work for my application needs?" The most effective way to learn is by testing the tool in "real-life" conditions, but without risking a production system. Sandbox automatically provisions a new demo cluster that receives traffic simulating real users. Practitioners can experiment with various Cloud Operations tools to solve problems and accomplish standard SRE tasks in a sandboxed environment.

school-of-sre - At LinkedIn, we are using this curriculum for onboarding our entry-level talents into the SRE role

  •    HTML

Site Reliability Engineers (SREs) sit at the intersection of software engineering and systems engineering. While there are potentially infinite permutations and combinations of how infrastructure and software components can be put together to achieve an objective, focusing on foundational skills allows SREs to work with complex systems and software, regardless of whether these systems are proprietary, 3rd party, open systems, run on cloud/on-prem infrastructure, etc. It is particularly important to gain a deep understanding of how these areas of systems and infrastructure relate to and interact with each other. The combination of software and systems engineering skills is rare and is generally built over time with exposure to a wide variety of infrastructure, systems, and software. SREs bring in engineering practices to keep the site up. Each distributed system is an agglomeration of many components. SREs validate business requirements, convert them to SLAs for each of the components that constitute the distributed system, monitor and measure adherence to SLAs, re-architect or scale out to mitigate or avoid SLA breaches, and feed these learnings back into new systems or projects, thereby reducing operational toil. Hence SREs play a vital role right from the day 0 design of the system.

awesome-chaos-engineering - A curated list of awesome Chaos Engineering resources.


A curated list of awesome Chaos Engineering resources. "Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production." (Principles of Chaos Engineering website)


Litmus - Cloud-Native Chaos Engineering

  •    Go

Litmus is a toolset for cloud-native chaos engineering. Litmus provides tools to orchestrate chaos on Kubernetes to help SREs find weaknesses in their deployments. SREs use Litmus to run chaos experiments, initially in the staging environment and eventually in production, to find bugs and vulnerabilities. Fixing the weaknesses leads to increased resilience of the system.

hn-android - Hacker News client with a focus on reliability and usability.

  •    Java

This is the official repo for HN, an unofficial Hacker News client for Android, built for reliability and usability. Download the app here: https://play.google.com/store/apps/details?id=com.manuelmaly.hn and read the introductory blog post. If you find any issues, please post them into the Issues section, send a Pull request, or tweet me at @manuelmaly.

h4cker - This repository is primarily maintained by Omar Santos and includes resources related to ethical hacking / penetration testing, digital forensics and incident response (DFIR), vulnerability research, exploit development, reverse engineering, and more

  •    Java

This repository includes thousands of cybersecurity-related references and resources and is maintained by Omar Santos. It was created to provide supplemental material to several books, video courses, and live training created by Omar Santos and other co-authors. It provides over 6,000 references, scripts, tools, code, and other resources that help offensive and defensive security professionals learn and develop new skills. The repository provides guidance on how to build your own hacking environment and learn about offensive security (ethical hacking) techniques, vulnerability research, exploit development, reverse engineering, malware analysis, threat intelligence, threat hunting, and digital forensics and incident response (DFIR). It also includes examples of real-life penetration testing reports, and more. These courses serve as a comprehensive guide for any network and security professional who is starting a career in ethical hacking and penetration testing. They can also help individuals preparing for the Offensive Security Certified Professional (OSCP), the Certified Ethical Hacker (CEH), CompTIA PenTest+, and any other ethical hacking certification. The material helps any cybersecurity professional who wants to learn the skills required to become a professional ethical hacker or who wants to learn more about general hacking methodologies and concepts.

que - A Ruby job queue that uses PostgreSQL's advisory locks for speed and reliability.

  •    Ruby

TL;DR: Que is a high-performance alternative to DelayedJob or QueueClassic that improves the reliability of your application by protecting your jobs with the same ACID guarantees as the rest of your data. Que's primary goal is reliability. You should be able to leave your application running indefinitely without worrying about jobs being lost due to a lack of transactional support, or left in limbo due to a crashing process. Que does everything it can to ensure that jobs you queue are performed exactly once (though the occasional repetition of a job can be impossible to avoid - see the docs on how to write a reliable job).

sentinel-golang - Sentinel Go version (Reliability & Resilience)

  •    Go

As distributed systems become increasingly popular, reliability between services is becoming more important than ever before. Sentinel takes "flow" as its breakthrough point and works across multiple fields, including flow control, circuit breaking, and system adaptive protection, to guarantee the reliability and resiliency of microservices. Documentation is also available in Chinese (中文文档).
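To illustrate what circuit breaking means in general, here is a minimal sketch in Python. It is not Sentinel's API or its algorithm: the class `CircuitBreaker` and its parameters `max_failures`, `reset_after`, and `clock` are assumptions made for this example. After a run of consecutive failures the breaker rejects calls for a cool-down period, then lets one trial call through.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so the timing can be tested
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure reopens the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success resets the failure count
        return result
```

A caller would wrap each outbound request, for example `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile` function, and treat the `RuntimeError` as a signal to fall back or fail fast instead of waiting on an unhealthy dependency.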

Wireless Universal Resource File

  •    Java

Handset Detection for Mobile Applications. Device Description Database

easyXDM - A javascript library providing cross-browser, cross-site messaging/method invocation.

  •    XSLT

easyXDM is a JavaScript library that enables you as a developer to easily work around the limitation set in place by the Same Origin Policy, in turn making it easy to communicate and expose JavaScript APIs across domain boundaries. At its core, easyXDM provides a transport stack capable of passing string-based messages between two windows, a consumer (the main document) and a provider (a document included using an iframe). It does this by using one of several available techniques, always selecting the most efficient one for the current browser. For all implementations the transport stack offers bi-directionality, reliability, queueing, and sender verification.

Limon - Limon is a sandbox developed as a research project written in Python, which automatically collects, analyzes, and reports on the run time indicators of Linux malware

  •    Python

Limon is a sandbox developed as a research project written in Python, which automatically collects, analyzes, and reports on the run time indicators of Linux malware. It allows one to inspect the Linux malware before execution, during execution, and after execution (post-mortem analysis) by performing static, dynamic, and memory analysis using open source tools. Limon analyzes the malware in a controlled environment and monitors its activities and its child processes to determine the nature and purpose of the malware. It determines the malware's process activity and its interaction with the file system and network. It also performs memory analysis and stores the analyzed artifacts for later analysis.

response - Monzo's real-time incident response and reporting tool

  •    Javascript

Dealing with incidents can be stressful. On top of dealing with the issue at hand, responders are often responsible for handling comms, coordinating the efforts of other engineers, and reporting what happened after the fact. Monzo built Response to help reduce the pressure and cognitive burden on engineers during an incident, and to make it easy to create information-rich reports for others to learn from.

Elementary - Data observability platform for modern data teams that is open and transparent

  •    Python

Elementary was built out of the need to effortlessly and immediately gain visibility into the data stack, starting with tracing the actual upstream & downstream dependencies in the data warehouse, without any implementation efforts, security risks or compromises on accuracy.

Web Application Reliability and Defense

  •    Java

The Web Application Reliability and Defense (WARD) framework is a two-part security solution composed of a vulnerability detection component, SecureUnit, and a vulnerability protection component, SecureFilter.

hemera - 🔬 Writing reliable & fault-tolerant microservices with https://nats.io

  •    Javascript

Hemera (/ˈhɛmərə/; Ancient Greek: Ἡμέρα [hɛːméra] "day") is a small wrapper around the NATS driver. NATS is a simple, fast, and reliable solution for the internal communication of a distributed system. It chooses simplicity and reliability over guaranteed delivery. We want to provide a toolkit for developing microservices in an easy and powerful way. We provide pattern-matching, RPC-style communication, so you don't have to worry about the transport. NATS is powerful. Hemera has not been designed for high performance in a single process. It has been designed to create lots of microservices, no matter where they live. It chooses simplicity and reliability as its primary goals. Together with NATS, it acts as the central nervous system of your distributed system. Transport independence was not considered a relevant factor. In addition, we use pattern matching, which is very powerful. The fact that Hemera needs a broker should be taken into consideration when you compare Hemera with other frameworks. The relevant difference between microservice frameworks such as senecajs and moleculer is not performance or modularity; it is the complexity you need to manage. Hemera is the expert in providing an interface to work with lots of services in the network; NATS is the expert in delivering the message to the right place. Hemera is still a subscriber of NATS, with some magic in routing and extensions. We don't have to worry about all the different aspects of a distributed system like routing, load-balancing, service-discovery, clustering, health-checks ...

Good Enough Reliability Tool


An open source tool based on easy-to-measure internal metrics to provide an empirical estimate of reliability and to provide feedback to developers on the thoroughness of their testing effort relative to prior successful comparable projects.





