This is a collection of postmortem templates derived from various sources such as the Site Reliability Engineering book, The Practice of Cloud System Administration book and other online resources. It is possible to load the postmortem templates automatically without copy pasting from the files or manually writing the structure every time you want to author an incident report.
https://github.com/dastergon/postmortem-templatesTags | site-reliability-engineering site-reliability devops postmortem incident-reports post-mortem |
Implementation | |
License | creative-commons |
Platform |
A curated list of awesome Site Reliability and Production Engineering resources.
site-reliability-engineering production availability monitoring post-mortem reliability-engineering capacity-planning service-level-agreement scalability reliability alerting on-call site-reliability postmortem incident-response sre awesome awesome-list devops observabilityAn updated and curated list of readings to illustrate best practices and patterns in building scalable, available, stable, performant, and intelligent large-scale systems. Concepts are explained in the articles of prominent engineers and credible references. Case studies are taken from battle-tested systems that serve millions to billions of users. Understand your problems: scalability problem (fast for a single user but slow under heavy load) or performance problem (slow for a single user) by reviewing some design principles and checking how scalability and performance problems are solved at tech companies. The section of intelligence are created for those who work with data and machine learning at big (data) and deep (learning) scale.
system-design backend scalability site-reliability-engineering sre interview architecture devops site-reliability design-patterns back-end back-end-development interview-questions design-systems awesome-list microservices distributed-systems design-system tech big-dataCloud Operations Sandbox is an open-source tool that helps practitioners to learn Service Reliability Engineering practices from Google and apply them on their cloud services using Cloud Operations (formerly Stackdriver). It is based on Hipster Shop, a cloud-native microservices application. Google Cloud Operations Suite is a suite of tools that helps you gain full observability of your code and applications. You might want to take Cloud Operations to a "test drive" in order to answer the question, "will it work for my application needs"? The most effective way to learn is by testing the tool in "real-life" conditions, but without risking a production system. With Sandbox, we provide a tool that automatically provisions a new demo cluster, which receives traffic, simulating real users. Practitioners can experiment with various Cloud Operations tools to solve problems and accomplish standard SRE tasks in a sandboxed environment.
debugger devops cloud profiler google-cloud samples operations cloud-native sre stackdriver cloud-operations stackdriver-monitoring cloudops opencensus stackdriver-logs stackdriver-trace opentelemetry ops-management stackdriver-sandboxSite Reliability Engineers (SREs) sits at the intersection of software engineering and systems engineering. While there are potentially infinite permutations and combinations of how infrastructure and software components can be put together to achieve an objective, focusing on foundational skills allows SREs to work with complex systems and software, regardless of whether these systems are proprietary, 3rd party, open systems, run on cloud/on-prem infrastructure, etc. Particularly important is to gain a deep understanding of how these areas of systems and infrastructure relate to each other and interact with each other. The combination of software and systems engineering skills is rare and is generally built over time with exposure to a wide variety of infrastructure, systems, and software. SREs bring in engineering practices to keep the site up. Each distributed system is an agglomeration of many components. SREs validate business requirements, convert them to SLAs for each of the components that constitute the distributed system, monitor and measure adherence to SLAs, re-architect or scale out to mitigate or avoid SLA breaches, add these learnings as feedback to new systems or projects and thereby reduce operational toil. Hence SREs play a vital role right from the day 0 design of the system.
mysql git security networking hadoop nosql sre system-designA curated list of awesome Chaos Engineering resources. Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production. - Principles Of Chaos Engineering website.
netflix-chaos-monkey chaos-engineering chaos-monkey simian-army site-reliability-engineering resilience chaos chaos-community chaos-testing awesome awesome-listLitmus is a toolset to do cloud-native chaos engineering. Litmus provides tools to orchestrate chaos on Kubernetes to help SREs find weaknesses in their deployments. SREs use Litmus to run chaos experiments initially in the staging environment and eventually in production to find bugs, vulnerabilities. Fixing the weaknesses leads to increased resilience of the system.
kubernetes golang ansible microservices site-reliability-engineering cncf operator cloud-native fault-injection hacktoberfest litmus kubernetes-resources chaos-testing chaos-engineering crd operator-sdk chaoshub chaos-experimentsThis readme and related documentation are Work in Progress. Controller-manager: used to schedule and manage the lifecycle of CRD objects.
kubernetes microservice site-reliability-engineering cncf chaos operator cloud-native fault-injection hacktoberfest chaos-testing chaos-engineering crd chaos-experiments chaos-meshThis is the official repo for HN, an unofficial Hacker News client for Android, built for reliability and usability. Download the app here: https://play.google.com/store/apps/details?id=com.manuelmaly.hn and read the introductory blog post. If you find any issues, please post them into the Issues section, send a Pull request, or tweet me at @manuelmaly.
This repository includes thousands of cybersecurity-related references and resources and it is maintained by Omar Santos. This GitHub repository has been created to provide supplemental material to several books, video courses, and live training created by Omar Santos and other co-authors. It provides over 6,000 references, scripts, tools, code, and other resources that help offensive and defensive security professionals learn and develop new skills. This GitHub repository provides guidance on how build your own hacking environment, learn about offensive security (ethical hacking) techniques, vulnerability research, exploit development, reverse engineering, malware analysis, threat intelligence, threat hunting, digital forensics and incident response (DFIR), includes examples of real-life penetration testing reports, and more. These courses serve as comprehensive guide for any network and security professional who is starting a career in ethical hacking and penetration testing. It also can help individuals preparing for the Offensive Security Certified Professional (OSCP), the Certified Ethical Hacker (CEH), CompTIA PenTest+ and any other ethical hacking certification. This course helps any cyber security professional that want to learn the skills required to becoming a professional ethical hacker or that want to learn more about general hacking methodologies and concepts.
hacking penetration-testing hacking-series video-course cybersecurity ethical-hacking ethicalhacking hacker exploit exploits exploit-development vulnerability vulnerability-scanners vulnerability-assessment vulnerability-management vulnerability-identification awesome-lists awesome-list training hackersTL;DR: Que is a high-performance alternative to DelayedJob or QueueClassic that improves the reliability of your application by protecting your jobs with the same ACID guarantees as the rest of your data. Que's primary goal is reliability. You should be able to leave your application running indefinitely without worrying about jobs being lost due to a lack of transactional support, or left in limbo due to a crashing process. Que does everything it can to ensure that jobs you queue are performed exactly once (though the occasional repetition of a job can be impossible to avoid - see the docs on how to write a reliable job).
As distributed systems become increasingly popular, the reliability between services is becoming more important than ever before. Sentinel takes "flow" as breakthrough point, and works on multiple fields including flow control, circuit breaking and system adaptive protection, to guarantee reliability and resiliency of microservices. See the 中文文档 for document in Chinese.
microservice rate-limiting resiliency cloud-nativeHandset Detection for Mobile Applications. Device Description Database
easyXDM is a Javascript library that enables you as a developer to easily work around the limitation set in place by the Same Origin Policy, in turn making it easy to communicate and expose javascript API's across domain boundaries. At the core easyXDM provides a transport stack capable of passing string based messages between two windows, a consumer (the main document) and a provider (a document included using an iframe). It does this by using one of several available techniques, always selecting the most efficient one for the current browser. For all implementations the transport stack offers bi-directionality, reliability, queueing and sender-verification.
Get the rocks out of your socks! Assemble makes you fast at creating web projects. Assemble is used by thousands of projects for rapid prototyping, creating themes, scaffolds, boilerplates, e-books, UI components, API documentation, blogs, building websites / static site generator, alternative to jekyll for gh-pages and more! Assemble can also be used with gulp and grunt.
template-engine build static-site static-site-generator templates scaffold generator scaffolder base toolkit app update generate verb node nodejs gulp grunt assemble async-helper blog boilerplate boilerplates bootstrap builder collection compile component content docs document documentation engines handlebars helper helpers html inflections jekyll layout lodash markdown md page partial post pug render scaffolding site smith static symlink task tasks template templating view views website yamlLimon is a sandbox developed as a research project written in python, which automatically collects, analyzes, and reports on the run time indicators of Linux malware. It allows one to inspect the Linux malware before execution, during execution, and after execution (post-mortem analysis) by performing static, dynamic and memory analysis using open source tools. Limon analyzes the malware in a controlled environment, monitors its activities and its child processes to determine the nature and purpose of the malware. It determines the malware's process activity, interaction with the file system, network, it also performs memory analysis and stores the analyzed artifacts for later analysis.
Dealing with incidents can be stressful. On top of dealing with the issue at hand, responders are often responsible for handling comms, coordinating the efforts of other engineers, and reporting what happened after the fact. Monzo built Response to help reduce the pressure and cognitive burden on engineers during an incident, and to make it easy to create information rich reports for others to learn from.
incident response incident-response incident-management incident-reports slack-botElementary was built out of the need to effortlessly and immediately gain visibility into the data stack, starting with tracing the actual upstream & downstream dependencies in the data warehouse, without any implementation efforts, security risks or compromises on accuracy.
bigquery snowflake data-warehouse dataops data-analysis data-pipelines data-pipeline lineage data-governance data-lineage data-observability analaytics-engineering data-reliabilityThe Web Application Reliability and Defense (WARD) framework is a two-part security solution composed of a vulnerability detection component, SecureUnit, and a vulnerability protection component, SecureFilter.
Hemera (/ˈhɛmərə/; Ancient Greek: Ἡμέρα [hɛːméra] "day") is a small wrapper around the NATS driver. NATS is a simple, fast and reliable solution for the internal communication of a distributed system. It chooses simplicity and reliability over guaranteed delivery. We want to provide a toolkit to develop micro services in an easy and powerful way. We provide a pattern matching RPC style. You don't have to worry about the transport. NATS is powerful.Hemera has not been designed for high performance on a single process. It has been designed to create lots of microservices doesn't matter where they live. It choose simplicity and reliability as primary goals. It act together with NATS as central nervous system of your distributed system. Transport independency was not considered to be a relevant factor. In addition we use pattern matching which is very powerful. The fact that Hemera needs a broker is an argument which should be taken into consideration when you compare hemera with other frameworks. The relevant difference between microservice frameworks like senecajs, molecurer is not the performance or modularity its about the complexity you need to manage. Hemera is expert in providing an interface to work with lots of services in the network, NATS is the expert to deliver the message at the right place. Hemera is still a subscriber of NATS with some magic in routing and extensions. We don't have to worry about all different aspects in a distributed system like routing, load-balancing, service-discovery, clustering, health-checks ...
nodejs distributed-systems microservice nats pubsub rpc cloud-native micro service micro-service microservices micro-services services framework minimum viable product toolkit startup messaging publish subscribe queue distributed queueingAn open source tool based on easy-to-measure internal metrics to provide an empirical estimate of reliability and to provide feedback to developers on the thoroughness of their testing effort relative to prior successful comparable projects.
We have large collection of open source products. Follow the tags from
Tag Cloud >>
Open source products are scattered around the web. Please provide information
about the open source projects you own / you use.
Add Projects.