awesome-distributed-systems - A curated list to learn about distributed systems

  •        22

A (hopefully) curated list on awesome material on distributed systems, inspired by other awesome frameworks like awesome-python. Most links will tend to be readings on architecture itself rather than code itself. Read things here before you start.

https://github.com/theanalyst/awesome-distributed-systems

Tags
Implementation
License
Platform

   




Related Projects

Atomix - Scalable, fault-tolerant distributed systems protocols and primitives for the JVM

  •    Java

Atomix is an event-driven framework for coordinating fault-tolerant distributed systems built on the Raft consensus algorithm. It provides the building blocks that solve many common distributed systems problems including group membership, leader election, distributed concurrency control, partitioning, and replication.

dragonboat - A feature complete and high performance multi-group Raft library in Go.

  •    Go

Dragonboat is a high performance multi-group Raft consensus library in Go with C++11 binding support. Consensus algorithms such as Raft provides fault-tolerance by alllowing a system continue to operate as long as the majority member servers are available. For example, a Raft cluster of 5 servers can make progress even if 2 servers fail. It also appears to clients as a single node with strong data consistency always provided. All running servers can be used to initiate read requests for aggregated read throughput.

awesome-consensus - Awesome list for Paxos and friends

  •    

A curated selection of artisanal consensus algorithms and hand-crafted distributed lock services.

braft - An industrial-grade C++ implementation of RAFT consensus algorithm based on brpc, widely used inside Baidu to build highly-available distributed systems

  •    C++

An industrial-grade C++ implementation of RAFT consensus algorithm and replicated state machine based on brpc. braft is designed and implemented for scenarios demanding for high workload and low overhead of latency, with the consideration for easy-to-understand concepts so that engineers inside Baidu can build their own distributed systems individually and correctly. Build brpc which is the main dependency of braft.


raft-rs - Raft distributed consensus algorithm implemented in Rust.

  •    Rust

When building a distributed system one principal goal is often to build in fault-tolerance. That is, if one particular node in a network goes down, or if there is a network partition, the entire cluster does not fall over. The cluster of nodes taking part in a distributed consensus protocol must come to agreement regarding values, and once that decision is reached, that choice is final. Distributed Consensus Algorithms often take the form of a replicated state machine and log. Each state machine accepts inputs from its log, and represents the value(s) to be replicated, for example, a hash table. They allow a collection of machines to work as a coherent group that can survive the failures of some of its members.

hraftd - A reference use of Hashicorp's Raft implementation

  •    Go

For background on this project check out this blog post. hraftd is a reference example use of the Hashicorp Raft implementation v1.0. Raft is a distributed consensus protocol, meaning its purpose is to ensure that a set of nodes -- a cluster -- agree on the state of some arbitrary state machine, even when nodes are vulnerable to failure and network partitions. Distributed consensus is a fundamental concept when it comes to building fault-tolerant systems.

Copycat - A novel implementation of the Raft consensus algorithm

  •    Java

Copycat is a fault-tolerant state machine replication framework. Built on the Raft consensus algorithm, it handles replication and persistence and enforces strict ordering of inputs and outputs, allowing developers to focus on single-threaded application logic. Its event-driven model allows for efficient client communication with replicated state machines, from simple key-value stores to wait-free locks and leader elections. You supply the state machine and Copycat takes care of the rest, making it easy to build robust, safe distributed systems.

Tendermint - Tendermint Core (BFT Consensus) in Go

  •    Go

Tendermint Core is Byzantine Fault Tolerant (BFT) middleware that takes a state transition machine - written in any programming language - and securely replicates it on many machines.

rafter - An Erlang library application which implements the Raft consensus protocol

  •    Erlang

Rafter is not production ready. It hasn't been tested or run in any production environments. Furthermore, the log is entirely my own creation. It's possible that it isn't as safe or efficient as many other log structured backends like leveldb or rocksdb. The protocol is solid and well tested, and it works as expected in most cases. However, support is not fully guaranteed as I don't have a current use case for rafter and am not actively developing it. I will however do my best to respond to reports and fix bugs in an efficient manner.Rafter is more than just an erlang implementation of the raft consensus protocol. It aims to take the pain away from building Consistent(2F+1 CP) distributed systems, as well as act as a library for leader election and routing. A main goal is to keep a very small user api that automatically handles the problems of the everyday Erlang distributed systems developer. It is hopefully your libPaxos.dll for erlang.

awesome-scalability - Scalable, Available, Stable, Performant, and Intelligent System Design Patterns

  •    

An updated and curated list of readings to illustrate best practices and patterns in building scalable, available, stable, performant, and intelligent large-scale systems. Concepts are explained in the articles of prominent engineers and credible references. Case studies are taken from battle-tested systems that serve millions to billions of users. Understand your problems: scalability problem (fast for a single user but slow under heavy load) or performance problem (slow for a single user) by reviewing some design principles and checking how scalability and performance problems are solved at tech companies. The section of intelligence are created for those who work with data and machine learning at big (data) and deep (learning) scale.

rqlite - The lightweight, distributed relational database built on SQLite.

  •    Go

rqlite is a distributed relational database, which uses SQLite as its storage engine. rqlite uses Raft to achieve consensus across all the instances of the SQLite databases, ensuring that every change made to the system is made to a quorum of SQLite databases, or none at all. It also gracefully handles leader elections, and tolerates failures of machines, including the leader. rqlite is available for Linux, OSX, and Microsoft Windows.rqlite gives you the functionality of a rock solid, fault-tolerant, replicated relational database, but with very easy installation, deployment, and operation. With it you've got a lightweight and reliable distributed relational data store. Think etcd or Consul, but with relational data modelling also available.

etcd - Distributed reliable key-value store for the most critical data of a distributed system

  •    Go

Note: The master branch may be in an unstable or even broken state during development. Please use releases instead of the master branch in order to get stable binaries. etcd is written in Go and uses the Raft consensus algorithm to manage a highly-available replicated log.

logcabin - LogCabin is a distributed storage system built on Raft that provides a small amount of highly replicated, consistent storage

  •    C++

LogCabin is a distributed system that provides a small amount of highly replicated, consistent storage. It is a reliable place for other distributed systems to store their core metadata and is helpful in solving cluster management issues. LogCabin uses the Raft consensus algorithm internally and is actually the very first implementation of Raft. It's released under the ISC license (equivalent to BSD).Information about releases is in RELEASES.md.

raft - Golang implementation of the Raft consensus protocol

  •    Go

raft is a Go library that manages a replicated log and can be used with an FSM to manage replicated state machines. It is a library for providing consensus.The use cases for such a library are far-reaching as replicated state machines are a key component of many distributed systems. They enable building Consistent, Partition Tolerant (CP) systems, with limited fault tolerance as well.

harvey - A distributed operating system

  •    C

Welcome to Harvey, we are delighted that you are interested in the project. Harvey is a distributed operating system. It’s most directly descended from Plan 9 from Bell Labs. This heritage spans from its distributed application architecture all the way down to much of its code. However, Harvey aims to be a more practical, general-purpose operating system, so it also uses ideas and code from other systems.

swarmkit - A toolkit for orchestrating distributed systems at any scale

  •    Go

SwarmKit is a toolkit for orchestrating distributed systems at any scale. It includes primitives for node discovery, raft-based consensus, task scheduling and more. Machines running SwarmKit can be grouped together in order to form a Swarm, coordinating tasks with each other. Once a machine joins, it becomes a Swarm Node. Nodes can either be worker nodes or manager nodes.

testing-distributed-systems - Curated list of resources on testing distributed systems

  •    HTML

List of resources on testing distributed systems curated by Andrey Satarin (@asatarin). Colin Scott shares his viewpoint from academia on testing distributed systems, specifically regression testing for correctness and performance bugs.

dist-prog-book

  •    CSS

Source repo for the book that I and my students in my course at Northeastern University, CS7680 Special Topics in Computing Systems: Programming Models for Distributed Computing, are writing on the topic of programming models for distributed systems. This is a book about the programming constructs we use to build distributed systems. These range from the small, RPC, futures, actors, to the large; systems built up of these components like MapReduce and Spark. We explore issues and concerns central to distributed systems like consistency, availability, and fault tolerance, from the lens of the programming models and frameworks that the programmer uses to build these systems.