dkron - Dkron - Distributed, fault tolerant job scheduling system

  •        119

Dkron is written in Go and leverage the power of distributed key-value stores and serf for providing fault tolerance, reliability and scalability while keeping simple and easily instalable. Dkron is inspired by the google whitepaper Reliable Cron across the Planet and by Airbnb Chronos borrowing the same features from it.



Related Projects

chronos - Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules

  •    Scala

Chronos is a replacement for cron. It is a distributed and fault-tolerant scheduler that runs on top of Apache Mesos that can be used for job orchestration. It supports custom Mesos executors as well as the default command executor. Thus by default, Chronos executes sh (on most systems bash) scripts. Chronos can be used to interact with systems such as Hadoop (incl. EMR), even if the Mesos agents on which execution happens do not have Hadoop installed. Included wrapper scripts allow transfering files and executing them on a remote machine in the background and using asynchronous callbacks to notify Chronos of job completion or failures. Chronos is also natively able to schedule jobs that run inside Docker containers.

cronsun - A Distributed, Fault-Tolerant Cron-Style Job System.

  •    Go

cronsun is a distributed cron-style job system. It's similar with crontab on stand-alone *nix. The goal of this project is to make it much easier to manage jobs on lots of machines and provides high availability. cronsun is different from Azkaban, Chronos, Airflow.

Hystrix - Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable

  •    Java

Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.See the Wiki for full documentation, examples, operational details and other information.


  •    Elixir

Elixir provides the joy and productivity of Ruby with the concurrency and fault-tolerance of Erlang. Together, we'll take a guided tour through the language, going from the very basics to macros and distributed programming. Along the way, we'll see how Elixir embraces concurrency and how we can construct self-healing programs that restart automatically on failure. Attendees should leave with a great head-start into Elixir, some minor language envy, and a strong desire to continue exploration. Erlang is a functional language with focuses on concurrency and fault tolerance. It first appeared in 1986 as part of Ericsson's quest to build fault tolerant, scalable telecommunication systems. It was open-sourced in 1998 and has gone on to power large portions of the worlds telecom systems, in addition to some of the most popular online services.

Tendermint - Tendermint Core (BFT Consensus) in Go

  •    Go

Tendermint Core is Byzantine Fault Tolerant (BFT) middleware that takes a state transition machine - written in any programming language - and securely replicates it on many machines.

hystrix-go - Netflix's Hystrix latency and fault tolerance library, for Go

  •    Go

Hystrix is a great project from Netflix.Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.


  •    CSS

Source repo for the book that I and my students in my course at Northeastern University, CS7680 Special Topics in Computing Systems: Programming Models for Distributed Computing, are writing on the topic of programming models for distributed systems. This is a book about the programming constructs we use to build distributed systems. These range from the small, RPC, futures, actors, to the large; systems built up of these components like MapReduce and Spark. We explore issues and concerns central to distributed systems like consistency, availability, and fault tolerance, from the lens of the programming models and frameworks that the programmer uses to build these systems.

distributed-consensus-reading-list - List of academic papers on distributed consensus


This markdown file contains a list of academic papers (and other works) in the field of distributed consensus. Many of the papers listed below fit into more than one section. However, for simplicity, each paper is listed only in the most relevant section. Where possible, open access links for each paper have been provided. Contributions are welcome. This section lists theoretical results relating to distributed consensus.

sidekiq-cron - Scheduler / Cron for Sidekiq jobs

  •    Ruby

A scheduling add-on for Sidekiq. Runs a thread alongside Sidekiq workers to schedule jobs at specified times (using cron notation * * * * * parsed by Rufus-Scheduler, more about cron notation.

Jepsen - A framework for distributed systems verification, with fault injection

  •    Clojure

Jepsen is a Clojure library. A test is a Clojure program which uses the Jepsen library to set up a distributed system, run a bunch of operations against that system, and verify that the history of those operations makes sense. Jepsen has been used to verify everything from eventually-consistent commutative databases to linearizable coordination systems to distributed task schedulers. It can also generate graphs of performance and availability, helping you characterize how a system responds to different faults.

raft-rs - Raft distributed consensus algorithm implemented in Rust.

  •    Rust

When building a distributed system one principal goal is often to build in fault-tolerance. That is, if one particular node in a network goes down, or if there is a network partition, the entire cluster does not fall over. The cluster of nodes taking part in a distributed consensus protocol must come to agreement regarding values, and once that decision is reached, that choice is final. Distributed Consensus Algorithms often take the form of a replicated state machine and log. Each state machine accepts inputs from its log, and represents the value(s) to be replicated, for example, a hash table. They allow a collection of machines to work as a coherent group that can survive the failures of some of its members.

Akka - Build Concurrent and Scalable Applications

  •    Java

Akka is the platform for the next generation event-driven, scalable and fault-tolerant architectures on the JVM. It helps to write simpler correct concurrent applications using Actors, STM & Transactors. It could scale out on multi-core or multiple nodes using asynchronous message passing. For fault-tolerance it adopts the Let it crash or Embrace failure model to build applications that self-heals, systems that never stop.

Atomix - Scalable, fault-tolerant distributed systems protocols and primitives for the JVM

  •    Java

Atomix is an event-driven framework for coordinating fault-tolerant distributed systems built on the Raft consensus algorithm. It provides the building blocks that solve many common distributed systems problems including group membership, leader election, distributed concurrency control, partitioning, and replication.

Cassandra - Scalable Distributed Database

  •    Java

The Apache Cassandra Project develops a highly scalable second-generation distributed database, bringing together Dynamo's fully distributed design and Bigtable's ColumnFamily-based data model. Cassandra is suitable for applications that can't afford to lose data. Data is automatically replicated to multiple nodes for fault-tolerance.

dragonboat - A feature complete and high performance multi-group Raft library in Go.

  •    Go

Dragonboat is a high performance multi-group Raft consensus library in Go with C++11 binding support. Consensus algorithms such as Raft provides fault-tolerance by alllowing a system continue to operate as long as the majority member servers are available. For example, a Raft cluster of 5 servers can make progress even if 2 servers fail. It also appears to clients as a single node with strong data consistency always provided. All running servers can be used to initiate read requests for aggregated read throughput.

raft - Golang implementation of the Raft consensus protocol

  •    Go

raft is a Go library that manages a replicated log and can be used with an FSM to manage replicated state machines. It is a library for providing consensus.The use cases for such a library are far-reaching as replicated state machines are a key component of many distributed systems. They enable building Consistent, Partition Tolerant (CP) systems, with limited fault tolerance as well.

Bistro - Flexible distributed scheduler, a high-performance framework supporting multiple paradigms while retaining ease of configuration, management, and monitoring

  •    C++

Bistro is a toolkit for making services that schedule and execute tasks. Bistro is a toolkit for making distributed computation systems. It can schedule and run distributed tasks, including data-parallel jobs. It enforces resource constraints for worker hosts and data-access bottlenecks. It supports remote worker pools, low-latency batch scheduling, dynamic shards, and a variety of other possibilities. It has command-line and web UIs.



Hystrix.NET is a C# port of Hystrix, which is a latency and fault tolerance library for complex distributed systems.

incubator-gobblin - Gobblin is a distributed big data integration framework (ingestion, replication, compliance, retention) for batch and streaming systems

  •    Java

Apache Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volume of data from a variety of data sources, e.g., databases, rest APIs, FTP/SFTP servers, filers, etc., onto Hadoop. Apache Gobblin handles the common routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, data publishing, etc. Gobblin ingests data from different data sources in the same execution framework, and manages metadata of different sources all in one place. This, combined with other features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability of handling data model evolution, makes Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.