Nomad - Easily Deploy Applications at Any Scale


Nomad is a cluster manager designed for both long-lived services and short-lived batch processing workloads. Developers use a declarative job specification to submit work, and Nomad ensures constraints are satisfied and resource utilization is optimized by efficient task packing. Nomad supports all major operating systems and virtualized, containerized, or standalone applications.
Jobs can specify tasks which are Docker containers. Nomad will automatically run the containers on clients which have Docker installed, scale up and down based on the number of instances requested, and automatically recover from failures. Nomad is designed to be a global-scale scheduler. Multiple datacenters can be managed as part of a larger region, and jobs can be scheduled across datacenters if requested. Multiple regions join together and federate jobs making it easy to run jobs anywhere.
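
To make "efficient task packing" concrete, here is a minimal best-fit sketch. It is an illustration only and not Nomad's scheduler: the `Node` class, the `pack` function, the node names, and the resource figures are all hypothetical, and real schedulers weigh many more dimensions and constraints.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A hypothetical client node with remaining CPU (MHz) and memory (MB)."""
    name: str
    cpu: int
    memory: int
    tasks: list = field(default_factory=list)

    def fits(self, cpu, memory):
        return cpu <= self.cpu and memory <= self.memory

    def place(self, task_name, cpu, memory):
        # Reserve the resources and remember which task landed here.
        self.cpu -= cpu
        self.memory -= memory
        self.tasks.append(task_name)


def pack(tasks, nodes):
    """Best-fit placement of (name, cpu, memory) tasks onto nodes.

    Each task goes to the feasible node that would have the least CPU left
    afterwards. Returns a dict of task name -> node name (None if no fit).
    """
    placement = {}
    # Placing the largest tasks first tends to reduce fragmentation.
    for name, cpu, memory in sorted(tasks, key=lambda t: t[1], reverse=True):
        candidates = [n for n in nodes if n.fits(cpu, memory)]
        if not candidates:
            placement[name] = None
            continue
        best = min(candidates, key=lambda n: n.cpu - cpu)
        best.place(name, cpu, memory)
        placement[name] = best.name
    return placement


if __name__ == "__main__":
    nodes = [Node("client-a", cpu=4000, memory=8192),
             Node("client-b", cpu=2000, memory=4096)]
    tasks = [("web", 1500, 1024), ("cache", 500, 2048), ("batch", 3000, 4096)]
    for task, node in pack(tasks, nodes).items():
        print(task, "->", node)
```

Best-fit fills nodes tightly, which leaves other nodes free for large tasks; a spread-oriented policy would instead pick the node with the most capacity left.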



Related Projects

cluster-scheduler-simulator - Automatically exported from code

This simulator can be used to prototype and compare different cluster scheduling strategies and policies. It generates synthetic cluster workloads from empirical parameter distributions (thus generating unique workloads even from a small amount of input data), simulates their scheduling and execution using a discrete event simulator, and finally permits analysis of scheduling performance metrics. While the simulator will simulate job arrival, scheduler decision making and task placement, it does not simulate the actual execution of the tasks or variation in their runtime due to shared resources.
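
The discrete event approach described above can be sketched in a few lines. This is not the simulator's code: the `simulate` function and its parameters are hypothetical, exponential distributions stand in for the empirical ones, and the scheduler is a plain FIFO queue over identical machines.

```python
import heapq
import random
from collections import deque


def simulate(num_jobs, num_machines, mean_interarrival, mean_duration, seed=0):
    """Simulate FIFO scheduling of single-task jobs on identical machines.

    Returns the average time jobs spent waiting for a machine.
    """
    rng = random.Random(seed)
    events = []      # min-heap of (time, sequence, kind, job_id)
    durations = {}   # job_id -> runtime, drawn up front
    seq = 0          # tie-breaker so heap entries never compare equal
    t = 0.0
    for job in range(num_jobs):
        # Synthetic workload: exponential inter-arrival times and runtimes.
        t += rng.expovariate(1.0 / mean_interarrival)
        durations[job] = rng.expovariate(1.0 / mean_duration)
        heapq.heappush(events, (t, seq, "arrive", job))
        seq += 1

    free = num_machines
    queue = deque()  # (job_id, arrival_time) waiting for a machine
    waits = []

    while events:
        now, _, kind, job = heapq.heappop(events)
        if kind == "arrive":
            queue.append((job, now))
        else:  # "finish": the machine becomes available again
            free += 1
        # Scheduler decision: start queued jobs while machines are free.
        while free > 0 and queue:
            next_job, arrived = queue.popleft()
            free -= 1
            waits.append(now - arrived)
            finish = now + durations[next_job]
            heapq.heappush(events, (finish, seq, "finish", next_job))
            seq += 1

    return sum(waits) / len(waits) if waits else 0.0


if __name__ == "__main__":
    for machines in (2, 4, 8):
        print(machines, "machines: average wait", simulate(500, machines, 1.0, 5.0))
```

As in the description, task execution itself is not modelled; a job simply occupies a machine for its pre-drawn duration, and the metric of interest is how long the scheduling policy makes jobs wait.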

hashi-ui - A modern user interface for @hashicorp Consul & Nomad

For Nomad, the motivation was quite simple: no mobile-optimized, (somewhat) feature-complete, live-updating interface existed. Today the Consul and Nomad UIs live in the same binary but do not "cross-talk" with each other. The long-term goal is to integrate them more closely, so that from the Nomad job UI you can see the Consul health check status for the job's tasks and, vice versa, cross-link between two otherwise distinct systems.

spark-cluster-deployment - Automates Spark standalone cluster tasks with Puppet and Fabric.

Apache Spark is a research project for distributed computing which interacts with HDFS and heavily utilizes in-memory caching. Spark 1.0.0 can be deployed to traditional cloud and job management services such as EC2, Mesos, or Yarn. Further, Spark's standalone cluster mode enables Spark to run on other servers without installing other job management services. However, configuring and submitting applications to a Spark 1.0.0 standalone cluster currently requires files to be synchronized across the entire cluster, including the Spark installation directory. This project utilizes Fabric and Puppet to further automate the Spark standalone cluster. The Puppet scripts are MIT-licensed from stefanvanwouw/puppet-spark and wikimedia/puppet-cdh4.

Kubernetes - Container Cluster Manager

Kubernetes is an open source orchestration system for Docker containers. It handles scheduling onto nodes in a compute cluster and actively manages workloads to ensure that their state matches the user's declared intentions. Using the concepts of "labels" and "pods", it groups the containers which make up an application into logical units for easy management and discovery.
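
The idea of grouping by labels can be illustrated with a small equality-based selector. This is a sketch of the concept only, not the Kubernetes API: the `matches` and `select` helpers and the pod records below are hypothetical.

```python
def matches(selector, labels):
    """Return True if every key/value pair in the selector appears in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(pods, selector):
    """Return the names of pods whose labels satisfy the selector."""
    return [pod["name"] for pod in pods if matches(selector, pod["labels"])]


if __name__ == "__main__":
    # Hypothetical pods, each tagged with free-form key/value labels.
    pods = [
        {"name": "web-1", "labels": {"app": "shop", "tier": "frontend"}},
        {"name": "web-2", "labels": {"app": "shop", "tier": "frontend"}},
        {"name": "db-1", "labels": {"app": "shop", "tier": "backend"}},
        {"name": "cron-1", "labels": {"app": "reports"}},
    ]
    print(select(pods, {"app": "shop"}))
    print(select(pods, {"app": "shop", "tier": "frontend"}))
```

Because labels are plain key/value pairs, the same pod can belong to several overlapping groups at once, depending on which selector is applied.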

drmaa - Compute cluster (HPC) job submission library for Go (#golang) based on the open DRMAA standard

This is a job submission library for Go (#golang) which is compatible with the DRMAA standard. The Go library is a wrapper around the DRMAA C library implementation provided by many distributed resource managers (cluster schedulers). The library was developed using Univa Grid Engine. It was tested with Grid Engine and Torque, but it should also work with other resource managers / cluster schedulers.

nomad-helper - Helper command to help ease working with HashiCorp's Nomad scheduler

nomad-helper is a tool meant to enable teams to quickly onboard themselves with Nomad by exposing scaling functionality in a YAML format that is simple to use and share. The project provides build artifacts for Linux, Darwin, and Windows in the GitHub releases tab.

nomad-firehose - Firehose all nomad job, allocation, nodes and evaluations changes to rabbitmq, kinesis or stdout

nomad-firehose is a tool meant to enable teams to quickly build logic around Nomad task events without hooking into the Nomad API. The project provides build artifacts for Linux, Darwin, and Windows in the GitHub releases tab.

Helix - Cluster Management Framework

Helix is a generic cluster management framework used for the automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes. It helps to perform scheduling of maintenance tasks, such as backups, garbage collection, file consolidation, index rebuilds, repartitioning of data or resources across the cluster, informing dependent systems of changes so they can react appropriately to cluster changes, throttling system tasks and changes and so on.

PiCluster - Manage Docker Containers

PiCluster is a simple way to manage Docker containers on multiple hosts. It was created because Docker Swarm was not that good and Kubernetes was too difficult to install on ARM at the time. PiCluster will only build and run images from Dockerfiles on the host specified in the config file. This software is not tied to ARM and also works on regular x86 hardware.

Redisson - Redis based In-Memory Data Grid for Java

Redisson - distributed Java objects and services (Set, Multimap, SortedSet, Map, List, Queue, BlockingQueue, Deque, BlockingDeque, Semaphore, Lock, AtomicLong, Map Reduce, Publish / Subscribe, Bloom filter, Spring Cache, Executor service, Tomcat Session Manager, Scheduler service, JCache API) on top of Redis server. Rich Redis client.

OpenEBS - Containerized Storage for Containers

OpenEBS is containerized block storage written in Go for cloud native and other environments, with per-container (or per-pod) QoS SLAs, tiering and replica policies across AZs and environments, and predictable and scalable performance.

NSQ - A realtime distributed messaging platform in Go

NSQ is a realtime distributed messaging platform designed to operate at scale, handling billions of messages per day. It promotes distributed and decentralized topologies without single points of failure, enabling fault tolerance and high availability coupled with a reliable message delivery guarantee. It scales horizontally, without any centralized brokers. Built-in discovery simplifies the addition of nodes to the cluster.

pycluster - python server client cluster job scheduler

python server client cluster job scheduler

nomad-auto-join - Terraform config to automatically bootstrap a Nomad cluster

In a previous post, we explored how Consul discovers other agents using cloud metadata to bootstrap a cluster. This post looks at Nomad's auto-joining functionality and how we can use Terraform to create an autoscaled cluster. Unlike Consul, Nomad's auto bootstrapping functionality does not use cloud metadata because when Nomad pairs with Consul, we inherit the functionality. Consul's service discovery and health checking is the perfect platform to use for bootstrapping Nomad.

drmaa2 - DRMAA2 Go (#golang) API for job submission, job workflow management, and HPC cluster monitoring

A Go (#golang) API for job submission, job workflow management, and HPC cluster monitoring based on the open OGF DRMAA2 standard. The master branch now contains methods (like job.Reap()) that are specified in the DRMAA2 2015 Errata. Those methods might not (yet) be in the underlying DRMAA2 C library. Hence I created a branch that is compatible with older DRMAA2 C implementations (UGE_82_Compatible), which can be used instead.


DNF Security's integrated Falcon-DL VME Cluster (Diskless), a video recording and management server cluster, boots directly from your Seahawk 6GS-Class IP SAN, simplifying deployment and virtually eliminating points of failure in your video recording server. DNF Security provides video management systems.

DLWorkspace - Deep Learning Workspace

Deep Learning Workspace (DLWorkspace) is an open source toolkit that allows AI scientists to spin up an AI cluster in turn-key fashion (either in a public cloud such as Azure, or in an on-prem cluster). It has been used in daily production by Microsoft internal groups (e.g., Microsoft Cognitive Service, SwiftKey, Bing Relevance, etc.). Once set up, DLWorkspace provides a web UI and/or RESTful API that allows AI scientists to run jobs (interactive exploration, training, inferencing, data analytics) on the cluster, with resources allocated by the DLWorkspace cluster for each job (e.g., a single-node job with a couple of GPUs with GPU Direct connection, or a distributed job with multiple GPUs per node). DLWorkspace also provides a unified job template and operating environment that allow AI scientists to easily share their jobs and settings among themselves and with the outside community. Out of the box, DLWorkspace supports all major deep learning toolkits (TensorFlow, CNTK, Caffe, MxNet, etc.) and popular big data analytics toolkits such as Hadoop and Spark.

kubernetes-app - Grafana App for Kubernetes

Kubernetes is an open-source system for automating deployment, scaling, and management of containerized applications. The Grafana Kubernetes App allows you to monitor your Kubernetes cluster's performance. It includes 4 dashboards: Cluster, Node, Pod/Container, and Deployment. It also comes with Intel Snap collectors that are deployed to your cluster to collect health metrics. The metrics collected are high-level cluster and node stats as well as lower-level pod and container stats. Use the high-level metrics to alert on and the low-level metrics to troubleshoot.

Flocker - Container data volume manager for your Dockerized application

Flocker is an open-source Container Data Volume Manager for your Dockerized applications. By providing tools for data migrations, Flocker gives ops teams the tools they need to run containerized stateful services like databases in production. Unlike a Docker data volume which is tied to a single server, a Flocker data volume, called a dataset, is portable and can be used with any container, no matter where that container is running.

Apache REEF - a stdlib for Big Data

Apache REEF (Retainable Evaluator Execution Framework) is a library for developing portable applications for cluster resource managers such as Apache Hadoop YARN or Apache Mesos. For example, Microsoft Azure Stream Analytics is built on REEF and Hadoop.