Azkaban - Batch workflow job scheduler created at LinkedIn to run Hadoop jobs


Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves execution ordering through job dependencies and provides an easy-to-use web user interface to maintain and track your workflows. Azkaban was designed primarily with usability in mind. It has been running at LinkedIn for several years and drives many of their Hadoop and data warehouse processes.
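
To make the dependency model concrete, here is a minimal sketch in Azkaban's classic .job property-file format; the job names and commands are hypothetical:

    # extract.job -- runs first
    type=command
    command=bash ./extract.sh

    # transform.job -- runs only after "extract" succeeds
    type=command
    command=bash ./transform.sh
    dependencies=extract

The files are zipped into a project and uploaded through the web UI or HTTP API; Azkaban derives the execution order from the dependencies lines.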

Its features include:

  • Compatible with any version of Hadoop
  • Easy-to-use web UI
  • Simple web and HTTP workflow uploads
  • Project workspaces
  • Scheduling of workflows
  • Modular and pluggable
  • Authentication and authorization
  • Tracking of user actions
  • Email alerts on failures and successes
  • SLA alerting and auto-killing
  • Retrying of failed jobs

https://azkaban.github.io/
https://github.com/azkaban/azkaban

Related Projects

Luigi - Python module that helps you build complex pipelines of batch jobs

  •    Python

The purpose of Luigi is to address all the plumbing typically associated with long-running batch processes: you want to chain many tasks, automate them, and cope with the failures that will inevitably happen. These tasks can be anything, but are typically long-running things like Hadoop jobs, dumping data to/from databases, running machine learning algorithms, or anything else.
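
As a sketch of how such chaining looks in practice, a Luigi pipeline declares its dependency edges through requires() and its completion markers through output(); the task and file names below are hypothetical:

    import luigi

    class Extract(luigi.Task):
        def output(self):
            return luigi.LocalTarget("extract.txt")

        def run(self):
            with self.output().open("w") as out:
                out.write("raw data\n")

    class Transform(luigi.Task):
        # requires() declares the dependency edge of the pipeline
        def requires(self):
            return Extract()

        def output(self):
            return luigi.LocalTarget("transform.txt")

        def run(self):
            with self.input().open() as src, self.output().open("w") as out:
                out.write(src.read().upper())

    if __name__ == "__main__":
        # local_scheduler avoids needing a central luigid daemon
        luigi.build([Transform()], local_scheduler=True)

Because output() acts as a completion marker, rerunning the pipeline skips tasks whose targets already exist, which is how Luigi recovers from mid-pipeline failures.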

Apache Oozie - Workflow Scheduler for Hadoop

  •    Java

Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie workflow jobs are Directed Acyclic Graphs (DAGs) of actions, triggered by time (frequency) and data availability. Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system-specific jobs (such as Java programs and shell scripts).
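
As a hedged sketch of what such a DAG of actions looks like, an Oozie workflow is defined in XML; the workflow name, action, and command below are hypothetical:

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
        <start to="say-hello"/>
        <action name="say-hello">
            <shell xmlns="uri:oozie:shell-action:0.2">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <exec>echo</exec>
                <argument>hello</argument>
            </shell>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Shell action failed</message>
        </kill>
        <end name="end"/>
    </workflow-app>

Each action declares explicit ok/error transitions, which is how the DAG encodes both the happy path and failure handling.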

azkaban - Azkaban workflow manager.

  •    Java

Azkaban builds use Gradle and require Java 8 or higher. The following commands run on *nix platforms such as Linux and OS X.
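
Per the project README, the build boils down to commands along these lines:

    # clone and build; installDist assembles runnable distributions
    git clone https://github.com/azkaban/azkaban.git
    cd azkaban
    ./gradlew build installDist

    # optionally skip tests while iterating
    ./gradlew build installDist -x test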

DataSphereStudio - DataSphereStudio is a one-stop data application development & management portal, covering scenarios including data exchange, desensitization/cleansing, analysis/mining, quality measurement, visualization, and task scheduling

  •    Java

DataSphere Studio (DSS for short) is a self-developed, one-stop data application development and management portal from WeDataSphere, WeBank's big data platform. Based on the Linkis computation middleware, DSS can easily integrate upper-level data application systems, making data application development simple and easy to use.

Schedulis - Schedulis is a high-performance workflow task scheduling system that supports high availability and multi-tenant, financial-grade features, works with the Linkis computing middleware, and has been integrated into the data application development portal DataSphere Studio

  •    Java

Schedulis is a high-performance workflow task scheduling system that supports high availability and multi-tenant, financial-grade features. It works with the Linkis computing middleware and has been integrated into the data application development portal DataSphere Studio.


Hue - The open source Apache Hadoop UI

  •    Java

Hue is a Web application for interacting with Apache Hadoop. It supports a FileBrowser for accessing HDFS, a JobBrowser for accessing MapReduce jobs (MR1/MR2-YARN), a Job Designer for creating MapReduce/Streaming/Java jobs, an HBase Browser for exploring and modifying HBase tables and data, an Oozie App for submitting and scheduling workflows and bundles, a Pig/HBase/Sqoop2 shell, the Beeswax application for executing Hive queries, and a Search app for querying Solr and Solr Cloud.

spring-hadoop - Spring for Apache Hadoop is a framework for application developers to take advantage of the features of both Hadoop and Spring

  •    Java

The Spring for Apache Hadoop project provides extensions to Spring, Spring Batch, and Spring Integration to build manageable and robust pipeline solutions around Hadoop. Spring for Apache Hadoop extends Spring Batch by providing support for reading from and writing to HDFS, running various types of Hadoop jobs (Java MapReduce, Streaming, Hive, Spark, Pig), and using HBase. An important goal is to provide excellent support for non-Java developers to be productive using Spring for Apache Hadoop without having to write any Java code to use the core feature set.

argo-workflows - Workflow engine for Kubernetes

  •    Go

Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition). Argo is a Cloud Native Computing Foundation (CNCF) hosted project.
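
For flavor, a minimal Workflow resource might look like the sketch below; the names and container image are placeholders:

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: hello-world-    # Argo appends a random suffix
    spec:
      entrypoint: main
      templates:
      - name: main
        container:
          image: alpine:3.19
          command: [echo]
          args: ["hello from Argo"]

Because the Workflow is just a Kubernetes custom resource, it can be submitted with the argo CLI (argo submit) or plain kubectl.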

Apache Tez - A Framework for YARN-based, Data Processing Applications In Hadoop

  •    Java

Apache Tez is an extensible framework for building high-performance batch and interactive data processing applications, coordinated by YARN in Apache Hadoop. Tez improves on the MapReduce paradigm by dramatically increasing its speed while maintaining MapReduce's ability to scale to petabytes of data. Important Hadoop ecosystem projects like Apache Hive and Apache Pig use Apache Tez, as do a growing number of third-party data access applications developed for the broader Hadoop ecosystem.

mrjob - Run MapReduce jobs on Hadoop or Amazon Web Services

  •    Python

mrjob is a Python 2.7/3.3+ package that helps you write and run Hadoop Streaming jobs. It fully supports Amazon's Elastic MapReduce (EMR) service, which allows you to buy time on a Hadoop cluster on an hourly basis. mrjob has basic support for Google Cloud Dataproc (Dataproc) which allows you to buy time on a Hadoop cluster on a minute-by-minute basis. It also works with your own Hadoop cluster.
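
The classic mrjob example is a streaming word count; a minimal sketch, with a hypothetical file name of mr_word_count.py:

    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        # mapper receives (None, line) for each input line
        def mapper(self, _, line):
            for word in line.split():
                yield word.lower(), 1

        # reducer sums the per-word counts emitted by the mappers
        def reducer(self, word, counts):
            yield word, sum(counts)

    if __name__ == "__main__":
        MRWordCount.run()

Run it locally with python mr_word_count.py input.txt, or point it at a cluster with -r hadoop, -r emr, or -r dataproc.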

oozie - Oozie - workflow engine for Hadoop

  •    Java

Oozie is a workflow engine for Hadoop.

dagster - An orchestration platform for the development, production, and observation of data assets.

  •    Python

An orchestration platform for the development, production, and observation of data assets. Dagster lets you define jobs in terms of the data flow between reusable, logical components, then test locally and run anywhere. With a unified view of jobs and the assets they produce, Dagster can schedule and orchestrate Pandas, Spark, SQL, or anything else that Python can invoke.
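
A minimal sketch of that asset-centric style, with hypothetical asset names; Dagster wires upstream assets in by parameter name:

    from dagster import asset, materialize

    @asset
    def raw_numbers():
        return [1, 2, 3]

    @asset
    def doubled_numbers(raw_numbers):
        # depends on raw_numbers; Dagster passes its output in
        return [n * 2 for n in raw_numbers]

    if __name__ == "__main__":
        result = materialize([raw_numbers, doubled_numbers])
        assert result.success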

Schedulix - Enterprise Job Scheduling System

  •    Java

Schedulix is the open source enterprise job scheduling system that meets the complex requirements of modern IT process automation. It supports complex, hierarchical workflow modelling; workflows that can be dynamically submitted or parallelized; automatic reruns of sub-workflows; load balancing; sticky allocations; time scheduling; and a lot more.

dr-elephant - Performance monitoring and tuning tool for Apache Hadoop

  •    Java

Dr. Elephant is a performance monitoring and tuning tool for Hadoop and Spark. It automatically gathers all the metrics, runs analysis on them, and presents them in a simple way for easy consumption. Its goal is to improve developer productivity and increase cluster efficiency by making it easier to tune the jobs. It analyzes Hadoop and Spark jobs using a set of pluggable, configurable, rule-based heuristics that provide insights on how a job performed, and then uses the results to make suggestions about how to tune the job to perform more efficiently. For more information on Dr. Elephant, check the project's wiki pages.

Scalding - A Scala API for Cascading

  •    Scala

Scalding is a Scala library that makes it easy to specify Hadoop MapReduce jobs. Scalding is built on top of Cascading, a Java library that abstracts away low-level Hadoop details. Scalding is comparable to Pig, but offers tight integration with Scala, bringing advantages of Scala to your MapReduce jobs.

chronos - Fault tolerant job scheduler for Mesos which handles dependencies and ISO8601 based schedules

  •    Scala

Chronos is a replacement for cron. It is a distributed and fault-tolerant scheduler that runs on top of Apache Mesos and can be used for job orchestration. It supports custom Mesos executors as well as the default command executor; thus by default, Chronos executes sh (on most systems, bash) scripts. Chronos can be used to interact with systems such as Hadoop (incl. EMR) even if the Mesos agents on which execution happens do not have Hadoop installed. Included wrapper scripts allow transferring files to a remote machine, executing them in the background, and using asynchronous callbacks to notify Chronos of job completion or failure. Chronos is also natively able to schedule jobs that run inside Docker containers.
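
A hedged sketch of scheduling a job over Chronos's REST API; the host, port, and job fields are placeholders, and the schedule uses the ISO8601 repeating-interval syntax the heading refers to (here: daily from the given start time):

    # POST a job definition to a Chronos node (placeholder host/port)
    curl -L -H 'Content-Type: application/json' -X POST \
      -d '{"name": "nightly-report",
           "command": "bash report.sh",
           "owner": "ops@example.com",
           "schedule": "R/2024-01-01T02:00:00Z/P1D"}' \
      http://chronos-host:8080/scheduler/iso8601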

gush - Fast and distributed workflow runner using only Sidekiq and Redis

  •    Ruby

Gush is a parallel workflow runner using only Redis as storage and ActiveJob for scheduling and executing jobs. Gush relies on directed acyclic graphs to store dependencies, see Parallelizing Operations With Dependencies by Stephen Toub to learn more about this method.

Geckon Octopus


Geckon Octopus is a workflow engine built for handling plugins that execute async jobs such as video encoding, FTP transfers, taking stills, etc. You define workflows in XML that execute jobs, which can run in parallel and/or sequentially.





