AirflowETL - Blog post on ETL pipelines with Airflow

  •        115

In this blog post I want to go over the operations of data engineering called Extract, Transform, Load (ETL) and show how they can be automated and scheduled using Apache Airflow. You can see the source code for this project here. Extracting data can be done in a multitude of ways, but one of the most common ways is to query a WEB API. If the query is sucessful, then we will receive data back from the API's server. Often times the data we get back is in the form of JSON. JSON can pretty much be thought of a semi-structured data or as a dictionary where the dictionary keys and values are strings. Since the data is a dictionary of strings this means we must transform it before storing or loading into a database. Airflow is a platform to schedule and monitor workflows and in this post I will show you how to use it to extract the daily weather in New York from the OpenWeatherMap API, convert the temperature to Celsius and load the data in a simple PostgreSQL database.



Related Projects

data-integration - A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow

  •    Python

Data integration pipelines as code: pipelines, tasks and commands are created using declarative Python code. PostgreSQL as a data processing engine.

pocket-etl - Extensible java library that orchestrates batched ETL (extract, transform and load) of data between services using native fluent java to express your pipeline

  •    Java

Extensible Java library that orchestrates batched ETL (extract, transform and load) of data between services using native fluent Java to express your pipeline.

ratchet - A library for performing data pipeline / ETL tasks in Go.

  •    Go

The Go programming language's simplicity, execution speed, and concurrency support make it a great choice for building data pipeline systems that can perform custom ETL (Extract, Transform, Load) tasks. Ratchet is a library that is written 100% in Go, and let's you easily build custom data pipelines by writing your own Go code. Each data processor is receiving, processing, and then sending data to the next stage in the pipeline. All data processors are running in their own goroutine, so all processing is happening concurrently. Go channels are connecting each stage of processing, so the syntax for sending data will be intuitive for anyone familiar with Go. All data being sent and received is JSON, which provides for a nice balance of flexibility and consistency.

tributary - Streaming reactive and dataflow graphs in Python

  •    Python

Tributary is a library for constructing dataflow graphs in python. Unlike many other DAG libraries in python (airflow, luigi, prefect, dagster, dask, kedro, etc), tributary is not designed with data/etl pipelines or scheduling in mind. Instead, tributary is more similar to libraries like mdf, pyungo, streamz, or pyfunctional, in that it is designed to be used as the implementation for a data model. One such example is the greeks library, which leverages tributary to build data models for options pricing.

Apache Beam - Unified model for defining both batch and streaming data-parallel processing pipelines

  •    Java

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

Transporter - Sync data between persistence engines, like ETL only not stodgy

  •    Go

Compose Transporter helps with database transformations from one store to another. It can also sync from one to another or several stores.Transporter allows the user to configure a number of data adaptors as sources or sinks. These can be databases, files or other resources. Data is read from the sources, converted into a message format, and then send down to the sink where the message is converted into a writable format for its destination. The user can also create data transformations in JavaScript which can sit between the source and sink and manipulate or filter the message flow.

Apache Tajo - A big data warehouse system on Hadoop

  •    Java

Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources. - A Clojure high performance data processing system

  •    Clojure is a Clojure library for data processing and machine learning. Datasets are currently in-memory columnwise databases and we support parsing from file or input-stream. We support these formats: raw/gzipped csv/tsv, xls, xlsx, json, and sequences of maps as input sources. SQL bindings are provided as a separate library. Data size in memory is minimized (primitive arrays), datetime types are often converted to an integer representation and strings are loaded into string tables. These features together dramatically decrease the working set size in memory. Because data is stored in columnar fashion columnwise operations on the dataset are very fast.

firebolt - Golang framework for streaming ETL, observability data pipeline, and event processing apps

  •    Go

Firebolt has a simple model intended to make it easier to write reliable pipeline applications that process a stream of data. Every application's pipeline starts with a single source, the component that receives events from some external system. Sources must implement the node.Source interface.

aegisthus - A Bulk Data Pipeline out of Cassandra

  •    Java

Aegisthus has been transitioned to maintenance mode. It is still used for ETL at Netflix for Cassandra 2.x clusters, but it will not be evolving further.A Bulk Data Pipeline out of Cassandra. Aegisthus implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family.

synch - Sync data from the other DB to ClickHouse(cluster)

  •    Python

Sync data from other DB to ClickHouse, current support postgres and mysql, and support full and increment ETL. synch will read default config from ./synch.yaml, or you can use synch -c specify config file.

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows

  •    Python

Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.

stream-reactor - Streaming reference architecture for ETL with Kafka and Kafka-Connect

  •    Scala

Lenses offers SQL (for data browsing and Kafka Streams), Kafka Connect connector management, cluster monitoring and more. A collection of components to build a real time ingestion pipeline.

dagster - An orchestration platform for the development, production, and observation of data assets.

  •    Python

An orchestration platform for the development, production, and observation of data assets. Dagster lets you define jobs in terms of the data flow between reusable, logical components, then test locally and run anywhere. With a unified view of jobs and the assets they produce, Dagster can schedule and orchestrate Pandas, Spark, SQL, or anything else that Python can invoke.

pglogical - Logical Replication extension for PostgreSQL 9

  •    C

The pglogical extension provides logical streaming replication for PostgreSQL, using a publish/subscribe model. It is based on technology developed as part of the BDR project ( To use pglogical the provider and subscriber must be running PostgreSQL 9.4 or newer.

ByteBase - Web-based, zero-config, dependency-free database schema change and version control tool for teams

  •    Go

Bytebase is a web-based, zero-config, dependency-free database schema change and version control management tool for developers and DBAs. It is for developers to collaborate on database schemas changes. It helps to construct a single pipeline to propagate the schema change across multiple environments. It can also store the schemas in VCS and trigger a new pipeline upon commit push.

Ora2Pg - Tool used to migrate an Oracle database to a PostgreSQL compatible schema

  •    Perl

Ora2Pg is a free tool used to migrate an Oracle database to a PostgreSQL compatible schema. It connects your Oracle database, scan it automatically and extracts its structure or data, it then generates SQL scripts that you can load into PostgreSQL.

PostGIS - Spatial and Geographic objects for PostgreSQL

  •    C

PostGIS is a spatial database extender for PostgreSQL object-relational database. It adds support for geographic objects allowing location queries to be run in SQL. PostGIS adds extra types (geometry, geography, raster and others) to the PostgreSQL database. It also adds functions, operators, and index enhancements that apply to these spatial types.

We have large collection of open source products. Follow the tags from Tag Cloud >>

Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.