In this blog post I want to go over the core data engineering operations known as Extract, Transform, Load (ETL) and show how they can be automated and scheduled using Apache Airflow. You can see the source code for this project here. Extracting data can be done in a multitude of ways, but one of the most common is to query a web API. If the query is successful, we will receive data back from the API's server, often in the form of JSON. JSON can be thought of as semi-structured data, or as a dictionary whose keys and values are strings. Since the data is a dictionary of strings, we must transform it before storing or loading it into a database. Airflow is a platform to schedule and monitor workflows, and in this post I will show you how to use it to extract the daily weather in New York from the OpenWeatherMap API, convert the temperature to Celsius, and load the data into a simple PostgreSQL database.
http://michael-harmon.com/blog/AirflowETL.html
Tags | airflow sql database schedule etl postgresql data-engineering data-pipeline etl-pipeline |
Implementation | Jupyter Notebook |
License | MIT |
Platform |
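To make the workflow concrete, here is a minimal sketch of the kind of DAG the post describes. The endpoint, API key, connection string, and `weather` table are placeholders for illustration, and the post's actual code may differ:

```python
from datetime import datetime, timedelta

import psycopg2
import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

API_KEY = "YOUR_API_KEY"  # placeholder: your own OpenWeatherMap key
URL = f"https://api.openweathermap.org/data/2.5/weather?q=New%20York&appid={API_KEY}"

def extract(ti):
    # Query the web API; the response body is JSON.
    resp = requests.get(URL)
    resp.raise_for_status()
    ti.xcom_push(key="raw", value=resp.json())

def transform(ti):
    raw = ti.xcom_pull(key="raw", task_ids="extract")
    # OpenWeatherMap reports temperature in Kelvin; convert to Celsius.
    celsius = raw["main"]["temp"] - 273.15
    ti.xcom_push(key="record", value={"city": raw["name"], "temp_c": celsius})

def load(ti):
    rec = ti.xcom_pull(key="record", task_ids="transform")
    # Assumes a table weather(city text, temp_c float, ts timestamp) exists.
    conn = psycopg2.connect("dbname=weather user=airflow")  # hypothetical DSN
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO weather (city, temp_c, ts) VALUES (%s, %s, %s)",
                    (rec["city"], rec["temp_c"], datetime.utcnow()))
    conn.close()

with DAG("weather_etl", start_date=datetime(2023, 1, 1),
         schedule_interval=timedelta(days=1), catchup=False) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
```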
Data integration pipelines as code: pipelines, tasks and commands are created using declarative Python code, with PostgreSQL as the data processing engine.
Tags: etl data-integration postgresql pipeline data
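The project is not named in this listing, so the following is a purely hypothetical sketch of what "pipelines as declarative Python code" with the work pushed down to PostgreSQL can look like; every name in it is invented for illustration:

```python
# Hypothetical pipeline-as-code style: the pipeline is plain, declarative
# Python data; a runner would execute each task's SQL inside PostgreSQL.
pipeline = {
    "name": "daily_sales",
    "schedule": "@daily",
    "tasks": [
        {"name": "stage_orders",
         "sql": "CREATE TABLE stage.orders AS SELECT * FROM raw.orders"},
        {"name": "aggregate",
         "depends_on": ["stage_orders"],
         "sql": """CREATE TABLE mart.daily_sales AS
                   SELECT order_date, SUM(amount) AS total
                   FROM stage.orders GROUP BY order_date"""},
    ],
}
```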
Extensible Java library that orchestrates batched ETL (extract, transform and load) of data between services using native fluent Java to express your pipeline.
Tags: etl extract-transform-load etl-tools
The Go programming language's simplicity, execution speed, and concurrency support make it a great choice for building data pipeline systems that can perform custom ETL (Extract, Transform, Load) tasks. Ratchet is a library written 100% in Go that lets you easily build custom data pipelines by writing your own Go code. Each data processor receives, processes, and then sends data to the next stage in the pipeline. Every data processor runs in its own goroutine, so all processing happens concurrently. Go channels connect each stage of processing, so the syntax for sending data will be intuitive for anyone familiar with Go. All data being sent and received is JSON, which provides a nice balance of flexibility and consistency.
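Ratchet itself is a Go library, but the staged-pipeline architecture it describes (concurrent stages connected by channels) translates to a short Python analogy using threads and queues. This is an illustrative sketch, not Ratchet's API:

```python
import queue
import threading

def stage(worker, inbox, outbox=None):
    # Each stage runs in its own thread (cf. one goroutine per processor);
    # queues play the role of Go channels connecting the stages.
    def run():
        while (item := inbox.get()) is not None:
            result = worker(item)
            if outbox is not None:
                outbox.put(result)
        if outbox is not None:
            outbox.put(None)  # propagate end-of-stream downstream
    t = threading.Thread(target=run)
    t.start()
    return t

raw, transformed = queue.Queue(), queue.Queue()
stage(lambda rec: {**rec, "temp_c": rec["temp_k"] - 273.15}, raw, transformed)
last = stage(lambda rec: print("load:", rec), transformed)

raw.put({"city": "New York", "temp_k": 295.0})
raw.put(None)  # close the stream
last.join()
```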
Tributary is a library for constructing dataflow graphs in Python. Unlike many other DAG libraries in Python (airflow, luigi, prefect, dagster, dask, kedro, etc), tributary is not designed with data/ETL pipelines or scheduling in mind. Instead, tributary is more similar to libraries like mdf, pyungo, streamz, or pyfunctional, in that it is designed to be used as the implementation for a data model. One such example is the greeks library, which leverages tributary to build data models for options pricing.
Tags: streaming kafka stream asynchronous websockets python3 lazy-evaluation data-pipeline reactive-data-streams python-data-streams
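As a rough illustration of the dataflow-as-data-model idea (not tributary's actual API; the class and names below are invented), a lazily evaluated node graph fits in a few lines, echoing the options-pricing use case:

```python
class Node:
    """A lazily evaluated node in a small dataflow graph (illustrative only)."""
    def __init__(self, func, *deps):
        self.func, self.deps = func, deps

    def __call__(self):
        # Recompute from dependencies on demand, as a data model would.
        return self.func(*(d() for d in self.deps))

spot = Node(lambda: 101.0)    # market input
strike = Node(lambda: 100.0)
intrinsic = Node(lambda s, k: max(s - k, 0.0), spot, strike)
print(intrinsic())  # 1.0, recomputed whenever the inputs change
```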
Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam's supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.
Tags: data-processing data-streaming batch-processing stream-processing distributed big-data
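For instance, with the Beam Python SDK the pipeline definition is runner-agnostic; the sketch below runs on the local (direct) runner with a hard-coded input for brevity:

```python
import apache_beam as beam

# A small batch word-count pipeline: the same definition could execute
# on Flink, Spark, or Dataflow by switching the runner.
with beam.Pipeline() as p:
    (p
     | "Read" >> beam.Create(["to be or not to be"])
     | "Split" >> beam.FlatMap(str.split)
     | "Count" >> beam.combiners.Count.PerElement()
     | "Print" >> beam.Map(print))
```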
Compose Transporter helps with database transformations from one store to another. It can also sync from one store to another, or to several stores. Transporter allows the user to configure a number of data adaptors as sources or sinks. These can be databases, files or other resources. Data is read from the sources, converted into a message format, and then sent down to the sink, where the message is converted into a writable format for its destination. The user can also create data transformations in JavaScript which sit between the source and sink and manipulate or filter the message flow.
Tags: etl mongodb elasticsearch rethinkdb postgresql rabbitmq database-migration database-tools
Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load) on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources.
Tags: data-warehouse etl aggregation analytics sql-on-hadoop
Vectorization Rosetta Stone for the JVM.
Tags: etl spark machine-learning transformations svmlight hadoop-ecosystem writables schema pipeline formatter datapipeline data-munging
tech.ml.dataset is a Clojure library for data processing and machine learning. Datasets are currently in-memory columnwise databases, and we support parsing from file or input stream. We support these formats as input sources: raw/gzipped csv/tsv, xls, xlsx, json, and sequences of maps. SQL bindings are provided as a separate library. Data size in memory is minimized (primitive arrays), datetime types are often converted to an integer representation, and strings are loaded into string tables. Together these features dramatically decrease the working-set size in memory. Because data is stored in columnar fashion, columnwise operations on the dataset are very fast.
Tags: machine-learning csv xlsx datascience dataset dataframe etl-pipeline
Firebolt has a simple model intended to make it easier to write reliable pipeline applications that process a stream of data. Every application's pipeline starts with a single source, the component that receives events from some external system. Sources must implement the node.Source interface.
Aegisthus has been transitioned to maintenance mode. It is still used for ETL at Netflix for Cassandra 2.x clusters, but it will not be evolving further. Aegisthus is a bulk data pipeline out of Cassandra: it implements a reader for the SSTable format and provides a map/reduce program to create a compacted snapshot of the data contained in a column family.
Syncs data from other databases to ClickHouse; it currently supports PostgreSQL and MySQL, and supports both full and incremental ETL. synch reads its default config from ./synch.yaml, or you can use synch -c to specify a config file.
Tags: mysql kafka replication clickhouse postgresql data-etl increment-etl
Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows. When workflows are defined as code, they become more maintainable, versionable, testable, and collaborative.
Tags: workflow automation configuration-as-code directed-acyclic-graphs visualization monitor-workflow schedule-workflow scheduler
Always know what to expect from your data. Great Expectations helps data teams eliminate pipeline debt through data testing, documentation, and profiling.
Tags: data-science pipeline exploratory-data-analysis eda data-engineering data-quality data-profiling datacleaner exploratory-analysis cleandata dataquality datacleaning mlops pipeline-tests pipeline-testing dataunittest data-unit-tests exploratorydataanalysis pipeline-debt data-profilers
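As a taste of data testing in this style, here is a minimal sketch using Great Expectations' classic pandas-flavored API; the exact API surface varies considerably between versions, and the input file name is a placeholder:

```python
import great_expectations as ge

# Wraps a pandas DataFrame so columns can be tested declaratively.
df = ge.read_csv("weather.csv")  # hypothetical input file
result = df.expect_column_values_to_be_between("temp_c", min_value=-40, max_value=55)
print(result.success)  # True if every value falls inside the range
```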
Lenses offers SQL (for data browsing and Kafka Streams), Kafka Connect connector management, cluster monitoring and more. It also provides a collection of components to build a real-time ingestion pipeline.
Tags: kafka kafka-connect connector streaming cassandra hazelcast redis elasticsearch ftp influxdb coap mqtt kudu jms hbase mongodb rethinkdb documentdb cosmosdb kubernetes
An orchestration platform for the development, production, and observation of data assets. Dagster lets you define jobs in terms of the data flow between reusable, logical components, then test locally and run anywhere. With a unified view of jobs and the assets they produce, Dagster can schedule and orchestrate Pandas, Spark, SQL, or anything else that Python can invoke.
Tags: workflow data-science etl analytics scheduler data-pipelines workflow-automation dagster data-orchestrator
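A minimal Dagster job in this spirit might look like the sketch below, using the op/job API; the op bodies are stand-ins for real work:

```python
from dagster import job, op

@op
def extract():
    # Stand-in for a real API call.
    return {"city": "New York", "temp_k": 295.0}

@op
def transform(record):
    return {**record, "temp_c": record["temp_k"] - 273.15}

@op
def load(record):
    print("would insert:", record)  # stand-in for a database write

@job
def weather_job():
    load(transform(extract()))
```

Calling weather_job.execute_in_process() runs the job locally, which is what the "test locally and run anywhere" claim refers to.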
The pglogical extension provides logical streaming replication for PostgreSQL, using a publish/subscribe model. It is based on technology developed as part of the BDR project (http://2ndquadrant.com/BDR). To use pglogical the provider and subscriber must be running PostgreSQL 9.4 or newer.
Tags: postgresql replication logical-decoding database-replication subscription publish-subscribe data-transformation data-transport etl cdc zero-downtime
Bytebase is a web-based, zero-config, dependency-free database schema change and version control management tool for developers and DBAs. It lets developers collaborate on database schema changes and helps construct a single pipeline to propagate a schema change across multiple environments. It can also store the schemas in a VCS and trigger a new pipeline upon commit push.
Tags: mysql devops gitlab schema sql frontend clickhouse dml postgresql snowflake ddl dba tidb database-as-code sqlreview schema-changes gitops schema-migration database-migration
Ora2Pg is a free tool used to migrate an Oracle database to a PostgreSQL-compatible schema. It connects to your Oracle database, scans it automatically, and extracts its structure or data; it then generates SQL scripts that you can load into PostgreSQL.
Tags: database-tools database-migration oracle-to-postgresql postgresql-tools
PostGIS is a spatial database extender for the PostgreSQL object-relational database. It adds support for geographic objects, allowing location queries to be run in SQL. PostGIS adds extra types (geometry, geography, raster and others) to the PostgreSQL database. It also adds functions, operators, and index enhancements that apply to these spatial types.
Tags: database geospatial-database spatial-database geospatial geospatial-analytics postgresql-extension