Displaying 1 to 20 from 23 results

data-engineer-roadmap - Roadmap to becoming a data engineer in 2021


This roadmap aims to give a complete picture of the modern data engineering landscape and serve as a study guide for aspiring data engineers. Beginners shouldn’t feel overwhelmed by the vast number of tools and frameworks listed here. A typical data engineer would master a subset of these tools throughout several years depending on his/her company and career choices.

prefect - The easiest way to automate your data

  •    Python

We've rebuilt data engineering for the data science era. Prefect is a new workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine. Users organize Tasks into Flows, and Prefect takes care of the rest.

lakeFS - Git-like capabilities for your object storage

  •    Go

lakeFS is an open source layer that delivers resilience and manageability to object-storage based data lakes. With lakeFS you can build repeatable, atomic and versioned data lake operations - from complex ETL jobs to data science and analytics.

Apache Superset is a Data Visualization and Data Exploration Platform

  •    Python

Superset is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data, from simple line charts to highly detailed geospatial charts. It easily integrates your data, using either our simple no-code viz builder or state of the art SQL IDE. Superset can query data from any SQL-speaking datastore or data engine (e.g. Presto or Athena) that has a Python DB-API driver and a SQLAlchemy dialect.

aws-serverless-data-lake-framework - Enterprise-grade, production-hardened, serverless data lake on AWS

  •    Python

The Serverless Data Lake Framework (SDLF) is a collection of reusable artifacts aimed at accelerating the delivery of enterprise data lakes on AWS, shortening the deployment time to production from several months to a few weeks. It can be used by AWS teams, partners and customers to implement the foundational structure of a data lake following best practices. A data lake gives your organization agility. It provides a repository where consumers can quickly find the data they need and use it in their business projects. However, building a data lake can be complex; there’s a lot to think about beyond the storage of files. For example, how do you catalog the data so you know what you’ve stored? What ingestion pipelines do you need? How do you manage data quality? How do you keep the code for your transformations under source control? How do you manage development, test and production environments? Building a solution that addresses these use cases can take many weeks and this time can be better spent innovating with data and achieving business goals. The SDLF is a collection of production-hardened, best practice templates which accelerate your data lake implementation journey on AWS, so that you can focus on use cases that generate value for business.

neon-workshop - A Pachyderm deep learning tutorial for conference workshops

  •    Python

This workshop focuses on building a production scale machine learning pipeline with Pachyderm that integrates Nervana Neon training and inference. In particular, this pipeline trains and utilizes a model that predicts the sentiment of movie reviews, based on data from IMDB. Finally, we provide some Resources for you for further exploration.

cauldron - The Unnotebook: A Production-Ready Data Environment Built for DataOps Workflows

  •    Python

which must be executed in the root project directory of your local copy of Cauldron. Cauldron can be used as either through its Command Line Interface (CLI) or with the Cauldron desktop application. For more information about the desktop application visit http://www.unnotebook.com where you can find the download links and documentation. The rest of this README describes using Cauldron directly from the command line.

Learn-Something-Every-Day - 📝 A compilation of everything that I learn; Computer Science, Software Development, Engineering, Math, and Coding in General

  •    CSS

I take a lot of summary notes here, but I will often also put my learnings into an anki deck which is a wonderful way to do spaced repetition learning for long term retention. Taking summary notes in combination with some form of active recollection has worked really well for me and would be my recommendation to anyone looking to always be learning in this fast changing industry. Write notes in Markdown with embedded LaTeX. When you push to develop, get CircleCI to render HTML pages using a small Ruby script and Pandoc, and then push the results to a Github Pages branch. The website build process is based on work by @davepagurek.

bytehub - ByteHub: making feature stores simple

  •    Python

An easy-to-use feature store. A feature store is a data storage system for data science and machine-learning. It can store raw data and also transformed features, which can be fed straight into an ML model or training script.

ui - The home of the Prefect UI

  •    Vue

Note: This repo is for Prefect UI development. To run the Prefect UI as part of Prefect Server, install Prefect and run prefect server start. Prefect UI requires Node.js v14 and npm v6 to run.

AirflowETL - Blog post on ETL pipelines with Airflow

  •    Jupyter

In this blog post I want to go over the operations of data engineering called Extract, Transform, Load (ETL) and show how they can be automated and scheduled using Apache Airflow. You can see the source code for this project here. Extracting data can be done in a multitude of ways, but one of the most common ways is to query a WEB API. If the query is sucessful, then we will receive data back from the API's server. Often times the data we get back is in the form of JSON. JSON can pretty much be thought of a semi-structured data or as a dictionary where the dictionary keys and values are strings. Since the data is a dictionary of strings this means we must transform it before storing or loading into a database. Airflow is a platform to schedule and monitor workflows and in this post I will show you how to use it to extract the daily weather in New York from the OpenWeatherMap API, convert the temperature to Celsius and load the data in a simple PostgreSQL database.

geni - A Clojure dataframe library that runs on Spark

  •    Clojure

Geni (/gɜni/ or "gurney" without the r) is a Clojure dataframe library that runs on Apache Spark. The name means "fire" in Javanese. Geni provides an idiomatic Spark interface for Clojure without the hassle of Java or Scala interop. Geni uses Clojure's -> threading macro as the main way to compose Spark's Dataset and Column operations in place of the usual method chaining in Scala. It also provides a greater degree of dynamism by allowing args of mixed types such as columns, strings and keywords in a single function invocation. See the docs section on Geni semantics for more details.

akka-lift-ml - akka http service for serving spark machine learning models

  •    Scala

Repository for an akka microservice that lift the trained spark ml algorithms as a actorsystem with http endpoints. akka-lift-ml helps you with the hard data engineering part, when you have found a good solutions with your data science team. The service can train your models on a remote spark instance and serve the results with a small local spark service. You can access it over http e.g. with the integrated swagger ui. To build your own system you need sbt and scala. The trained models are saved to AWS S3 and referenced in a postgres database, so can scale out your instances for load balacing.

versatile-data-kit - Versatile Data Kit is a data engineering framework that enables Data Engineers to develop, troubleshoot, deploy, run, and manage data processing workloads

  •    Python

Versatile Data Kit is a data engineering framework that enables Data Engineers to develop, troubleshoot, deploy, run, and manage data processing workloads (referred to as "Data Jobs"). A "Data Job" enables Data Engineers to implement automated pull ingestion (E in ELT) and batch data transformation (T in ELT) into a database. Versatile Data Kit provides an abstraction layer that helps solve common data engineering problems. It can be called by the workflow engine with the goal of making data engineers more efficient (for example, it ensures data applications are packaged, versioned and deployed correctly, while dealing with credentials, retries, reconnects, etc.). Everything exposed by Versatile Data Kit provides built-in monitoring, troubleshooting, and smart notification capabilities. For example, tracking both code and data modifications and the relations between them enables engineers to troubleshoot more quickly and provides an easy revert to a stable version.

We have large collection of open source products. Follow the tags from Tag Cloud >>

Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.