:bar_chart: :clipboard: Dashboards using YAML or JSON files
Tags: data-science data-visualization dashboard data-engineering d3 d3js chart data yaml csv json gist github-gist big-data business-intelligence data-driven just-dashboard

This roadmap aims to give a complete picture of the modern data engineering landscape and serve as a study guide for aspiring data engineers. Beginners shouldn't feel overwhelmed by the vast number of tools and frameworks listed here; a typical data engineer masters only a subset of them over several years, depending on their company and career choices.
Tags: roadmap cloud data-engineering data-engineer-roadmap

We've rebuilt data engineering for the data science era. Prefect is a new workflow management system, designed for modern infrastructure and powered by the open-source Prefect Core workflow engine. Users organize Tasks into Flows, and Prefect takes care of the rest.
Tags: infrastructure workflow data-science automation workflow-engine data-engineering prefect data-ops ml-ops
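To make the Tasks-into-Flows model concrete, here is a minimal sketch using the Prefect Core (1.x) API described above; the task names and data are illustrative, not taken from the project.

```python
from prefect import task, Flow

@task
def extract():
    return [1, 2, 3]

@task
def transform(data):
    return [x * 10 for x in data]

@task
def load(data):
    print(f"Loaded {data}")

# Tasks are composed inside a Flow; Prefect builds the dependency graph.
with Flow("hello-etl") as flow:
    load(transform(extract()))

flow.run()  # execute locally with the Prefect Core engine
```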
lakeFS is an open source layer that delivers resilience and manageability to object-storage-based data lakes. With lakeFS you can build repeatable, atomic, and versioned data lake operations - from complex ETL jobs to data science and analytics.
Tags: apache-spark aws-s3 google-cloud-storage data-engineering data-lake object-storage datalake hadoop-filesystem data-quality data-versioning azure-blob-storage apache-sparksql git-for-data lakefs datalakes

Always know what to expect from your data. Great Expectations helps data teams eliminate pipeline debt through data testing, documentation, and profiling.
Tags: data-science pipeline exploratory-data-analysis eda data-engineering data-quality data-profiling datacleaner exploratory-analysis cleandata dataquality datacleaning mlops pipeline-tests pipeline-testing dataunittest data-unit-tests exploratorydataanalysis pipeline-debt data-profilers
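As a rough illustration of the data-testing workflow, the sketch below uses Great Expectations' classic pandas-backed API; the file and column names are made up for the example.

```python
import great_expectations as ge

# Load a CSV into a pandas-backed dataset that accepts expectations
# (the file and column names are illustrative).
df = ge.read_csv("orders.csv")

# Expectations are executable tests for your data.
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

# Validate the dataset against everything declared above.
results = df.validate()
print(results)
```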
Superset is fast, lightweight, intuitive, and loaded with options that make it easy for users of all skill sets to explore and visualize their data, from simple line charts to highly detailed geospatial charts. It easily integrates with your data, using either a simple no-code viz builder or a state-of-the-art SQL IDE. Superset can query data from any SQL-speaking datastore or data engine (e.g. Presto or Athena) that has a Python DB-API driver and a SQLAlchemy dialect.
Tags: react flask data-science bi analytics superset apache data-visualization data-engineering business-intelligence data-viz data-analytics data-analysis sql-editor asf business-analytics
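The DB-API/SQLAlchemy requirement boils down to this: if SQLAlchemy can connect with a given URI, Superset can generally be pointed at the same datastore. A hedged sketch, with placeholder hosts, credentials, and catalogs:

```python
from sqlalchemy import create_engine, text

# Example SQLAlchemy URIs of the kind you would paste into Superset's
# "Add Database" form; hosts, credentials, and catalogs are placeholders.
POSTGRES_URI = "postgresql+psycopg2://analyst:secret@warehouse:5432/analytics"
PRESTO_URI = "presto://presto-coordinator:8080/hive/default"  # needs the PyHive dialect installed

# If SQLAlchemy can open a connection with a URI, Superset can usually
# reach the same datastore through that dialect.
engine = create_engine(POSTGRES_URI)
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())
```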
The Serverless Data Lake Framework (SDLF) is a collection of reusable artifacts aimed at accelerating the delivery of enterprise data lakes on AWS, shortening the deployment time to production from several months to a few weeks. It can be used by AWS teams, partners, and customers to implement the foundational structure of a data lake following best practices. A data lake gives your organization agility: it provides a repository where consumers can quickly find the data they need and use it in their business projects. However, building a data lake can be complex; there's a lot to think about beyond the storage of files. For example, how do you catalog the data so you know what you've stored? What ingestion pipelines do you need? How do you manage data quality? How do you keep the code for your transformations under source control? How do you manage development, test, and production environments? Building a solution that addresses these concerns can take many weeks, and that time is better spent innovating with data and achieving business goals. The SDLF is a collection of production-hardened, best-practice templates that accelerate your data lake implementation journey on AWS, so that you can focus on use cases that generate value for the business.
Tags: aws framework serverless etl analytics best-practices data-engineering iac data-lake lake-formation

Source code accompanying the book Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017.
Tags: data-analysis data-visualization cloud-computing machine-learning data-pipeline data-processing data-science data-engineering

This workshop focuses on building a production-scale machine learning pipeline with Pachyderm that integrates Nervana Neon training and inference. In particular, the pipeline trains and utilizes a model that predicts the sentiment of movie reviews, based on data from IMDB. Finally, we provide some resources for further exploration.
Tags: machine-learning deep-learning data-science data-engineering data-pipelines containers kubernetes docker

Foxtrot is a scalable data and query store service for real-time event data.
Tags: analytics elasticsearch hbase data-visualization data-science data-engineering alerting monitoring

…which must be executed in the root project directory of your local copy of Cauldron. Cauldron can be used either through its Command Line Interface (CLI) or with the Cauldron desktop application. For more information about the desktop application, visit http://www.unnotebook.com, where you can find the download links and documentation. The rest of this README describes using Cauldron directly from the command line.
Tags: data-science python-3 notebook notebooks data-engineering dataops

I take a lot of summary notes here, but I will often also put my learnings into an Anki deck, which is a wonderful way to do spaced-repetition learning for long-term retention. Taking summary notes in combination with some form of active recall has worked really well for me and would be my recommendation to anyone looking to keep learning in this fast-changing industry. Write notes in Markdown with embedded LaTeX. When you push to develop, CircleCI renders HTML pages using a small Ruby script and Pandoc, and then pushes the results to a GitHub Pages branch. The website build process is based on work by @davepagurek.
Tags: learning computer-science software-engineering mathematics waterloo unix data-science data-engineering algorithm university educational education math engineering research course-materials aws blog

Schedule for talks, workshops, etc., with links to past talk slides and videos.
Tags: data-science machine-learning artificial-intelligence data-engineering kubernetes docker

An easy-to-use feature store. A feature store is a data storage system for data science and machine learning. It can store raw data as well as transformed features, which can be fed straight into an ML model or training script.
Tags: data-science machine-learning timeseries pandas data-engineering forecasting machinelearning dask feature-engineering machinelearning-python feature-store featurestore bytehub-cloud

Note: This repo is for Prefect UI development. To run the Prefect UI as part of Prefect Server, install Prefect and run `prefect server start`. Prefect UI requires Node.js v14 and npm v6 to run.
Tags: workflow automation vue data-engineering hacktoberfest prefect prefect-ui prefect-server

Example of an ETL pipeline using Airflow.
Tags: airflow etl postgresql data-engineering data-pipelines

In this blog post I want to go over the data engineering operations known as Extract, Transform, Load (ETL) and show how they can be automated and scheduled using Apache Airflow. You can see the source code for this project here. Extracting data can be done in a multitude of ways, but one of the most common is to query a web API. If the query is successful, we receive data back from the API's server, often in the form of JSON. JSON can be thought of as semi-structured data, or as a dictionary where the keys and values are strings. Since the data is a dictionary of strings, we must transform it before storing it or loading it into a database. Airflow is a platform to schedule and monitor workflows, and in this post I will show you how to use it to extract the daily weather in New York from the OpenWeatherMap API, convert the temperature to Celsius, and load the data into a simple PostgreSQL database.
Tags: airflow sql database schedule etl postgresql data-engineering data-pipeline etl-pipeline
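A hedged sketch of what such a DAG could look like, assuming Airflow 2.x, the requests and psycopg2 libraries, and placeholder API keys and connection details; it is not the blog post's exact code.

```python
from datetime import datetime

import requests
import psycopg2
from airflow import DAG
from airflow.operators.python import PythonOperator

# Placeholder endpoint and key for the OpenWeatherMap API.
API_URL = "https://api.openweathermap.org/data/2.5/weather"
API_KEY = "YOUR_API_KEY"

def extract():
    # E: query the web API; the JSON response is passed on via XCom.
    resp = requests.get(API_URL, params={"q": "New York", "appid": API_KEY})
    resp.raise_for_status()
    return resp.json()

def transform(ti):
    # T: OpenWeatherMap reports Kelvin by default; convert to Celsius.
    raw = ti.xcom_pull(task_ids="extract")
    return raw["main"]["temp"] - 273.15

def load(ti):
    # L: insert the transformed value into a local PostgreSQL table.
    temp_c = ti.xcom_pull(task_ids="transform")
    conn = psycopg2.connect(host="localhost", dbname="weather",
                            user="airflow", password="airflow")
    with conn, conn.cursor() as cur:
        cur.execute("INSERT INTO daily_weather (recorded_at, temp_c) VALUES (%s, %s)",
                    (datetime.utcnow(), temp_c))
    conn.close()

with DAG("nyc_weather_etl", start_date=datetime(2021, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> transform_task >> load_task
```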
Geni (/gɜni/ or "gurney" without the r) is a Clojure dataframe library that runs on Apache Spark. The name means "fire" in Javanese. Geni provides an idiomatic Spark interface for Clojure without the hassle of Java or Scala interop. Geni uses Clojure's -> threading macro as the main way to compose Spark's Dataset and Column operations, in place of the usual method chaining in Scala. It also provides a greater degree of dynamism by allowing arguments of mixed types - such as columns, strings, and keywords - in a single function invocation. See the docs section on Geni semantics for more details.
Tags: data-science machine-learning big-data spark parallel-computing distributed-computing data-engineering clojure-library high-performance-computing dataframe clojure-repl

Repository for an Akka microservice that lifts trained Spark ML models into an actor system with HTTP endpoints. akka-lift-ml helps with the hard data engineering part once you have found a good solution with your data science team. The service can train your models on a remote Spark instance and serve the results with a small local Spark service. You can access it over HTTP, e.g. with the integrated Swagger UI. To build your own system you need sbt and Scala. The trained models are saved to AWS S3 and referenced in a Postgres database, so you can scale out your instances for load balancing.
Tags: machine-learning akka akka-http spark data-engineering fast-data

Versatile Data Kit is a data engineering framework that enables Data Engineers to develop, troubleshoot, deploy, run, and manage data processing workloads (referred to as "Data Jobs"). A "Data Job" enables Data Engineers to implement automated pull ingestion (E in ELT) and batch data transformation (T in ELT) into a database. Versatile Data Kit provides an abstraction layer that helps solve common data engineering problems. It can be called by the workflow engine with the goal of making data engineers more efficient (for example, it ensures data applications are packaged, versioned, and deployed correctly, while dealing with credentials, retries, reconnects, etc.). Everything exposed by Versatile Data Kit provides built-in monitoring, troubleshooting, and smart notification capabilities. For example, tracking both code and data modifications and the relations between them enables engineers to troubleshoot more quickly and provides an easy revert to a stable version.
Tags: data-science sql etl analytics sqlite plugins data-warehouse data-engineering warehouse elt data-pipelines data-quality data-engineer trino data-lineage
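For illustration, a single Data Job step might look roughly like the sketch below; the job_input method names are assumptions based on Versatile Data Kit's documented job-input interface, so verify them against the project's docs, and the table and payload names are made up.

```python
# One step of a VDK "Data Job", e.g. 20_ingest_and_transform.py.
# The job_input method names below are assumptions; check the VDK docs
# for the exact signatures.

def run(job_input):
    # E in ELT: send a record for ingestion into the configured target
    # (the payload and table name are illustrative).
    payload = {"user_id": 1, "event": "signup", "ts": "2021-01-01"}
    job_input.send_object_for_ingestion(payload=payload,
                                        destination_table="raw_events")

    # T in ELT: run a batch SQL transformation inside the database.
    job_input.execute_query(
        "INSERT INTO daily_signups "
        "SELECT CAST(ts AS DATE) AS day, COUNT(*) FROM raw_events GROUP BY 1"
    )
```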