Displaying 1 to 16 from 16 results

lakeFS - Git-like capabilities for your object storage

  •    Go

lakeFS is an open source layer that delivers resilience and manageability to object-storage based data lakes. With lakeFS you can build repeatable, atomic and versioned data lake operations - from complex ETL jobs to data science and analytics.

Qualitis - Qualitis is a one-stop data quality management platform that supports quality verification, notification, and management for various datasource

  •    Java

Qualitis is a data quality management platform that supports quality verification, notification, and management for various datasource. It is used to solve various data quality problems caused by data processing. Based on Spring Boot, Qualitis submits quality model task to Linkis platform. It provides functions such as data quality model construction, data quality model execution, data quality verification, reports of data quality generation and so on.

WeDataSphere - WeDataSphere is a financial level one-stop open-source suitcase for big data platforms


DataSphere Studio, Linkis, Scriptis, Qualitis, Schedulis, Exchangis. DataSphere Studio is positioned as a data application development portal, and the closed loop covers the entire process of data application development. With a unified UI, the workflow-like graphical drag-and-drop development experience meets the entire lifecycle of data application development from data import, desensitization cleaning, data analysis, data mining, quality inspection, visualization, scheduling to data output applications, etc.



MDS Modeling Workbook is a modeling tool and a solution accelerator for Microsoft Master Data Services.

jumbune - Jumbune is an open-source project to optimize both Yarn (v2) and older (v1) Hadoop based solutions

  •    Java

Jumbune is an open-source product built for analyzing Hadoop cluster and MapReduce jobs. It provides development & administrative insights of Hadoop based analytical solutions. It enables user to Debug, Profile, Monitor & Validate analytical solutions hosted on decoupled clusters.

osm-data-classification - OpenStreetMap Data Classification

  •    Python

Our first idea was to answer to this question: can we assess the quality of OpenStreetMap data? (and how?). This project is dedicated to explore and analyze the OpenStreetMap data history in order to classify the contributors.

DTCleaner - DTCleaner: data cleaning using multi-target decision trees.

  •    Java

It has been recognized that poor data quality can have multiple negative impact to enterprises [1]. Businesses operating on dirty data are in risk of causing large amount of financial loses. Maintaining data quality can also increases operational cost as business would need to spend time and resources to detect erroneous data and correct them. As data grows bigger these days, data repairing has became an important problem and an important research area. DTCleaner produces multi-target decision trees for the purpose of data cleaning. It's built for detecting erroneous tuples in the dataset based on given set of conditional functional dependencies (CFDs) and building a classification model to predict erroneous tuples such that the "cleaned" dataset satisfies the CFDs, and semantically correct.

openData-checker - A Node

  •    Javascript

Linked Open Data (LOD) has emerged as one of the largest collection of interlinked datasets on the web. Benefiting from this mine of data requires the existence of descriptive information about each dataset in the accompanying metadata. Such meta information is currently very limited to few data portals where they are usually provided manually thus giving little or bad quality insights. To address this issue, we propose a scalable automatic approach for extracting, validating and generating descriptive linked dataset profiles. This approach applies several techniques to check the validity of the attached metadata as well as providing descriptive and statistical information of a certain dataset as well as a whole data portal. Using our framework on prominent data portals shows that the general state of the Linked Open Data needs attention as most of datasets suffer from bad quality metadata and lack additional informative metrics. The identification process for each portal can be easily customized by overriding the prototype.check function for each parser. Moreover, adding or removing steps from the identification process can be easily configured.

Data-Quality-Analysis - The PEDSnet Data Quality Assessment Toolkit (OMOP CDM)

  •    R

To upload issues, set the variable to the directory where the resulting issue .csv files were output, specifcy the data version in the variable, and specify the site. Sourcing the script will upload the issues. This toolkit has been designed for conducting data quality assessments on clinical datasets modeled using the OMOP common data model. The toolkit includes a wide variety of data quality checks and a GitHub-based issue reporting mechanism. The toolkit is being routinely used by the PEDSnet CDRN.

airflow-provider-great-expectations - Great Expectations Airflow operator

  •    Python

This is an experimental library as of June 2021! The Great Expectations core team maintains this provider in an experimental state and does not guarantee ongoing support yet. An Airflow operator for Great Expectations, a Python library for testing and validating data.

great_expectations_action - A GitHub Action that makes it easy to use Great Expectations to validate your data pipelines in your CI workflows

  •    Jupyter

Great Expectations is a leading tool for validating, documenting, and profiling your data to maintain quality and improve communication between teams. In order to configure the GitHub action for your repository, add the following code snippet to your GitHub workflows file. The file should be located under my_repo_name/.github/my_workflow.yml.

amazon-deequ-glue - Automated data quality suggestions and analysis with Deequ on AWS Glue

  •    Scala

Read our AWS Big Data Blog for an in-depth look at this solution. More details on Deequ can be found in this AWS Blog.

sqlbucket - Lightweight library to write, orchestrate and test your SQL ETL

  •    Python

SQLBucket is a lightweight framework to help write, orchestrate and validate SQL data pipelines. It gives the possibility to set variables and introduces some control flow using the fantastic Jinja2 library. It also implements a very simplistic unit and integration test framework where you can validate the results of your ETL in the form of SQL checks. With SQLBucket, you can apply TDD principles when writing data pipelines. It can work as a stand alone service, or be part of your workflow manager environment (Airflow, Luigi, ..).

versatile-data-kit - Versatile Data Kit is a data engineering framework that enables Data Engineers to develop, troubleshoot, deploy, run, and manage data processing workloads

  •    Python

Versatile Data Kit is a data engineering framework that enables Data Engineers to develop, troubleshoot, deploy, run, and manage data processing workloads (referred to as "Data Jobs"). A "Data Job" enables Data Engineers to implement automated pull ingestion (E in ELT) and batch data transformation (T in ELT) into a database. Versatile Data Kit provides an abstraction layer that helps solve common data engineering problems. It can be called by the workflow engine with the goal of making data engineers more efficient (for example, it ensures data applications are packaged, versioned and deployed correctly, while dealing with credentials, retries, reconnects, etc.). Everything exposed by Versatile Data Kit provides built-in monitoring, troubleshooting, and smart notification capabilities. For example, tracking both code and data modifications and the relations between them enables engineers to troubleshoot more quickly and provides an easy revert to a stable version.

We have large collection of open source products. Follow the tags from Tag Cloud >>

Open source products are scattered around the web. Please provide information about the open source projects you own / you use. Add Projects.