lakeFS is an open-source layer that delivers resilience and manageability to object-storage-based data lakes. With lakeFS you can build repeatable, atomic, and versioned data lake operations - from complex ETL jobs to data science and analytics.
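lakeFS also exposes an S3-compatible gateway, so standard S3 clients can read and write versioned data by addressing objects as repository/branch/path. A minimal sketch with boto3, where the endpoint, credentials, repository name (my-repo), and branch (dev) are all assumptions:

```python
import boto3

# lakeFS speaks the S3 protocol; all connection details below are assumptions.
s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
)

# Write to an isolated branch instead of directly to production data.
s3.put_object(Bucket="my-repo", Key="dev/events/2021-06-01.json", Body=b"{}")

# Read the same object back from that branch.
obj = s3.get_object(Bucket="my-repo", Key="dev/events/2021-06-01.json")
print(obj["Body"].read())
```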
Tags: apache-spark aws-s3 google-cloud-storage data-engineering data-lake object-storage datalake hadoop-filesystem data-quality data-versioning azure-blob-storage apache-sparksql git-for-data lakefs datalakes

Always know what to expect from your data. Great Expectations helps data teams eliminate pipeline debt through data testing, documentation, and profiling.
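As a minimal sketch of the idea using Great Expectations' classic pandas-flavored API (the file name and column names are hypothetical):

```python
import great_expectations as ge

# Load a CSV as a Great Expectations dataset (hypothetical file).
df = ge.read_csv("orders.csv")

# Declare expectations; each call validates immediately and records itself.
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)

# Re-run every recorded expectation as a suite.
results = df.validate()
print(results.success)
```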
Tags: data-science pipeline exploratory-data-analysis eda data-engineering data-quality data-profiling datacleaner exploratory-analysis cleandata dataquality datacleaning mlops pipeline-tests pipeline-testing dataunittest data-unit-tests exploratorydataanalysis pipeline-debt data-profilers

Qualitis is a data quality management platform that supports quality verification, notification, and management for various data sources. It is used to solve data quality problems caused by data processing. Built on Spring Boot, Qualitis submits quality model tasks to the Linkis platform. It provides functions such as data quality model construction and execution, data quality verification, and data quality report generation.
Tags: workflow quality compare dss data-quality quality-improvement quality-check linkis datashperestudio data-quality-model

DataSphere Studio works together with Linkis, Scriptis, Qualitis, Schedulis, and Exchangis. It is positioned as a data application development portal whose closed loop covers the entire process of data application development. Its unified UI provides a workflow-style, graphical drag-and-drop development experience spanning the full lifecycle of data application development: data import, desensitization and cleaning, data analysis, data mining, quality inspection, visualization, scheduling, and data output applications.
Tags: bi kafka spark hive hadoop etl scheduler ide hbase portal mask sqoop data-quality data-map

MDS Modeling Workbook is a modeling tool and a solution accelerator for Microsoft Master Data Services.
Tags: cdi data-architecture data-quality excel excel-2010 master-data master-data-services

Jumbune is an open-source product built for analyzing Hadoop clusters and MapReduce jobs. It provides development and administrative insights into Hadoop-based analytical solutions, enabling users to debug, profile, monitor, and validate analytical solutions hosted on decoupled clusters.
Tags: hadoop-cluster yarn yarn-hadoop-cluster optimization-framework hadoop hadoop-monitor data-quality data-analysis devops-tools developer-tools monitoring-tool

Our first idea was to answer this question: can we assess the quality of OpenStreetMap data, and how? This project explores and analyzes the OpenStreetMap edit history in order to classify contributors.
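The project's own Luigi pipeline is not reproduced here; as a generic sketch of the PCA-plus-k-means approach suggested by its tags, with entirely hypothetical contributor features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical per-contributor features: edit count, nodes created,
# and the share of edits later corrected by other contributors.
X = np.array([
    [1500, 900, 0.05],
    [12,   4,   0.60],
    [300,  180, 0.10],
    [5,    1,   0.75],
])

# Reduce dimensionality, then group contributors into profiles.
X_reduced = PCA(n_components=2).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_reduced)
print(labels)
```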
Tags: openstreetmap luigi data-quality pca kmeans osm data-analysis machine-learning statistics

It has been recognized that poor data quality can have multiple negative impacts on enterprises [1]. Businesses operating on dirty data risk large financial losses, and maintaining data quality increases operational costs, since businesses must spend time and resources detecting and correcting erroneous data. As data keeps growing, data repairing has become an important problem and an active research area. DTCleaner produces multi-target decision trees for the purpose of data cleaning. It detects erroneous tuples in a dataset based on a given set of conditional functional dependencies (CFDs) and builds a classification model to predict erroneous tuples, so that the "cleaned" dataset satisfies the CFDs and is semantically correct.
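DTCleaner itself is a separate codebase; as a minimal stand-in illustrating the multi-target decision tree idea, scikit-learn's multi-output DecisionTreeClassifier can be used (all values are toy data):

```python
from sklearn.tree import DecisionTreeClassifier

# Toy tuples: [zip_code, city_id] as input attributes; two target
# attributes per tuple (e.g. corrected city and state codes).
X = [[10001, 1], [10001, 1], [94105, 2], [94105, 2]]
y = [[1, 1], [1, 1], [2, 2], [2, 2]]  # one column per target attribute

# scikit-learn trees natively support multi-output classification.
model = DecisionTreeClassifier().fit(X, y)
print(model.predict([[10001, 1]]))  # -> [[1, 1]]
```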
Tags: data-science data-quality data-cleaning data-mining data-preprocessing data-wrangling

Linked Open Data (LOD) has emerged as one of the largest collections of interlinked datasets on the web. Benefiting from this mine of data requires descriptive information about each dataset in its accompanying metadata. Such metadata is currently limited to a few data portals, where it is usually provided manually, yielding little or poor-quality insight. To address this issue, we propose a scalable automatic approach for extracting, validating, and generating descriptive linked dataset profiles. The approach applies several techniques to check the validity of the attached metadata and to provide descriptive and statistical information about a given dataset as well as a whole data portal. Applying our framework to prominent data portals shows that the general state of Linked Open Data needs attention: most datasets suffer from poor-quality metadata and lack additional informative metrics. The identification process for each portal can be easily customized by overriding the prototype.check function of each parser, and steps can be added to or removed from the identification process through configuration.
node data-quality dataset portal ckan ckan-api dataset-catalog dataset-metadata data-profilingTo upload issues, set the variable to the directory where the resulting issue .csv files were output, specifcy the data version in the variable, and specify the site. Sourcing the script will upload the issues. This toolkit has been designed for conducting data quality assessments on clinical datasets modeled using the OMOP common data model. The toolkit includes a wide variety of data quality checks and a GitHub-based issue reporting mechanism. The toolkit is being routinely used by the PEDSnet CDRN.
Tags: data-quality-checks data-quality omop pedsnet

An Airflow operator for Great Expectations, a Python library for testing and validating data. This is an experimental library as of June 2021: the Great Expectations core team maintains this provider in an experimental state and does not yet guarantee ongoing support.
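A hedged sketch of wiring the operator into a DAG; the suite name, file path, datasource, and context directory are assumptions, and the exact parameters may differ across versions of this experimental provider:

```python
from datetime import datetime

from airflow import DAG
from great_expectations_provider.operators.great_expectations import (
    GreatExpectationsOperator,
)

with DAG("data_quality", start_date=datetime(2021, 6, 1),
         schedule_interval=None) as dag:
    # Fails the task if the expectation suite does not pass.
    validate = GreatExpectationsOperator(
        task_id="validate_orders",
        expectation_suite_name="orders.warning",    # hypothetical suite
        batch_kwargs={
            "path": "/data/orders.csv",             # hypothetical file
            "datasource": "orders_datasource",      # hypothetical datasource
        },
        data_context_root_dir="/opt/great_expectations",
    )
```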
Tags: data-science data-quality airflow-operators data-testing

Great Expectations is a leading tool for validating, documenting, and profiling your data to maintain quality and improve communication between teams. To configure the GitHub Action for your repository, add the configuration snippet from the action's documentation to a GitHub workflow file, e.g. my_repo_name/.github/workflows/my_workflow.yml.
Tags: data-science continuous-integration actions data-quality data-integrity mlops

Deequ is a library built on top of Apache Spark for defining "unit tests for data" that measure data quality in large datasets. An in-depth look at this solution, along with more details on Deequ, is available on the AWS Big Data Blog.
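A minimal sketch using PyDeequ, the Python interface to Deequ; it assumes a Spark environment where the Deequ jar can be resolved, and the example data is made up:

```python
import pydeequ
from pydeequ.checks import Check, CheckLevel
from pydeequ.verification import VerificationResult, VerificationSuite
from pyspark.sql import SparkSession

# Spark session with the Deequ jar on the classpath.
spark = (SparkSession.builder
         .config("spark.jars.packages", pydeequ.deequ_maven_coord)
         .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
         .getOrCreate())

df = spark.createDataFrame(
    [(1, "a", 5.0), (2, "b", -1.0)], ["id", "name", "amount"])

# Declare constraints and run the verification suite.
check = (Check(spark, CheckLevel.Warning, "quality checks")
         .isComplete("id")        # no NULLs in id
         .isUnique("id")          # id is a key
         .isNonNegative("amount"))
result = VerificationSuite(spark).onData(df).addCheck(check).run()
VerificationResult.checkResultsAsDataFrame(spark, result).show()
```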
Tags: aws data-quality aws-glue deequ

SQLBucket is a lightweight framework to help write, orchestrate, and validate SQL data pipelines. It lets you set variables and introduces some control flow using the fantastic Jinja2 library. It also implements a very simple unit and integration test framework in which you validate the results of your ETL in the form of SQL checks. With SQLBucket, you can apply TDD principles when writing data pipelines. It can work as a standalone service or as part of your workflow manager environment (Airflow, Luigi, etc.).
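SQLBucket's own API is not shown here; as a generic illustration of the underlying idea (a Jinja2-templated SQL check that must return zero rows), assuming only jinja2 and the standard library:

```python
import sqlite3

from jinja2 import Template

# A quality check written as templated SQL: it should return zero rows.
CHECK_SQL = Template("""
    SELECT order_id FROM orders
    WHERE amount < 0 AND created_at >= '{{ start_date }}'
""")

# In-memory database with a hypothetical orders table, for demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INT, amount REAL, created_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 10.0, '2021-06-01')")

failed = conn.execute(CHECK_SQL.render(start_date="2021-01-01")).fetchall()
assert not failed, f"{len(failed)} rows failed the non-negative amount check"
```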
Tags: sql etl data-quality-checks data-quality etl-framework data-integrity data-engineering-workflows

Versatile Data Kit is a data engineering framework that enables data engineers to develop, troubleshoot, deploy, run, and manage data processing workloads (referred to as "Data Jobs"). A Data Job lets data engineers implement automated pull ingestion (the E in ELT) and batch data transformation (the T in ELT) into a database. Versatile Data Kit provides an abstraction layer that helps solve common data engineering problems. It can be called by a workflow engine with the goal of making data engineers more efficient (for example, it ensures data applications are packaged, versioned, and deployed correctly while dealing with credentials, retries, reconnects, etc.). Everything exposed by Versatile Data Kit provides built-in monitoring, troubleshooting, and smart notification capabilities. For example, tracking both code and data modifications, and the relations between them, lets engineers troubleshoot more quickly and revert easily to a stable version.
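A minimal sketch of a single step inside a Data Job, based on the vdk-core Python API; the table name and payload are hypothetical:

```python
# 10_ingest.py - one step inside a VDK data job directory.
# Steps run in filename order and must define run(job_input).
from vdk.api.job_input import IJobInput


def run(job_input: IJobInput):
    # Pull (the E in ELT): fetch a record from some source (hardcoded here).
    payload = {"order_id": 1, "amount": 42.0}

    # Ingest it; VDK handles batching, retries, and credentials.
    job_input.send_object_for_ingestion(
        payload=payload,
        destination_table="orders",  # hypothetical table
    )

    # Transform (the T in ELT): run SQL against the configured database.
    job_input.execute_query(
        "CREATE TABLE IF NOT EXISTS clean_orders AS "
        "SELECT * FROM orders WHERE amount >= 0"
    )
```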
Tags: data-science sql etl analytics sqlite plugins data-warehouse data-engineering warehouse elt data-pipelines data-quality data-engineer trino data-lineage