Displaying 1 to 5 from 5 results

lakeFS - Git-like capabilities for your object storage

  •    Go

lakeFS is an open source layer that delivers resilience and manageability to object-storage based data lakes. With lakeFS you can build repeatable, atomic and versioned data lake operations - from complex ETL jobs to data science and analytics.

kedro - A Python framework for creating reproducible, maintainable and modular data science code.

  •    Python

Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code. It borrows concepts from software engineering and applies them to machine-learning code; applied concepts include modularity, separation of concerns and versioning. Our Get Started guide contains full installation instructions, and includes how to set up Python virtual environments.

Quilt - Data Engineering Infrastructure

  •    Python

With Quilt you can build, push, and install data packages. Data packages are versioned, reusable data structures that can be loaded into Python. Quilt is designed to support reproducible, auditable, and compliant workflows. Quilt consists of three source-level components data catalog, data registry and data compiler.

RecallGraph - A versioning data store for time-variant graph data

  •    Javascript

RecallGraph is a versioned-graph data store for time variant graphs built on ArangoDB. It retains all changes that its data (vertices and edges) have gone through to reach their current state. It supports point-in-time graph traversals, letting the user query any past state of the graph just as easily as the present. It is a Foxx Microservice for ArangoDB that features VCS-like semantics in many parts of its interface, and is backed by a transactional event tracker.

data-versioning - Collecting thoughts about data versioning


Version control is a huge part of reproducible research and open source software development. Versioning provides a complete history of some digital object (e.g., a software program, a research project, etc.) and, importantly, allows one to trace what changes have made to that object, when those changes were made, and (with the appropriate metadata) why those changes were made. This document holds some of my current thinking about version control for data. Especially in the social sciences, researchers depend on large, public datasets (e.g., Polity, Quality of Government, Correlates of War, ANES, ESS, etc.) as source material for quantitative research. These datasets typically evolve (new data is added over time, corrections are made to data values, etc.) and new releases are periodically made public. Sometimes these data are complex collaborative efforts (see, for example, Quality of Government) and others are public releases of single-institution data collection efforts (e.g., ANES). While collaborative datasets create a more obvious use case for version control, single-institution datasets might also be improved by version control. This is particularly important because old releases of these vital datasets are often not archived (e.g., ANES) meaning that it is essentially impossible to recover a prior version of a given ANES dataset after a new release has occurred. This post is meant to steer thinking about how to manage the creation, curation, revision, and dissemination of these kinds of datasets. While the ideas here might also apply to how one thinks about managing their own data, they probably apply more at the stage of data creation than at later data use after a dataset is essentially complete or frozen.