
Gaffer - A large-scale entity and relation database supporting aggregation of properties

  •    Java

Gaffer is a graph database framework. It allows the storage of very large graphs containing rich properties on the nodes and edges. Several storage options are available, including Accumulo, HBase, and Parquet. It is designed to be as flexible, scalable, and extensible as possible, allowing for rapid prototyping and transition to production systems.
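The "aggregation of properties" Gaffer supports means that repeated observations of the same edge are merged into a single edge whose properties are combined (e.g. counts summed). A minimal, library-free sketch of the idea (not Gaffer's actual Java API):

```python
from collections import defaultdict

def aggregate_edges(edges):
    """Merge duplicate (src, dst, label) edges, summing their 'count'
    property -- the simplest example of property aggregation."""
    totals = defaultdict(int)
    for src, dst, label, count in edges:
        totals[(src, dst, label)] += count
    return [(s, d, l, c) for (s, d, l), c in totals.items()]

observations = [
    ("A", "B", "calls", 1),
    ("A", "B", "calls", 2),  # same edge seen again: counts aggregate
    ("B", "C", "calls", 5),
]
graph = aggregate_edges(observations)
```

In a real deployment this merge happens inside the storage layer (e.g. at Accumulo compaction time) rather than in application memory.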

incubator-hudi - Upserts And Incremental Processing on Big Data

  •    Java

Hoodie is an Apache Spark library for efficiently performing incremental processing on datasets in HDFS.
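The upsert model behind this kind of incremental processing can be illustrated with a small, library-free sketch (the record layout and key field below are illustrative assumptions, not Hoodie's storage format, which operates on files in HDFS):

```python
# Sketch of an upsert: merge incoming records into an existing dataset
# by record key, replacing records whose key matches and appending the
# rest -- avoiding a full rewrite of untouched data.

def upsert(existing, incoming, key="id"):
    """Return a new list with `incoming` records upserted by `key`."""
    merged = {rec[key]: rec for rec in existing}
    for rec in incoming:
        merged[rec[key]] = rec  # insert new key or overwrite existing
    return list(merged.values())

base = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
delta = [{"id": 2, "v": "b2"}, {"id": 3, "v": "c"}]
result = upsert(base, delta)
```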

hoodie - Spark Library for Hadoop Upserts And Incrementals

  •    Java

Hoodie is an Apache Spark library for efficiently performing incremental processing on datasets in HDFS.

iceberg - Iceberg is a table format for large, slow-moving tabular data

  •    Java

Iceberg is a new table format for storing large, slow-moving tabular data. It is designed to improve on the de facto standard table layout built into Hive, Presto, and Spark. Iceberg is under active development at Netflix.

node-parquet - NodeJS module to access apache parquet format files

  •    C++

Parquet is a columnar storage format available to any project in the Hadoop ecosystem. This Node.js module provides native bindings to the Parquet functions from parquet-cpp. A pure JavaScript Parquet format driver (still in development) is also provided.

skale - High performance distributed data processing engine

  •    Javascript

High-performance distributed data processing and machine learning. Skale provides a high-level API in JavaScript and an optimized parallel execution engine on top of Node.js.

gcs-tools - GCS support for avro-tools and parquet-tools

  •    Java

Lightweight wrapper that adds Google Cloud Storage (GCS) support to common Hadoop tools, including avro-tools, parquet-tools, and proto-tools for Scio's Protobuf-in-Avro files, so that they can be used from regular workstations or laptops outside of a Google Compute Engine (GCE) instance. It uses your existing OAuth2 credentials and allows authentication via a browser.

ratatool - A tool for random data sampling and generation

  •    Scala

A tool for random data sampling and generation. Download the release jar and run it. The command-line tool can sample from the local file system, or directly from Google Cloud Storage if the Google Cloud SDK is installed and authenticated.
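Sampling uniformly from data too large to hold in memory is commonly done with reservoir sampling; a minimal conceptual sketch (not ratatool's actual implementation, which is in Scala):

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Uniformly sample up to k items from an iterable of unknown length,
    in one pass and with O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 5, seed=42)
```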


parquet-avro-extra - Scala macros for generating Parquet schema projections and filter predicates

  •    Scala

Scala macros for generating Parquet column projections and filter predicates.

parquet-rs - Apache Parquet implementation in Rust

  •    Rust

See the crate documentation for the available API. To update the Parquet format to a newer version, check whether a newer parquet-format release is available, then simply update the version of the parquet-format crate in Cargo.toml.
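The version bump described above is a one-line change in the crate's Cargo.toml (the version number below is a placeholder, not a recommendation):

```toml
[dependencies]
# Bump this when a newer parquet-format release is published.
parquet-format = "2.6"
```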

hudi - Spark Library for Hadoop Upserts And Incrementals

  •    Java

Hoodie is an Apache Spark library for efficiently performing incremental processing on datasets in HDFS.

eel-sdk - Big Data Toolkit for the JVM

  •    Scala

Eel is a toolkit for manipulating data in the Hadoop ecosystem. By Hadoop ecosystem we mean file formats common to the big-data world, such as Parquet, ORC, and CSV, in locations such as HDFS or Hive tables. In contrast to distributed batch or streaming engines such as Spark or Flink, Eel is an SDK intended to be used directly in-process. Eel is a lower-level API than engines like Spark and is aimed at use cases where you want something like a file API. Here are some of our notes comparing Eel to other tools that offer similar functionality.

Parquet.jl - Julia implementation of parquet columnar file format reader and writer

  •    Julia

Load a parquet file: only metadata is read initially, and data is loaded in chunks on demand. The schema can then be examined.

devops-python-tools - DevOps CLI Tools for Hadoop, Spark, HBase, Log Anonymizer, Ambari Blueprints, AWS CloudFormation, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Elasticsearch, Solr, Travis CI, Pig, IPython - Python / Jython Tools

  •    Python

A few of the Big Data, NoSQL, and Linux tools I've written over the years. All programs have --help to list the available options. For many more tools, see the DevOps Perl Tools and Advanced Nagios Plugins Collection repos, which contain many Hadoop, NoSQL, web, and infrastructure tools and Nagios plugins.

drill-test-framework - Test Framework for Apache Drill

  •    Eiffel

Test framework for SQL-on-Hadoop technologies. It currently supports Apache Drill, a schema-free SQL query engine for Hadoop, NoSQL, and cloud storage. The framework is built for regression, integration, and sanity testing, and includes test coverage (with baselines) for core Drill functionality and supported features. The tests are used by the Apache Drill community for pre-commit regression runs and as part of the release criteria.

OAP - Optimized Analytics Package for Spark Platform

  •    Scala

OAP - Optimized Analytics Package (previously known as Spinach) is designed to accelerate ad-hoc queries. OAP defines a new Parquet-like columnar storage data format and offers a fine-grained hierarchical cache mechanism, in units of "Fibers", in memory. In addition, OAP extends the Spark SQL DDL to let users define custom indices on a relation. By default, it builds for Spark 2.1.0. To specify the Spark version, use the profile spark-2.1 or spark-2.2.
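The fine-grained caching idea (keeping individual column chunks, rather than whole files, in a bounded in-memory cache) can be sketched as follows; the key layout and the LRU eviction policy here are illustrative assumptions, not OAP's actual design:

```python
from collections import OrderedDict

class ColumnChunkCache:
    """Toy LRU cache keyed by (file, column, row_group): column data is
    cached at a granularity finer than whole files, so a query touching
    one column does not evict unrelated data."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, file, column, row_group, load):
        key = (file, column, row_group)
        if key in self._store:
            self._store.move_to_end(key)  # mark as recently used
            return self._store[key]
        value = load()  # cache miss: read the chunk from storage
        self._store[key] = value
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used
        return value

cache = ColumnChunkCache(capacity=2)
cache.get("f.parquet", "price", 0, lambda: [1.0, 2.0])
cache.get("f.parquet", "price", 1, lambda: [3.0])
cache.get("f.parquet", "qty", 0, lambda: [7])  # evicts ("price", 0)
```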

aws-data-wrangler - The missing link between AWS services and the most popular Python data libraries

  •    Python

AWS Data Wrangler aims to fill the gap between AWS analytics services (Glue, Athena, EMR, Redshift) and the most popular Python libraries, for lightweight workloads.

sparksql-protobuf - Read SparkSQL parquet file as RDD[Protobuf]

  •    Scala

This library provides utilities for working with Protobuf objects in Spark SQL. It provides a way to read a Parquet file written by Spark SQL back as an RDD of compatible Protobuf objects, and it can also convert an RDD of Protobuf objects into a DataFrame. Usage requires a SparkContext, the Parquet path, and the Protobuf class.

pucket - Bucketing and partitioning system for Parquet

  •    Scala

Parquet + Bucket = Pucket. Pucket is a Scala library which provides a simple partitioning system for Parquet.
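Bucketing in this sense means deterministically assigning each record to one of a fixed number of buckets (each typically written as one file) based on a key, so that all records sharing a key land together. A minimal, library-free sketch of the idea (not Pucket's API):

```python
import hashlib
from collections import defaultdict

def bucket_for(key, num_buckets):
    """Map a record key to a stable bucket id via a hash.
    md5 is used here only for its stable, well-distributed output."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % num_buckets

def partition(records, key_field, num_buckets):
    """Group records into buckets; each bucket would become one file."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[bucket_for(rec[key_field], num_buckets)].append(rec)
    return dict(buckets)

rows = [{"user": u, "n": i} for i, u in enumerate(["a", "b", "a", "c"])]
parts = partition(rows, "user", num_buckets=4)
```

Because the bucket id depends only on the key, a reader looking for one key can open a single bucket instead of scanning every file.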