empujar - When you need to push data around, you push it. A node.js ETL tool.


When you need to push data around, you push it. Push it real good. An ETL and operations tool. Empujar's top-level object is a "book", which contains "chapters", which in turn contain "pages". Chapters are executed one by one in order, and the pages within a chapter can run in parallel (up to a threading limit you specify).
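
The execution model described above (chapters in sequence, pages in parallel up to a limit) can be sketched in plain Node.js. This is only an illustration of the idea and is not Empujar's actual API: `runBook`, `runPages`, and the `{ chapters: [{ name, pages }] }` shape are hypothetical names chosen for this example, and each page is assumed to be an async function.

```javascript
// Hypothetical sketch of a book -> chapters -> pages runner.
// These names are assumptions for illustration, NOT Empujar's API.

// Run async page functions with at most `limit` of them in flight at once.
async function runPages(pages, limit) {
  const results = new Array(pages.length);
  let next = 0; // index of the next page that has not been started

  // Each worker keeps pulling the next unstarted page until none remain.
  async function worker() {
    while (next < pages.length) {
      const index = next++;
      results[index] = await pages[index]();
    }
  }

  const workerCount = Math.max(1, Math.min(limit, pages.length));
  const workers = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}

// Run chapters strictly one after another; pages inside a chapter overlap.
async function runBook(book, limit) {
  const output = [];
  for (const chapter of book.chapters) {
    const pageResults = await runPages(chapter.pages, limit);
    output.push({ chapter: chapter.name, pages: pageResults });
  }
  return output;
}
```

Calling `runBook(book, 2)` on a book whose first chapter has three pages would start two pages immediately, start the third as soon as one finishes, and only then move on to the next chapter. A worker-pool loop is used here instead of batching so that a slow page does not hold up the remaining pages in its chapter.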

https://github.com/taskrabbit/empujar

Dependencies:

async : ^2.1.1
aws-sdk : ^2.10.0
dateformat : ^2.0.0
elasticsearch : ^12.1.0
filesize : ^3.4.1
ftp : ^0.3.10
glob : ^7.1.1
is-running : ^2.0.1
mkdirp : ^0.5.0
mysql : ^2.6.1
optimist : ^0.6.1
pg : ^6.1.0
request : ^2.76.0
s3-upload-stream : ^1.0.7
utf8 : ^2.1.2
winston : ^1.0.0





Related Projects

Apache Beam - Unified model for defining both batch and streaming data-parallel processing pipelines

  •    Java

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

aws-lambda-redshift-loader - Amazon Redshift Database Loader implemented in AWS Lambda

  •    Javascript

With this AWS Lambda function, it's never been easier to get file data into Amazon Redshift. You simply push files into a variety of locations on Amazon S3, and have them automatically loaded into your Amazon Redshift clusters. For automated delivery of streaming data to S3 and subsequently to Redshift, also consider using Amazon Kinesis Firehose.

Apache Tajo - A big data warehouse system on Hadoop

  •    Java

Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load) on large data sets stored on HDFS (Hadoop Distributed File System) and other data sources.

spark-redshift - Redshift data source for Apache Spark

  •    Scala

To ensure the best experience for our customers, we have decided to inline this connector directly in Databricks Runtime. The latest version of Databricks Runtime (3.0+) includes an advanced version of the RedShift connector for Spark that features both performance improvements (full query pushdown) as well as security improvements (automatic encryption). For more information, refer to the Databricks documentation. As a result, we will no longer be making releases separately from Databricks Runtime. A library to load data into Spark SQL DataFrames from Amazon Redshift, and write them back to Redshift tables. Amazon S3 is used to efficiently transfer data in and out of Redshift, and JDBC is used to automatically trigger the appropriate COPY and UNLOAD commands on Redshift.

amazon-redshift-utils - Amazon Redshift Utils contains utilities, scripts and view which are useful in a Redshift environment

  •    Python

Copyright 2014 Amazon.com, Inc. or its affiliates. All Rights Reserved. Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse solution that uses columnar storage to minimise IO, provide high data compression rates, and offer fast performance. This GitHub repository provides a collection of scripts and utilities that will assist you in getting the best performance possible from Amazon Redshift.


falcon-sql-client - Free, open-source SQL client for Windows and Mac 🦅

  •    Javascript

Falcon is a free, open-source SQL editor with inline data visualization. It currently supports connecting to RedShift, MySQL, PostgreSQL, IBM DB2, Impala, MS SQL, and SQLite. Visit plot.ly to learn more or visit the Plotly forum.

falcon - Free, open-source SQL client for Windows and Mac 🦅

  •    Javascript

Falcon is a free, open-source SQL editor with inline data visualization. It currently supports connecting to RedShift, MySQL, PostgreSQL, IBM DB2, Impala, MS SQL, Oracle, SQLite and more (connecting to Oracle requires installing the free Oracle Instant Client). Visit plot.ly to learn more or visit the Plotly forum.

genie - Distributed Big Data Orchestration Service

  •    Java

Genie is a federated job orchestration engine developed by Netflix. Genie provides RESTful APIs to run a variety of big data jobs like Hadoop, Pig, Hive, Spark, Presto, Sqoop and more. It also provides APIs for managing the metadata of many distributed processing clusters and the commands and applications which run on them. See the official website to find documentation about Genie and specific documentation for various releases.

pocket-etl - Extensible Java library that orchestrates batched ETL (extract, transform and load) of data between services using native fluent Java to express your pipeline

  •    Java

Extensible Java library that orchestrates batched ETL (extract, transform and load) of data between services using native fluent Java to express your pipeline.

CloverETL - Rapid Data Integration

  •    Java

A Java-based data integration framework that can be used to transform, map, and manipulate data in various formats (CSV, FIXLEN, XML, XBASE, COBOL, LOTUS, etc.). It can be used standalone or embedded (as a library). Connects to RDBMS/JMS/SOAP/LDAP/S3/HTTP/FTP/ZIP/TAR.

bigdata - Introduction to Big Data

  •    TeX

Download the book in PDF or EPUB. Just like the Internet, Big Data is part of our lives today. From search, online shopping, and video on demand to e-dating, Big Data always plays an important role behind the scenes. Some people claim that the Internet of Things (IoT) will overtake big data as the most hyped technology @Gartner2014. That may come true, but IoT cannot come alive without big data. In this book, we will dive deeply into big data technologies, but first we need to understand what Big Data is.

Gimel - PayPal's Big Data Processing Framework

  •    Scala

Gimel provides a unified Data API to access data from any storage, such as HDFS, GS, Alluxio, Hbase, Aerospike, BigQuery, Druid, Elastic, Teradata, Oracle, MySQL, etc.

go-mysql-elasticsearch - Sync MySQL data into elasticsearch

  •    Go

go-mysql-elasticsearch is a service that syncs your MySQL data into Elasticsearch automatically. It first uses mysqldump to fetch the original data, then syncs data incrementally using the binlog.

x-crack - Weak password scanner, Support: FTP/SSH/SNMP/SSQL/MYSQL/PostGreSQL/REDIS/ElasticSearch/MONGODB

  •    Go

x-crack - Weak password scanner, Support: FTP/SSH/SNMP/SSQL/MYSQL/PostGreSQL/REDIS/ElasticSearch/MONGODB

php-docker-boilerplate - :stew: PHP Docker Boilerplate for Symfony, Wordpress, Joomla or any other PHP Project (NGINX, Apache HTTPd, PHP-FPM, MySQL, Solr, Elasticsearch, Redis, FTP)

  •    Javascript

This is an easily customizable Docker boilerplate for any PHP-based project, such as Symfony Framework, CakePHP, Yii and many other frameworks or applications. This Docker boilerplate is based on the Docker best practices and doesn't use too much magic. Configuration of each Docker container is available in the docker/ directory - feel free to customize.

TYPO3-docker-boilerplate - :stew: TYPO3 Docker Boilerplate project (NGINX, Apache HTTPd, PHP-FPM, MySQL, Solr, Elasticsearch, Redis, FTP)

  •    Shell

This is an easily customizable TYPO3 Docker boilerplate. This Docker boilerplate is based on the Docker best practices and doesn't use too much magic. Configuration of each Docker container is available in the docker/ directory - feel free to customize.

StratoSphere - Cloud Computing Framework for Big Data Analytics

  •    Java

The Stratosphere System is an open-source cluster/cloud computing framework for Big Data analytics. It comprises an extensible higher-level language (Meteor) to quickly compose queries for common and recurring use cases, a parallel programming model (PACT, an extension of MapReduce) to run user-defined operations, and an efficient massively parallel runtime (Nephele) for fault-tolerant execution of acyclic data flows.

incubator-gobblin - Gobblin is a distributed big data integration framework (ingestion, replication, compliance, retention) for batch and streaming systems

  •    Java

Apache Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources (e.g., databases, REST APIs, FTP/SFTP servers, filers) onto Hadoop. Apache Gobblin handles the common routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, and data publishing. Gobblin ingests data from different data sources in the same execution framework, and manages metadata of different sources all in one place. This, combined with other features such as auto scalability, fault tolerance, data quality assurance, extensibility, and the ability to handle data model evolution, makes Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.




