
D3 - A JavaScript visualization library for HTML and SVG

  •    Javascript

D3 is a small, free JavaScript library for manipulating HTML documents based on data. D3 can help you quickly visualize your data as HTML or SVG, handle interactivity, and incorporate smooth transitions and staged animations into your pages. You can use D3 as a visualization framework (like Protovis), or you can use it to build dynamic pages (like jQuery).

Optimus - Agile Data Science Workflows made easy with Python and Spark.

  •    Python

Optimus is the missing framework to profile, clean, and process data and do ML in a distributed fashion using Apache Spark (PySpark). The "10 minutes to Optimus" notebook covers the basics you need to start working.

Porter - Data import abstraction library for consuming Web APIs and other data sources.

  •    PHP

Porter is the PHP data importer. She fetches data from anywhere, from the local file system to third-party online services, and returns an iterator. Porter is a fully pluggable import framework that can be extended with connectors for any protocol and transformers to manipulate data immediately after import. Ready-to-use data providers include all the necessary connectors and other dependencies to access popular online services, such as Stripe for online payments, the European Central Bank for foreign exchange rates, and Steam for its complete PC games library. Porter's provider library is limited right now, and some implementations are incomplete, but we hope the PHP community will rally around Porter's abstractions so that it becomes the de facto framework for publishing online services, APIs, web scrapers and data dumps. Porter's interfaces have undergone intensive scrutiny and several iterations during years of production use to ensure they are efficient, robust, flexible, testable and easy to implement.
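The connector, transformer, and iterator pattern described above can be sketched roughly as follows. This is an illustrative Python sketch, not Porter's PHP API: `import_records`, `to_cents`, and the sample records are hypothetical names invented for this example, and an in-memory list stands in for a real connector.

```python
from typing import Any, Callable, Iterable, Iterator

Record = dict[str, Any]
Transformer = Callable[[Record], Record]


def import_records(source: Iterable[Record], transformers: list[Transformer]) -> Iterator[Record]:
    """Lazily yield each record from the source after applying every transformer."""
    for record in source:
        for transform in transformers:
            record = transform(record)
        yield record


def to_cents(record: Record) -> Record:
    """Hypothetical transformer: convert a decimal amount string to integer cents."""
    return {**record, "amount": round(float(record["amount"]) * 100)}


# Hypothetical in-memory source standing in for a file or web API connector.
source = [{"id": 1, "amount": "12.50"}, {"id": 2, "amount": "3.00"}]
imported = list(import_records(source, [to_cents]))
```

Because `import_records` is a generator, records are fetched and transformed one at a time as the caller iterates, which is the main reason an importer might return an iterator rather than a full list.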

Zingg - Scalable fuzzy matching for data mastering, deduplication and entity resolution

  •    Java

Zingg is a scalable fuzzy matching tool for data mastering, deduplication and entity resolution. Real-world data contains multiple records belonging to the same customer. These records can be in a single system or spread across several, and they have variations across fields, which makes them hard to combine, especially with growing data volumes. Zingg integrates different records of an entity such as a customer, patient, supplier, or product, in the same or disparate data sources.
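To make the idea of fuzzy matching concrete, here is a minimal sketch that groups records whose names are nearly identical. It assumes a simple string-similarity ratio from Python's standard `difflib` module and a greedy grouping pass; it is not Zingg's algorithm or API, and the record fields and the 0.8 threshold are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio between 0 and 1, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def group_records(records: list[dict], threshold: float = 0.8) -> list[list[dict]]:
    """Greedily group records whose name is similar to a group's first record."""
    groups: list[list[dict]] = []
    for record in records:
        for group in groups:
            if similarity(record["name"], group[0]["name"]) >= threshold:
                group.append(record)
                break
        else:
            # No existing group was close enough, so start a new one.
            groups.append([record])
    return groups


# Hypothetical customer records with a spelling variation.
records = [
    {"id": 1, "name": "Jonathan Smith"},
    {"id": 2, "name": "Jonathon Smith"},
    {"id": 3, "name": "Maria Garcia"},
]
groups = group_records(records)
```

A pairwise comparison like this does not scale to large data sets, which is the problem a dedicated entity resolution tool is meant to address.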

pglogical - Logical Replication extension for PostgreSQL 9

  •    C

The pglogical extension provides logical streaming replication for PostgreSQL, using a publish/subscribe model. It is based on technology developed as part of the BDR project (http://2ndquadrant.com/BDR). To use pglogical, both the provider and the subscriber must be running PostgreSQL 9.4 or newer.

prose - Microsoft Program Synthesis using Examples SDK is a framework of technologies for the automatic generation of programs from input-output examples

  •    CSharp

The Program Synthesis using Examples (PROSE) SDK includes a set of technologies for the automatic generation of programs from input-output examples. This repo includes samples and sample data for the Microsoft PROSE SDK. This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

php-serializer - Serialize PHP variables, including objects, in any format

  •    PHP

In the context of data storage, serialization is the process of translating data structures or object state into a format that can be stored (for example, in a file or memory buffer) or transmitted across a network connection, and reconstructed later in the same or another computer environment. PHP's native serialize() and unserialize() functions rely on having the serialized classes loaded and available at runtime, and they tie your unserialization process to the PHP platform.
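The round trip described above can be illustrated with a small sketch that serializes object state to a platform-neutral format (JSON) and reconstructs it later. This is a Python illustration of the general concept, not php-serializer's API; the `Invoice` class and the `serialize` and `deserialize` helpers are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Invoice:  # hypothetical example type
    number: str
    total_cents: int
    paid: bool


def serialize(invoice: Invoice) -> str:
    """Translate object state into a JSON string that can be stored or sent."""
    return json.dumps(asdict(invoice), sort_keys=True)


def deserialize(payload: str) -> Invoice:
    """Reconstruct the object later, possibly in a different process."""
    return Invoice(**json.loads(payload))


original = Invoice(number="INV-001", total_cents=1999, paid=False)
payload = serialize(original)
restored = deserialize(payload)
```

Because the payload is plain JSON, any platform with a JSON parser can read it, which is the kind of portability a format-agnostic serializer aims for.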

pycsvw - A tool to read CSV files with CSVW metadata and transform them into other formats.

  •    Python

Python implementation of a variant of the W3C CSV on the Web specification, primarily for efficient RDF and JSON generation from a CSV file and its metadata. The supported variant of the recommendation has some additional features, mostly around specifying RDF to be an ordered container, and also some restrictions as listed below. All outputs are generated in UTF-8 encoding.

sjmisc - Data transformation and utility functions for R

  •    R

Data preparation is a common task in research, and it usually takes the most time in the analytical process. Packages for data preparation have been released recently as part of the tidyverse, focusing on the transformation of data sets. Packages with a special focus on the transformation of variables, which fit into the workflow and design philosophy of the tidyverse, are missing. sjmisc tries to fill this gap. Basically, this package complements the dplyr package in that sjmisc takes over data transformation tasks on variables, like recoding, dichotomizing or grouping variables, setting and replacing missing values, etc. A distinctive feature of sjmisc is the support for labelled data, which is especially useful for users who often work with data sets from other statistical software packages like SPSS or Stata.

sqawk - Like Awk, but with SQL and table joins

  •    Perl

Sqawk is an Awk-like program that uses SQL and can combine data from multiple files. It is powered by SQLite, and the script you give it is your SQL.
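The underlying idea, loading delimited text into SQLite tables and then querying them with SQL, can be sketched with Python's standard library. This is only an illustration of the approach, not Sqawk's implementation or command-line interface; the table names, column names, and sample data are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample inputs standing in for two files on disk.
USERS_CSV = "1,alice\n2,bob\n"
ORDERS_CSV = "1,book\n1,pen\n2,lamp\n"


def load_table(conn: sqlite3.Connection, name: str, columns: list[str], text: str) -> None:
    """Create a table and fill it with the rows parsed from CSV text."""
    cols = ", ".join(columns)
    marks = ", ".join("?" for _ in columns)
    conn.execute(f"CREATE TABLE {name} ({cols})")
    conn.executemany(f"INSERT INTO {name} VALUES ({marks})", csv.reader(io.StringIO(text)))


conn = sqlite3.connect(":memory:")
load_table(conn, "users", ["id", "name"], USERS_CSV)
load_table(conn, "orders", ["user_id", "item"], ORDERS_CSV)

# The "script" is plain SQL, here a join across the two inputs.
rows = conn.execute(
    "SELECT users.name, orders.item FROM users "
    "JOIN orders ON users.id = orders.user_id "
    "ORDER BY users.name, orders.item"
).fetchall()
```

Each input becomes its own table, so combining files is just a SQL join.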

temme - 📄 Concise selector to extract JSON from HTML.

  •    TypeScript

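If you are not yet familiar with temme, you can start with the Douban Movie example. The online version also includes some other, shorter examples. For instance, this example scrapes a movie's basic information and rating information from a Douban Movie page. This example scrapes the review list from a Tmall product detail page, including users' basic information, initial and follow-up reviews, and links to the photos they posted.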

aws-dbs-refarch-datalake - Reference Architectures for Datalakes on AWS

  •    HTML

A datalake is a data repository that stores data in its raw format until it is used for analytics. It is designed to store massive amounts of data at scale. A schema is applied to a dataset in the data lake as part of the transformation performed when the data is read. Keeping track of all of the raw assets that are loaded into your datalake, and then tracking all of the new data assets and versions created by data transformation, data processing, and analytics, can be a major challenge. An essential component of an Amazon S3 based data lake is a Data Catalog. A data catalog is designed to provide a single source of truth about the contents of the data lake: rather than reasoning about storage buckets and prefixes, end users interact with the more familiar structures of databases, tables, and partitions.
