
spaCy - 💫 Industrial-strength Natural Language Processing (NLP) with Python and Cython

  •    Python

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. spaCy comes with pre-trained statistical models and word vectors, and currently supports tokenization for 20+ languages. It features the fastest syntactic parser in the world, convolutional neural network models for tagging, parsing and named entity recognition and easy deep learning integration. It's commercial open-source software, released under the MIT license. 💫 Version 2.0 out now!

vue-virtual-scroll-list - A Vue component that supports big data lists with high scroll performance.

  •    Javascript

If you are looking for a Vue component that supports big data lists with high scroll performance, you are in the right place. It is tiny and very easy to use.

TrailDB - Efficient tool for storing and querying series of events

  •    C

TrailDB is a library, implemented in C, which allows you to query series of events at blazing speed. TrailDB is also optimized for speed of development: Use its simple API with your favorite language, in your favorite environment. TrailDB's secret sauce is data compression. It leverages predictability of time-based data to compress your data to a fraction of its original size. In contrast to traditional compression, you can query the encoded data directly, decompressing only the parts you need.
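
To illustrate why time-ordered event data compresses so well, here is a minimal Python sketch of delta encoding: each timestamp is stored as the gap to the previous event, which tends to be a small, repetitive number. This is only an illustration of the general idea, not TrailDB's actual storage format or API; the function names and sample timestamps are hypothetical.

```python
from itertools import accumulate

def delta_encode(timestamps):
    """Keep the first timestamp, then only the gap to the previous one."""
    if not timestamps:
        return []
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def delta_decode(deltas, upto=None):
    """Rebuild timestamps; `upto` decodes only a prefix of the series."""
    return list(accumulate(deltas[:upto]))

# Hypothetical events arriving once a minute.
events = [1500000000, 1500000060, 1500000120, 1500000180]
encoded = delta_encode(events)
```

The `upto` argument hints at the "decompress only the parts you need" property: a reader interested in the first few events never has to touch the rest of the series.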

DataScienceVM - Tools and Docs on the Azure Data Science Virtual Machine (http://aka.ms/dsvm)

  •    HTML

The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science. It has many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics. It is available on Windows Server 2016, Windows Server 2012, and on Linux. We offer the Linux edition of the DSVM on either Ubuntu 16.04 LTS or OpenLogic 7.2 CentOS-based Linux distributions. You can try the Data Science VM for free for 30 days (with $200 credits) with a free Azure Trial. The Linux (Ubuntu-based) DSVM also provides a test drive through a button on the product page. The Test Drive will provide full access to your own instance of the VM with just a free Microsoft account (no Azure subscription or credit card needed). On this repo, we will feature tools, tips and extensions to the Data Science VM. We invite the DSVM user community to contribute any useful tools, scripts, or extensions you may have written to enhance the user experience on the DSVM.

conjure-up - Deploying complex solutions, magically.

  •    Python

Installing big software like whoa. This is the runtime application for processing spells to get those big software solutions up and going with as little hindrance as possible.

AzureDataLake - Samples and Docs for Azure Data Lake Store and Analytics


This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

usql - U-SQL Examples and Issue Tracking

  •    CSharp

U-SQL is a new language from Microsoft for processing big data. U-SQL combines the familiar syntax of SQL with the expressiveness of custom code written in C#, on top of a scale-out runtime that can handle any size data.

trck - Query engine for TrailDB

  •    C

trck is a tool to query TrailDBs for aggregate metrics based on individual user behavior. trck is a domain-specific language that defines a finite state machine to find patterns in data. These programs are compiled into highly optimized parallel native code.
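
As a rough illustration of the finite-state-machine idea, the Python sketch below tracks one state per user and counts users who reach a final state. It does not use trck's own language or compiler; the event names, states, and the `count_conversions` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical pattern: a user views a product and later purchases.
TRANSITIONS = {
    ("start", "view"): "viewed",
    ("viewed", "purchase"): "converted",
}

def count_conversions(events):
    """events: iterable of (user_id, event_name) pairs in time order."""
    state = defaultdict(lambda: "start")
    for user, name in events:
        # Stay in the current state when no transition matches.
        state[user] = TRANSITIONS.get((state[user], name), state[user])
    return sum(1 for s in state.values() if s == "converted")

events = [("u1", "view"), ("u2", "purchase"), ("u1", "purchase"), ("u3", "view")]
conversions = count_conversions(events)
```

Only `u1` both views and then purchases, so the aggregate metric counts a single converted user.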

webhdfs - Node.js WebHDFS REST API client

  •    Javascript

Hadoop WebHDFS REST API (2.2.0) client library for node.js with an asynchronous interface modeled on the fs module.

big-data-lite - Samples to the Oracle Big Data Lite VM

  •    Java

The samples contained in this repo are used in Oracle Big Data Lite VM. Each branch is associated with a Big Data Lite Version; version 4.3.0 is the first release that is using github. This repository includes scripts to quickly install third-party software that is useful to play with some demos. Please see the README in the thirdparty directory.

countly-sdk-js - Countly Product Analytics SDK for Icenium and Phonegap

  •    Java

Questions? Visit http://community.count.ly. Countly is an innovative, real-time, open source mobile analytics and push notifications platform. It collects data from mobile devices, and visualizes this information to analyze mobile application usage and end-user behavior. There are two parts of Countly: the server that collects and analyzes data, and the mobile SDK that sends this data. Both parts are open source with different licensing terms.

docker-kafka-alpine - Alpine Linux based Kafka Docker Image

  •    Shell

This will create a single-node Kafka broker (listening on localhost:9092) and a local ZooKeeper instance, and create the topic test-topic with a replication factor of 1 and 1 partition.

SGDLibrary - MATLAB library for stochastic optimization algorithms: Version 1.0.17

  •    Terra

The SGDLibrary is a pure-MATLAB library of a collection of stochastic optimization algorithms. It solves unconstrained minimization problems of the form min f(x) = sum_i f_i(x). The SGDLibrary is also operable on GNU Octave (free software compatible with many MATLAB scripts). Note that the SGDLibrary internally contains the GDLibrary.
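
For readers unfamiliar with this problem form, the following is a minimal sketch of plain stochastic gradient descent written in Python rather than MATLAB. It is not taken from the SGDLibrary; the `sgd` function, its parameters, the 1/t step size, and the toy objective are assumptions chosen for illustration.

```python
import random

def sgd(grad_i, n, x0, steps=20000, seed=0):
    """Minimize sum_i f_i(x) using one randomly chosen gradient per step."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, steps + 1):
        i = rng.randrange(n)            # pick one component f_i at random
        x -= (1.0 / t) * grad_i(i, x)   # decaying step size 1/t
    return x

# Toy problem (assumed for illustration): f_i(x) = 0.5 * (x - a_i)^2,
# whose sum is minimized at the mean of the a_i.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x_star = sgd(lambda i, x: x - a[i], n=len(a), x0=0.0)
```

Each step touches only one f_i, which is what makes the approach attractive when the sum runs over a very large dataset; the result lands close to the mean of `a`, 3.5, rather than exactly on it.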

aws-etl-orchestrator - A serverless architecture for orchestrating ETL jobs in arbitrarily-complex workflows using AWS Step Functions and AWS Lambda

  •    Python

Extract, transform, and load (ETL) operations collectively form the backbone of any modern enterprise data lake. They transform raw data into useful datasets and, ultimately, into actionable insight. An ETL job typically reads data from one or more data sources, applies various transformations to the data, and then writes the results to a target where data is ready for consumption. The sources and targets of an ETL job could be relational databases in Amazon Relational Database Service (Amazon RDS) or on-premises, a data warehouse such as Amazon Redshift, or object storage such as Amazon Simple Storage Service (Amazon S3) buckets. Amazon S3 as a target is especially commonplace in the context of building a data lake in AWS. AWS offers AWS Glue, a fully managed extract, transform, and load service that helps customers author and deploy ETL jobs and prepare and load their data for analytics. Other AWS services can also be used to implement and manage ETL jobs. They include AWS Database Migration Service (AWS DMS), Amazon EMR (using the Steps API), and even Amazon Athena.
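
The read, transform, write shape of an ETL job can be sketched with nothing but the Python standard library. This is a local toy, not the Step Functions and Lambda architecture this project provides: the CSV text stands in for a source, an in-memory SQLite database stands in for the target, and the column names and data are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw source data, including one malformed record.
RAW_CSV = "order_id,amount\n1,10.50\n2,not-a-number\n3,4.25\n"

def extract(text):
    """Read rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Convert types and drop records that fail to parse."""
    clean = []
    for row in rows:
        try:
            clean.append((int(row["order_id"]), float(row["amount"])))
        except ValueError:
            continue  # skip malformed records
    return clean

def load(rows, conn):
    """Write the cleaned rows to the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

Keeping the three stages as separate functions mirrors how an orchestrator would treat them: each stage can be retried, replaced, or chained into a larger workflow independently.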

AverageShiftedHistograms.jl - ASH density estimation in pure Julia

  •    Julia

Lightning fast density estimation in Julia. An Averaged Shifted Histogram (ASH) is essentially Kernel Density Estimation over a fine-partition histogram. ASH only requires constant memory and can be constructed on-line, allowing you to estimate distributions for arbitrarily big data.
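
The idea can be sketched in Python (not Julia, and not this package's API): observations only ever increment a fixed array of fine bin counts, and the density estimate smooths those counts with triangular weights. The `ASH` class, its parameters, and the Gaussian test data are assumptions made for illustration.

```python
import random

class ASH:
    """Averaged shifted histogram over a fixed grid of fine bins."""

    def __init__(self, lo, hi, nbins=200, m=5):
        self.lo, self.hi, self.nbins, self.m = lo, hi, nbins, m
        self.width = (hi - lo) / nbins
        self.counts = [0] * nbins   # the only per-data state: constant memory
        self.n = 0

    def update(self, x):
        """Fold one observation into the fine histogram (on-line)."""
        if self.lo <= x < self.hi:
            k = min(int((x - self.lo) / self.width), self.nbins - 1)
            self.counts[k] += 1
            self.n += 1

    def density(self):
        """Smooth the bin counts with triangular weights (1 - |i|/m)."""
        if self.n == 0:
            return [0.0] * self.nbins
        h = self.m * self.width     # effective smoothing bandwidth
        out = []
        for k in range(self.nbins):
            s = 0.0
            for i in range(1 - self.m, self.m):
                if 0 <= k + i < self.nbins:
                    s += (1 - abs(i) / self.m) * self.counts[k + i]
            out.append(s / (self.n * h))
        return out

rng = random.Random(0)
ash = ASH(-5.0, 5.0)
for _ in range(10000):
    ash.update(rng.gauss(0.0, 1.0))
dens = ash.density()
area = sum(dens) * ash.width        # should be close to 1
```

Memory use depends only on `nbins`, not on how many observations have been seen, which is why the estimator suits streams too large to hold in memory.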