Displaying 1 to 16 from 16 results

DataflowJavaSDK - Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines

  •    Java

Google Cloud Dataflow SDK for Java is a distribution of Apache Beam designed to simplify usage of Apache Beam on Google Cloud Dataflow service. This artifact includes the parent POM for other Dataflow SDK artifacts.

spark-py-notebooks - Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

  •    Jupyter

This is a collection of IPython notebook/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the Python language. If Python is not your language, and it is R, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if your are interested in being introduced to some basic Data Science Engineering, you might find these series of tutorials interesting. There we explain different concepts and applications using Python and R.

pachyderm - Reproducible Data Science at Scale!

  •    Go

Pachyderm is a tool for production data pipelines. If you need to chain together data scraping, ingestion, cleaning, munging, wrangling, processing, modeling, and analysis in a sane way, then Pachyderm is for you. If you have an existing set of scripts which do this in an ad-hoc fashion and you're looking for a way to "productionize" them, Pachyderm can make this easy for you. Install Pachyderm locally or deploy on AWS/GCE/Azure in about 5 minutes.




spaCy - 💫 Industrial-strength Natural Language Processing (NLP) with Python and Cython

  •    Python

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products. spaCy comes with pre-trained statistical models and word vectors, and currently supports tokenization for 20+ languages. It features the fastest syntactic parser in the world, convolutional neural network models for tagging, parsing and named entity recognition and easy deep learning integration. It's commercial open-source software, released under the MIT license. 💫 Version 2.0 out now! Check out the new features here.

datumbox-framework - Datumbox is an open-source Machine Learning framework written in Java which allows the rapid development of Machine Learning and Statistical applications

  •    Java

Datumbox is an open-source Machine Learning Framework written in Java which allows the rapid development of Machine Learning and Statistical applications.


sciblog_support - Support content for my blog

  •    Jupyter

This repo contains the projects, additional information and code to support my blog: sciblog. You can find a list of all the post I made in this file.

DataScienceVM - Tools and Docs on the Azure Data Science Virtual Machine (http://aka.ms/dsvm)

  •    HTML

The Data Science Virtual Machine (DSVM) is a customized VM image on Microsoft’s Azure cloud built specifically for doing data science. It has many popular data science and other tools pre-installed and pre-configured to jump-start building intelligent applications for advanced analytics. It is available on Windows Server 2016, Windows Server 2012, and on Linux. We offer Linux edition of the DSVM in either Ubuntu 16.04 LTS or on OpenLogic 7.2 CentOS-based Linux distributions. You can try the Data Science VM for free for 30 days (with $200 credits) with a free Azure Trial. The Linux (Ubuntu-based) DSVM also provides a test drive through a button on the product page. The Test Drive will provide full access to you own instance of the VM with just a free Microsoft account (No Azure subscription or CC needed).On this repo, we will feature tools, tips and extensions (see below) to the Data Science VM. We invite the DSVM user community to contribute any useful tools or scripts, extensions you may have written to enhance the user experience on the DSVM.

spark-r-notebooks - R on Apache Spark (SparkR) tutorials for Big Data analysis and Machine Learning as IPython / Jupyter notebooks

  •    Jupyter

This is a collection of Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, by using the R language. If your are interested in being introduced to some basic Data Science Engineering concepts and applications, you might find these series of tutorials interesting. There we explain different concepts and applications using Python and R. Additionally, if you are interested in using Python with Spark, you can have a look at our pySpark notebooks.

Vertica-ML-Python - Vertica-ML-Python is a Python library that exposes sci-kit like functionality to conduct data science projects on data stored in Vertica, thus taking advantage Vertica’s speed and built-in analytics and machine learning capabilities

  •    Jupyter

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. Vertica-ML-Python is a Python library that exposes sci-kit like functionality to conduct data science projects on data stored in Vertica, thus taking advantage Vertica’s speed and built-in analytics and machine learning capabilities. It supports the entire data science life cycle, uses a ‘pipeline’ mechanism to sequentialize data transformation operation (called Resilient Vertica Dataset), and offers multiple graphical rendering possibilities.

docker-kafka-alpine - Alpine Linux based Kafka Docker Image

  •    Shell

This will create a single-node kafka broker (listening on localhost:9092), a local zookeeper instance and create the topic test-topic with 1 replication-factor and 1 partition.

rsparkling - RSparkling: Use H2O Sparkling Water from R (Spark + R + Machine Learning)

  •    R

Please submit issues, questions and PRs in the new location. The current repo is not maintained. The repository has been moved for several reasons, mainly to improve the integrations with Sparkling Water and for the stability reasons.

data-science-live-book - An open source book to learn data science, data analysis and machine learning, suitable for all ages!

  •    TeX

This book is now available at Amazon in [Kindle]( Link: http://a.co/d/dIj1XwD) Black & White and color 📗 🚀. Most of the written R code can be used in real scenarios! I worked on the funModeling R package at the same time, so it is used many times in the book.