spark-py-notebooks - Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks


This is a collection of IPython/Jupyter notebooks intended to train the reader in different Apache Spark concepts, from basic to advanced, using the Python language. If Python is not your language, and it is R, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if you are interested in an introduction to basic Data Science Engineering, you might find this series of tutorials interesting; there we explain different concepts and applications using Python and R.
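
To give a flavor of the starting point, here is a minimal self-contained PySpark example of the kind the early notebooks cover (a sketch, not code taken from the notebooks themselves):

```python
from pyspark.sql import SparkSession

# Start a local Spark session (the notebooks assume a similar setup)
spark = SparkSession.builder.master("local[*]").appName("intro").getOrCreate()

# A classic first exercise: distribute a list and aggregate it in parallel
rdd = spark.sparkContext.parallelize(range(1, 101))
total = rdd.filter(lambda x: x % 2 == 0).sum()
print(total)  # 2550, the sum of the even numbers from 2 to 100

spark.stop()
```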

awesome-spark - A curated list of awesome Apache Spark packages and resources.


A curated list of awesome Apache Spark packages and resources. Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance (Wikipedia 2017).

MMLSpark - Microsoft Machine Learning for Apache Spark


MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly scalable predictive and analytical models for large image and text datasets. MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.
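
For context, MMLSpark stages are designed to compose with the standard Spark ML Pipeline abstraction. The sketch below is a plain pyspark.ml pipeline (deliberately not MMLSpark-specific API) illustrating the pipeline model it extends; the toy DataFrame is ours, not from the project:

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.getOrCreate()
train_df = spark.createDataFrame(
    [("spark is great", 1.0), ("boring report", 0.0)], ["text", "label"])

# A standard Spark ML pipeline; MMLSpark contributes extra stages
# (e.g. CNTK models, OpenCV transforms) that slot in alongside these.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(labelCol="label", featuresCol="features"),
])
model = pipeline.fit(train_df)
model.transform(train_df).select("text", "prediction").show()
```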

spark-nlp - Natural Language Understanding Library for Apache Spark.


John Snow Labs Spark-NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. The library has been uploaded to the spark-packages repository: https://spark-packages.org/package/JohnSnowLabs/spark-nlp.
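
A minimal sketch of the library's annotator pattern, where annotators chain through named annotation columns (module paths and method names reflect the documented PySpark API but may differ across versions, so treat them as assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

spark = SparkSession.builder.getOrCreate()  # Spark-NLP jars must be on the classpath
df = spark.createDataFrame([("Spark NLP annotates text at scale.",)], ["text"])

# Each annotator consumes and produces annotation columns
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
token = Tokenizer().setInputCols(["document"]).setOutputCol("token")

annotated = Pipeline(stages=[document, token]).fit(df).transform(df)
annotated.select("token.result").show(truncate=False)
```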

python_mozetl - ETL jobs for Firefox Telemetry


This repository is a collection of ETL jobs for Firefox Telemetry. Jobs committed to python_mozetl can be scheduled via Airflow or ATMO. We provide a testing suite and code review, which make your job more maintainable. Centralizing our jobs in one repository allows for code reuse and easier collaboration.
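
For illustration, a hypothetical job in the shape such a repository encourages: a small, testable function that takes a Spark session, transforms a telemetry dataset, and writes the result out. All names here are illustrative, not the repository's actual API:

```python
from pyspark.sql import SparkSession, functions as F

def etl_job(spark: SparkSession, input_path: str, output_path: str) -> None:
    """Hypothetical daily-active-users rollup over telemetry pings."""
    pings = spark.read.parquet(input_path)
    daily = (pings
             .groupBy("submission_date", "channel")
             .agg(F.countDistinct("client_id").alias("dau")))
    daily.write.mode("overwrite").parquet(output_path)
```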

data-analytics-machine-learning-big-data - Slides, code and more for my class: Data Analytics and Machine Learning on Big Data


If you want to install and run everything on your own machine, here are the best tutorials I've found for getting Python and Spark running locally. To visualize the decision trees in Jupyter, you will also need to install Graphviz along with its Python package.
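
The reason Graphviz is needed: scikit-learn exports trees as DOT text, and the graphviz Python package renders that source inline in Jupyter. A sketch (the class materials may use a different dataset or workflow):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
import graphviz  # also requires the system Graphviz binaries

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

# Export the fitted tree as DOT source, then render it in the notebook
dot = export_graphviz(clf, out_file=None,
                      feature_names=iris.feature_names,
                      class_names=iris.target_names, filled=True)
graphviz.Source(dot)  # displays the tree when evaluated in a Jupyter cell
```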

pyspark-notebook - Pyspark Notebook With Docker


Run the notebook with docker-compose: it keeps your arguments and settings in a single file and runs everything together in an isolated environment.
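
A minimal docker-compose.yml in this spirit (the image name and paths are assumptions based on the stock Jupyter Docker Stacks image; the repo ships its own configuration):

```yaml
version: "3"
services:
  pyspark:
    image: jupyter/pyspark-notebook    # assumption: the repo may build its own image
    ports:
      - "8888:8888"                    # Jupyter UI on the host
    volumes:
      - ./notebooks:/home/jovyan/work  # persist notebooks outside the container
```

With this in place, `docker-compose up` starts the notebook server with all settings captured in the file.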

PySparkGeoAnalysis - Interactive Workshop on GeoAnalysis using PySpark


This workshop will introduce you to Apache Spark via the exciting domain of Geospatial Analysis.
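
A common GeoAnalysis building block, sketched here with Shapely inside a PySpark UDF for point-in-polygon tests (the DataFrame and polygon are toy examples, not the workshop's data):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import BooleanType
from shapely.geometry import Point, Polygon

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0.5, 51.0), (3.0, 48.9)], ["lon", "lat"])

# A rough bounding polygon; a real workshop would load proper boundaries
area = Polygon([(-1.0, 50.0), (-1.0, 52.0), (1.0, 52.0), (1.0, 50.0)])

@F.udf(returnType=BooleanType())
def in_area(lon, lat):
    return area.contains(Point(lon, lat))

df.filter(in_area("lon", "lat")).show()  # keeps only the first point
```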

jgit-spark-connector - jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis


jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis. It is written in Scala and built on top of Apache Spark to enable rapid construction of custom analysis pipelines and the processing of large numbers of Git repositories stored in HDFS in the Siva file format. It is accessible via both the Scala and Python Spark APIs, and capable of running on large-scale distributed clusters.
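
A rough sketch of what the Python side can look like; the module path and constructor arguments below are assumptions recalled from the project's documentation, so verify them against the README:

```python
from pyspark.sql import SparkSession
from sourced.engine import Engine  # assumption: the Python package name

spark = SparkSession.builder.appName("code-analysis").getOrCreate()

# Point the engine at Siva-formatted repositories (e.g. in HDFS)
engine = Engine(spark, "/path/to/siva-files", "siva")
engine.repositories.show()  # repositories, references, commits, ... as DataFrames
```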

devops-python-tools - DevOps CLI Tools for Hadoop, Spark, HBase, Log Anonymizer, Ambari Blueprints, AWS CloudFormation, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Elasticsearch, Solr, Travis CI, Pig, IPython - Python / Jython Tools


A few of the Big Data, NoSQL & Linux tools I've written over the years. All programs have --help to list the available options. For many more tools, see the DevOps Perl Tools and Advanced Nagios Plugins Collection repos, which contain many more Hadoop, NoSQL, web and infrastructure tools and Nagios plugins.

Spark-practice - Apache Spark (PySpark) Practice on Real Data


In this repo, I use Spark (PySpark) to explore a download log file in CSV format. The repo can be considered an introduction to the very basic functions of Spark and may be helpful for Spark beginners. Additionally, we use a real log file as sample data and try to cover some operations commonly needed in daily work. If you would like to see more operations demonstrated on minimal sample data, you can refer to a separate script I prepared, Basic Operations in PySpark.
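
The kind of first steps the repo walks through, sketched below; the file name and column names are hypothetical, since the repo uses its own log file schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("log-practice").getOrCreate()
log = spark.read.csv("downloads.csv", header=True, inferSchema=True)

log.printSchema()
print(log.count())                      # how many download records?
(log.groupBy("country").count()
    .orderBy(F.desc("count")).show(10)) # top countries by downloads
```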

learn-by-examples - Real-world Spark pipelines examples


This repository serves as a base for learning Spark through examples drawn from real-world data sets. learn-by-examples by Elias Abou Haydar and Maciej Szymkiewicz is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/awesome-spark/learn-by-examples.

spark-gotchas - Spark Gotchas. A subjective compilation of Apache Spark tips and tricks


This work, excluding code examples, is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license. Accompanying code and code snippets are licensed under the MIT license.

tdigest - t-Digest data structure in Python


This is a Python implementation of Ted Dunning's t-digest data structure. The t-digest is designed for computing accurate estimates, such as percentiles, quantiles, and trimmed means, from either streaming or distributed data. Two t-digests can be added together, making the data structure ideal for map-reduce settings, and a digest can be serialized into much less than 10 kB (instead of storing the entire list of data). tdigest is compatible with both Python 2 and Python 3.
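
Typical usage of the package (a sketch; see the project README for the full API):

```python
from tdigest import TDigest

digest = TDigest()
for x in range(1000):
    digest.update(x)            # feed streaming values one at a time

print(digest.percentile(50))    # approximate median

# Digests merge, which is what makes them map-reduce friendly
other = TDigest()
other.batch_update(range(1000, 2000))
combined = digest + other
print(combined.percentile(99))  # approximate 99th percentile of the merged data
```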

pyspark-stubs - A collection of Apache Spark stub files.


A collection of Apache Spark stub files, generated by stubgen and manually edited to include accurate type hints. Tests and configuration files were originally contributed to the Typeshed project; please refer to its contributors list and license for details.
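
For readers unfamiliar with stub files: a .pyi file declares signatures with type hints and no implementations, and type checkers such as mypy read it in place of the real module. A hypothetical excerpt in the spirit of these stubs (not copied from the project's actual files):

```python
# rdd.pyi-style declarations (hypothetical excerpt)
from typing import Callable, Generic, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")

class RDD(Generic[T]):
    def map(self, f: Callable[[T], U]) -> RDD[U]: ...
    def filter(self, f: Callable[[T], bool]) -> RDD[T]: ...
    def collect(self) -> List[T]: ...
```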