spark-py-notebooks - Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks

  •    Jupyter

This is a collection of IPython/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, using the Python language. If your language is R rather than Python, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if you are interested in an introduction to basic Data Science Engineering, you might find this series of tutorials interesting; there we explain different concepts and applications using Python and R.
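
For a taste of what the notebooks cover, here is a minimal PySpark sketch (illustrative only, not taken from the notebooks): create an RDD, transform it, and run an action.

```python
from pyspark import SparkContext

# Start a local Spark context; the app name is arbitrary.
sc = SparkContext("local[*]", "intro-sketch")

# Distribute a Python range as an RDD, then filter and count it.
numbers = sc.parallelize(range(100))
even_count = numbers.filter(lambda x: x % 2 == 0).count()
print(even_count)  # 50

sc.stop()
```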

awesome-spark - A curated list of awesome Apache Spark packages and resources.

A curated list of awesome Apache Spark packages and resources. Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault-tolerance (Wikipedia 2017).

sparkmagic - Jupyter magics and kernels for working with remote Spark clusters

  •    Python

Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as kernels that you can use to turn Jupyter into an integrated Spark environment. There are two ways to use sparkmagic; head over to the examples section for a demonstration of both modes of execution.
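
A hedged sketch of the magics-based mode, assuming a reachable Livy endpoint (the URL and session name are placeholders):

```python
# In a Jupyter notebook running the plain IPython kernel:
%load_ext sparkmagic.magics

# Attach a Python session to a Livy server (placeholder URL).
%spark add -s demo -l python -u http://livy-host:8998

# Cells starting with %%spark now execute on the remote cluster, e.g.:
# %%spark -s demo
# spark.range(10).count()
```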

MMLSpark - Microsoft Machine Learning for Apache Spark

  •    Scala

MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly scalable predictive and analytical models for large image and text datasets. MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.
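
For flavor, a short PySpark sketch in the spirit of the project's README examples (toy data; the API has since evolved, so treat these names as assumptions):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from mmlspark import TrainClassifier  # MMLSpark's high-level trainer

spark = SparkSession.builder.getOrCreate()

# Toy training data: two numeric features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.1, 0), (1.5, 0.2, 1), (0.3, 0.9, 0), (2.0, 0.1, 1)],
    ["feat1", "feat2", "label"],
)

# TrainClassifier wraps a plain Spark ML estimator and handles featurization.
model = TrainClassifier(model=LogisticRegression(), labelCol="label").fit(train)
model.transform(train).show()
```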

spark-nlp - Natural Language Understanding Library for Apache Spark.

  •    Jupyter

John Snow Labs Spark-NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. The library has been uploaded to the spark-packages repository: https://spark-packages.org/package/JohnSnowLabs/spark-nlp
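
A minimal pipeline sketch using the library's DocumentAssembler and Tokenizer annotators (details may vary by version):

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from pyspark.ml import Pipeline

# Starts a Spark session with the spark-nlp package on the classpath.
spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokens = Tokenizer().setInputCols(["document"]).setOutputCol("token")

df = spark.createDataFrame([("Spark NLP scales nicely.",)], ["text"])
Pipeline(stages=[document, tokens]).fit(df).transform(df).show(truncate=False)
```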

Optimus - :truck: Agile Data Science Workflows made easy with Python and Spark.

  •    Python

Optimus is the missing framework for profiling, cleaning, processing and doing ML in a distributed fashion using Apache Spark (PySpark). You can go to the 10 minutes to Optimus notebook, where you can find the basics to start working.
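
A rough sketch of the workflow based on the Optimus docs; the load helper and .cols accessor are assumptions, as the API differs across versions:

```python
from optimus import Optimus

op = Optimus(master="local[*]")  # creates/starts the underlying Spark session

# Load a CSV (placeholder path) and clean a column via the .cols accessor.
df = op.load.csv("data.csv")
df = df.cols.trim("name").cols.lower("name")

df.table()  # pretty-printed preview in a notebook
```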

python_mozetl - ETL jobs for Firefox Telemetry

  •    Python

This repository is a collection of ETL jobs for Firefox Telemetry. Jobs committed to python_mozetl can be scheduled via Airflow or ATMO. We provide a testing suite and code review, which make your job more maintainable. Centralizing our jobs in one repository allows for code reuse and easier collaboration.

data-analytics-machine-learning-big-data - Slides, code and more for my class: Data Analytics and Machine Learning on Big Data

  •    Jupyter

If you want to install and run everything locally, here are the best tutorials I've found for getting Python and Spark running on your computer. In order to visualize the decision trees in Jupyter, you will need to install Graphviz as well as its Python package.
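
To illustrate the Graphviz step with a generic scikit-learn tree (not necessarily the class's exact code; requires the Graphviz binaries plus pip install graphviz scikit-learn):

```python
import graphviz
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3).fit(iris.data, iris.target)

# export_graphviz emits DOT source; graphviz.Source renders it inline in Jupyter.
dot = export_graphviz(tree, feature_names=iris.feature_names, filled=True)
graphviz.Source(dot)
```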

pyspark-for-data-processing - Code for my presentation: Using PySpark to Process Boat Loads of Data

  •    Jupyter

Download the slides from Slideshare. This code uses the Hazardous Air Pollutants dataset from Kaggle.
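
A hedged sketch of loading that dataset with PySpark (the file name matches the Kaggle export but is a placeholder here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hap-demo").getOrCreate()

# Read the CSV with a header row and schema inference.
pollutants = spark.read.csv("epa_hap_daily_summary.csv", header=True, inferSchema=True)
pollutants.printSchema()
print(pollutants.count())
```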

pyspark-notebook - Pyspark Notebook With Docker

  •    Python

Run the notebook with docker-compose: it keeps your arguments and settings in a single file and runs everything together in an isolated environment.
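
A hypothetical docker-compose.yml along these lines, using the public jupyter/pyspark-notebook image (the repository's own file may differ):

```yaml
version: "3"
services:
  pyspark:
    image: jupyter/pyspark-notebook   # public Docker Hub image
    ports:
      - "8888:8888"                   # Jupyter's default port
    volumes:
      - ./notebooks:/home/jovyan/work # keep notebooks on the host
```

Running docker-compose up then starts the notebook server on port 8888.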

PySparkGeoAnalysis - :globe_with_meridians: Interactive Workshop on GeoAnalysis using PySpark

  •    Jupyter

This workshop will introduce you to Apache Spark via the exciting domain of Geospatial Analysis.
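
To give a flavor of the domain, here is an illustrative point-in-polygon test with a Shapely UDF (not taken from the workshop; Shapely must be installed on the workers):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import BooleanType
from shapely.geometry import Point, Polygon

spark = SparkSession.builder.appName("geo-sketch").getOrCreate()
area = Polygon([(0, 0), (0, 10), (10, 10), (10, 0)])  # toy bounding polygon

@udf(BooleanType())
def in_area(lon, lat):
    return area.contains(Point(lon, lat))

points = spark.createDataFrame([(2.0, 3.0), (12.0, 3.0)], ["lon", "lat"])
points.withColumn("in_area", in_area("lon", "lat")).show()
```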

jgit-spark-connector - jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis

  •    Scala

jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis. It is written in Scala and built on top of Apache Spark to enable rapid construction of custom analysis pipelines and processing of large numbers of Git repositories stored in HDFS in Siva file format. It is accessible via both the Scala and Python Spark APIs, and capable of running on large-scale distributed clusters.
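
A PySpark-side sketch assuming the project's sourced.engine entry point (the path is a placeholder):

```python
from pyspark.sql import SparkSession
from sourced.engine import Engine  # Python API shipped with the connector

spark = SparkSession.builder.appName("repo-analysis").getOrCreate()

# Point the engine at a directory of Siva files.
engine = Engine(spark, "/path/to/siva-files", "siva")

# Repositories, references and commits are exposed as DataFrames.
engine.repositories.references.head_ref.commits.show()
```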

devops-python-tools - DevOps CLI Tools for Hadoop, Spark, HBase, Log Anonymizer, Ambari Blueprints, AWS CloudFormation, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Elasticsearch, Solr, Travis CI, Pig, IPython - Python / Jython Tools

  •    Python

A few of the Big Data, NoSQL and Linux tools I've written over the years. All programs have --help to list the available options. For many more tools, see the DevOps Perl Tools and Advanced Nagios Plugins Collection repos, which contain many Hadoop, NoSQL, web and infrastructure tools and Nagios plugins.

Spark-practice - Apache Spark (PySpark) Practice on Real Data

  •    Jupyter

In this repo, I use Spark (PySpark) to look into a download log file in CSV format. The repo can be considered an introduction to the very basic functions of Spark and may be helpful for Spark beginners. Additionally, we use a real log file as sample data and try to cover some operations commonly used in daily work. If you would like to get to know more operations with minimal sample data, you can refer to a separate script I prepared, Basic Operations in PySpark.
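
A hedged sketch of the kind of basic operation covered: counting rows per key in a CSV download log (file and column names are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("log-practice").getOrCreate()

# Read the download log with a header row and schema inference.
logs = spark.read.csv("sample.csv", header=True, inferSchema=True)

# Count downloads per package and show the ten most frequent.
logs.groupBy("package").count().orderBy("count", ascending=False).show(10)
```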

learn-by-examples - Real-world Spark pipelines examples

  •    Scala

This repository serves as a base for learning Spark using examples from real-world data sets. learn-by-examples by Elias Abou Haydar and Maciej Szymkiewicz is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Based on a work at https://github.com/awesome-spark/learn-by-examples.

spark-gotchas - Spark Gotchas. A subjective compilation of Apache Spark tips and tricks

This work, excluding code examples, is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license. Accompanying code and code snippets are licensed under the MIT license.