This is a collection of IPython/Jupyter notebooks intended to train the reader on different Apache Spark concepts, from basic to advanced, using the Python language. If your language is R rather than Python, you may want to have a look at our R on Apache Spark (SparkR) notebooks instead. Additionally, if you are interested in an introduction to basic Data Science Engineering, you might find this series of tutorials interesting; there we explain different concepts and applications using Python and R.
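As a taste of the basic end of that range, a minimal PySpark sketch might look like this (the data and column names are made up for illustration):

```python
# A minimal, self-contained PySpark example of the kind of basic concept
# these notebooks cover (DataFrame creation, filtering, aggregation).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intro-example").getOrCreate()

df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)

# Filter and aggregate: the bread-and-butter DataFrame operations.
df.filter(F.col("age") > 30).agg(F.avg("age").alias("avg_age")).show()

spark.stop()
```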
spark pyspark data-analysis mllib ipython-notebook notebook ipython data-science machine-learning big-data bigdata

Linkis helps easily connect to various back-end computation/storage engines.
sql spark presto hive storage jdbc rest-api engine impala pyspark udf thrift-server resource-manager jobserver application-manager livy hive-table linkis context-service scriptis

Scriptis is for interactive data analysis with script development (SQL, PySpark, HiveQL), task submission (Spark, Hive), UDF, function, resource management and intelligent diagnosis. Script editor: supports multiple languages, auto-completion, syntax highlighting and SQL syntax error correction.
sql spark hive ide pyspark udf hue zeppelin hql hive-table resouce-management linkis errorcode

A curated list of awesome Apache Spark packages and resources. Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley's AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since. Spark provides an interface for programming entire clusters with implicit data parallelism and fault tolerance (Wikipedia 2017).
apache-spark pyspark awesome sparkr

Sparkmagic is a set of tools for interactively working with remote Spark clusters through Livy, a Spark REST server, in Jupyter notebooks. The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment. There are two ways to use sparkmagic; head over to the examples section for a demonstration of both modes of execution.
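A rough sketch of the magics-based mode in a plain IPython kernel might look like the cells below; the other mode uses the dedicated kernels and needs no magics, and the Livy endpoint you register through the widget is site-specific:

```python
# Cell 1 -- load the sparkmagic magics in a plain IPython kernel and open
# the widget for adding a Livy endpoint and creating a remote session.
%load_ext sparkmagic.magics
%manage_spark

# Cell 2 -- once a session exists, code in a %%spark cell runs on the remote
# cluster, where a `spark` session is already available.
%%spark
df = spark.range(1000)
df.count()
```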
spark kernel cluster livy magic sql-query pandas-dataframe jupyter pyspark kerberos notebook jupyter-notebook

MMLSpark provides a number of deep learning and data science tools for Apache Spark, including seamless integration of Spark Machine Learning pipelines with the Microsoft Cognitive Toolkit (CNTK) and OpenCV, enabling you to quickly create powerful, highly scalable predictive and analytical models for large image and text datasets. MMLSpark requires Scala 2.11, Spark 2.1+, and either Python 2.7 or Python 3.5+. See the API documentation for Scala and for PySpark.
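A hedged sketch of the kind of pipeline this enables, assuming the TrainClassifier wrapper shown in the project's examples (the toy DataFrame and column names below are made up):

```python
# Sketch of fitting a classifier with MMLSpark's TrainClassifier wrapper,
# which featurizes the non-label columns itself. Toy data for illustration.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from mmlspark import TrainClassifier

spark = SparkSession.builder.appName("mmlspark-example").getOrCreate()

train = spark.createDataFrame(
    [(0.0, 1.5, 0), (1.0, 0.2, 1), (0.3, 0.9, 0)],
    ["feat1", "feat2", "label"],
)

model = TrainClassifier(model=LogisticRegression(), labelCol="label").fit(train)
model.transform(train).show()
```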
machine-learning spark cntk pyspark azure microsoft-machine-learning microsoft ml

John Snow Labs Spark-NLP is a natural language processing library built on top of Apache Spark ML. It provides simple, performant and accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. This library has been uploaded to the spark-packages repository: https://spark-packages.org/package/JohnSnowLabs/spark-nlp.
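A minimal sketch of annotating text with one of the library's pretrained pipelines, assuming the explain_document_dl pipeline is available for download and spark-nlp is on the Spark classpath:

```python
# Sketch of annotating a sentence with a Spark NLP pretrained pipeline.
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# sparknlp.start() returns a SparkSession configured for the library.
spark = sparknlp.start()

pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("John Snow Labs builds NLP tools on Apache Spark.")

# The available keys depend on the pipeline chosen.
print(result["entities"])  # e.g. named entities found in the sentence
print(result["lemma"])     # e.g. lemmatized tokens
```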
nlp nlu natural-language-processing natural-language-understanding spark spark-ml pyspark machine-learning named-entity-recognition sentiment-analysis lemmatizer spell-checker tokenizer entity-extraction stemmer part-of-speech-tagger annotation-framework

Optimus is the missing framework to profile, clean, process and do ML in a distributed fashion using Apache Spark (PySpark). You can go to the 10 minutes to Optimus notebook, where you can find the basics to get started.
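As a rough, hedged sketch of what an Optimus cleaning workflow looks like, assuming the 2.x-style API with op.load.csv and the df.cols accessor (the file path and column name are placeholders; the 10 minutes to Optimus notebook shows the current interface):

```python
# Rough sketch of an Optimus cleaning workflow, assuming the 2.x-style API.
# Path and column names are placeholders.
from optimus import Optimus

op = Optimus()  # creates/wraps a SparkSession under the hood

df = op.load.csv("data/people.csv")

# Chain a couple of typical column-level cleaning operations.
cleaned = df.cols.lower("name").cols.remove_accents("name")

cleaned.show()  # Optimus DataFrames are Spark DataFrames underneath
```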
spark pyspark data-wrangling bigdata big-data-cleaning data-science cleansing data-cleansing data-cleaner apache-spark data-transformation

This repository is a collection of ETL jobs for Firefox Telemetry. Jobs committed to python_mozetl can be scheduled via Airflow or ATMO. We provide a testing suite and code review, which makes your job more maintainable. Centralizing our jobs in one repository allows for code reuse and easier collaboration.
firefox-telemetry etl pyspark

If you want to install and run everything on your own computer, here are the best tutorials I've found for getting Python and Spark running locally. In order to visualize the decision trees in Jupyter, you will need to install Graphviz as well as its Python package.
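A small sketch of the kind of model those notebooks build, using plain pyspark.ml on made-up toy data; the actual Graphviz rendering step is left to the tutorial's own notebooks:

```python
# Train a decision tree with pyspark.ml and inspect it. The toy data is
# made up; rendering the tree requires the Graphviz binaries plus the
# `graphviz` Python package mentioned above.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("tree-example").getOrCreate()

train = spark.createDataFrame(
    [
        (Vectors.dense([0.0, 1.0]), 0.0),
        (Vectors.dense([1.0, 0.0]), 1.0),
        (Vectors.dense([1.0, 1.0]), 1.0),
        (Vectors.dense([0.0, 0.0]), 0.0),
    ],
    ["features", "label"],
)

model = DecisionTreeClassifier(featuresCol="features", labelCol="label").fit(train)

# Text form of the learned tree; the notebooks presumably turn this into a
# DOT graph for Graphviz to render.
print(model.toDebugString)
```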
big-data machine-learning jupyter-notebook graphviz data-exploration pyspark mllib

Download the slides from SlideShare. This code uses the Hazardous Air Pollutants dataset from Kaggle.
spark pyspark data-science machine-learning analytics

Run your containers with docker-compose. It helps to keep your arguments and settings in a single file and run everything together in an isolated environment.
docker docker-image pyspark-notebook spark pyspark docker-compose notebook apache-spark bigdata python-notebook jupyter-notebook

This workshop will introduce you to Apache Spark via the exciting domain of geospatial analysis.
pyspark geospatial-analysis spark docker

Tested with Spark 2.1.0 - 2.3.0 in combination with Python 2.7 and/or 3.5.
spark warc-files wet commoncrawl sparksql pyspark wat-files

Replicates a typical Kafka stack using docker-compose.
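A hedged sketch of consuming such a stack from Spark Structured Streaming; the broker address and topic name are assumptions, and the spark-sql-kafka connector package must be on the classpath:

```python
# Sketch of reading a Kafka topic with Spark Structured Streaming.
# Broker address and topic name are assumptions; the
# org.apache.spark:spark-sql-kafka-0-10 package must be available.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-example").getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")  # assumed broker
    .option("subscribe", "tweets")                         # assumed topic
    .load()
)

# Kafka delivers raw bytes; cast the value to a string for inspection.
messages = stream.select(F.col("value").cast("string").alias("message"))

query = messages.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```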
kafka spark twitter docker-compose avro kafka-connect pyspark

An open-source platform for managing and analyzing web archives.
spark hadoop webarchives analysis apache-spark digital-humanities pyspark dataframe

jgit-spark-connector is a library for running scalable data retrieval pipelines that process any number of Git repositories for source code analysis. It is written in Scala and built on top of Apache Spark to enable rapid construction of custom analysis pipelines and processing of large numbers of Git repositories stored in HDFS in the Siva file format. It is accessible via both the Scala and Python Spark APIs, and is capable of running on large-scale distributed clusters.
spark pyspark git datasource

A few of the Big Data, NoSQL and Linux tools I've written over the years. All programs have --help to list the available options. For many more tools, see the DevOps Perl Tools and Advanced Nagios Plugins Collection repos, which contain many Hadoop, NoSQL, web and infrastructure tools and Nagios plugins.
ambari cloudformation hbase json avro parquet spark pyspark travis-ci pig elasticsearch solr xml hadoop hdfs dockerhub docker aws

In this repo, I use Spark (PySpark) to look into a download log file in CSV format. This repo can be considered an introduction to the very basic functions of Spark, and may be helpful for those who are new to Spark. Additionally, we use a real log file as sample data in this tutorial and try to cover some operations commonly used in daily work. If you would like to learn more operations with minimal sample data, you can refer to a separate script I prepared, Basic Operations in PySpark.
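A minimal sketch of the kind of operations covered, using a placeholder path and assumed column names rather than the repo's actual log schema:

```python
# Read a CSV log, filter rows, and aggregate: the basic operations covered.
# The path and column names are placeholders, not the real log schema.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("log-analysis").getOrCreate()

logs = spark.read.csv("downloads.csv", header=True, inferSchema=True)
logs.printSchema()

# Count downloads per day for successful requests (column names assumed).
(
    logs.filter(F.col("status") == 200)
        .groupBy("date")
        .count()
        .orderBy(F.col("count").desc())
        .show(10)
)
```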
spark pyspark

[UNMAINTAINED] Complex network link prediction based on PySpark and MySQL.
link-prediction spark pyspark network