PySpark: Installation, RDDs, MLlib, Broadcast and Accumulator
We know that Apache Spark, the most popular parallel computing framework for processing massive data sets, is written in the Scala programming language. The Apache foundation offers a tool, named PySpark, that supports Python in Spark. PySpark allows us to use RDDs from the Python programming language through a library called Py4j. This article provides an introduction to PySpark, RDDs, MLlib, Broadcast and Accumulator.
The PySpark shell connects the Python APIs with the Spark core and initializes the SparkContext, which is what makes it so powerful. Most data scientists and programmers prefer Python for its huge set of libraries, so this integration is a ready-made solution for them: their work revolves around millions of unstructured records that require analysis for prediction, pattern analysis, and other tasks.
The pace of technology is connecting devices that produce billions of data points every day. This data is driving innovation in business, healthcare, retail, FinTech, scientific research, and thousands of other real-world scenarios, from critical decision making to accurate prediction. This is the main reason behind the increased demand for data scientists and engineers.
Apache Spark certification training courses, blogs, videos, tutorials, and various other sources accessible over the web will help you explore this technology in depth so that you can uncover the hidden potential of data.
Now let’s explore PySpark installation, RDDs, the MLlib machine learning library, and broadcast and accumulator variables in brief:
PySpark Setup and Installation:
1). First, install Java and Scala on your system.
2). Download the Spark package that matches your Hadoop version from the Apache Spark downloads page.
3). Extract the tar file.
> tar -xvf Downloads/spark-2.4.0-bin-hadoop2.7.tgz
4). A directory named spark-2.4.0-bin-hadoop2.7 will be created. Set up the Spark and Py4j paths with the commands below:
> export SPARK_HOME=/home/hadoop/spark-2.4.0-bin-hadoop2.7
> export PATH=$PATH:/home/hadoop/spark-2.4.0-bin-hadoop2.7/bin
> export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH
> export PATH=$SPARK_HOME/python:$PATH
(Check the exact py4j version shipped under $SPARK_HOME/python/lib and adjust the path if it differs.)
You can also set these environment variables globally. Keep them in the .bashrc file and execute the command below:
# source .bashrc
5). Since we are done with the environment setup, it’s time to invoke the PySpark shell from the Spark directory. Write the command:
> ./bin/pyspark
It will initiate the PySpark shell. Now you are good to go for implementing cool APIs.
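As a quick sanity check, you can run a one-liner against the sc SparkContext that the shell creates for you (a minimal sketch; the numbers are arbitrary):
>>> sc.parallelize([1, 2, 3, 4]).map(lambda x: x * 2).sum()
20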
Public classes in PySpark:
Before going further, let’s understand the public classes in PySpark:
- SparkContext: the main entry point for Spark functionality.
- StorageLevel: controls how an RDD is persisted (cache level: memory, disk, replication).
- Broadcast: a read-only variable that is reused across multiple tasks.
- Accumulator: an add-only shared variable that tasks can only add to.
- SparkFiles: resolves paths to files distributed with a job.
- TaskContext: status and other information about the currently running task.
- SparkConf: for configuration of Spark.
- RDD: the basic abstraction in Spark.
- RDDBarrier: wraps an RDD in a barrier stage, forcing Spark to launch its tasks together.
- BarrierTaskInfo: information about a barrier task.
- BarrierTaskContext: extra information and tooling for barrier execution.
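To make the two most common of these concrete, here is a minimal sketch that configures and starts a SparkContext with SparkConf (the application name and the local master are arbitrary choices for illustration):

from pyspark import SparkConf, SparkContext

# Configure the application; "local[2]" runs Spark locally with two worker threads.
conf = SparkConf().setAppName("PublicClassesDemo").setMaster("local[2]")
sc = SparkContext(conf=conf)

print(sc.version)  # confirm the context is alive
sc.stop()          # release the context when done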
Resilient Distributed Dataset (RDD):
RDDs are immutable elements that are executed and operated on across the various nodes of a cluster to perform parallel processing. Since they are immutable, they cannot be altered once created. They are also fault tolerant: in case of failure, they recover automatically. You can perform various operations on RDDs to accomplish a task. There are two kinds of RDD operations: transformations and actions.
Transformations: These are operations for creating a new RDD from an existing one, such as map, groupBy, and filter.
Actions: These operations instruct the Spark engine to perform the computation and send the result back to the driver, for example collect and count.
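Here is a minimal sketch that chains transformations and finishes with actions (the data set and threshold are arbitrary):

from pyspark import SparkContext

sc = SparkContext("local", "RDDDemo")

numbers = sc.parallelize([1, 2, 3, 4, 5])  # create an RDD from a Python list
squares = numbers.map(lambda x: x * x)     # transformation: returns a new RDD
big = squares.filter(lambda x: x > 5)      # transformation: returns a new RDD

print(big.collect())  # action: [9, 16, 25]
print(big.count())    # action: 3
sc.stop()

Note that nothing is computed until an action such as collect or count is called; transformations are evaluated lazily.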
Broadcast and Accumulator:
Apache Spark uses shared variables to complete parallel processing tasks. When the driver hands a task over for execution, shared variables are distributed to each node of the cluster so the task can be accomplished. Spark exposes shared variables in two ways: Broadcast and Accumulator. Let’s explore them:
The role of a broadcast variable is to copy and cache data on all nodes of the cluster. Broadcast variables are cached on each machine and reused across tasks rather than shipped with every task. For example:
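A minimal sketch of broadcasting a small lookup list and reading it back (the words are arbitrary):

from pyspark import SparkContext

sc = SparkContext("local", "BroadcastDemo")

# Ship a read-only list to every node of the cluster once.
words = sc.broadcast(["scala", "java", "hadoop", "spark"])

print(words.value)     # the full broadcast value
print(words.value[3])  # -> "spark"
sc.stop()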
Accumulators aggregate data through commutative and associative operations. You can use them via the class:
class pyspark.Accumulator(aid, value, accum_param)
Here, we are using an accumulator variable. It has attributes aid, value, and accum_param for storing data and returning the accumulator’s value, and that value is only readable in the driver program; tasks can only add to it. An example of running an accumulator variable:
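A minimal sketch that sums an RDD through an accumulator (the numbers are arbitrary):

from pyspark import SparkContext

sc = SparkContext("local", "AccumulatorDemo")

total = sc.accumulator(0)  # add-only shared variable, initialized to 0

# Each task adds to the accumulator; only the driver can read its value.
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: total.add(x))

print(total.value)  # -> 10
sc.stop()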
MLlib:
MLlib is a machine learning API offered by Spark. This API is available in Python via PySpark, enabling users to leverage various kinds of algorithms, such as:
- mllib.classification: It provides methods for binary classification, multiclass classification, and regression analysis. It leverages common classification algorithms like Random Forest, Decision Tree, and Naive Bayes.
- mllib.fpm: FPM stands for frequent pattern mining, which is mining frequent items, substructures, or subsequences. It is often the primary step in the analysis of a massive data set.
- spark.mllib: It relies on model-based collaborative filtering. Here, users and items are described by a small set of latent factors, which assist in predicting missing entries. It is based on the Alternating Least Squares (ALS) algorithm for learning the latent factors.
- mllib.linalg: Utilities for dealing with linear algebra problem sets.
- mllib.recommendation: Recommendation engines leverage collaborative filtering to deliver results. The strategy aims at filling in the missing entries of a user-item association matrix.
- mllib.clustering: Clustering falls under unsupervised learning. Here, subsets of entries are grouped together on the basis of some notion of similarity.
- mllib.regression: The main objective of regression is to identify the dependencies and relations among variables. A similar interface is used for logistic and linear regression.
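As a small illustration of the classification package, here is a minimal sketch that trains a model with mllib.classification (the toy data set and the choice of LogisticRegressionWithLBFGS are illustrative assumptions, not prescribed by the article):

from pyspark import SparkContext
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext("local", "MLlibDemo")

# Tiny hand-made training set: a label followed by two features.
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
    LabeledPoint(0.0, [0.5, 1.5]),
    LabeledPoint(1.0, [1.5, 0.5]),
])

model = LogisticRegressionWithLBFGS.train(data)
print(model.predict([1.5, 0.5]))  # expect a label of 1
sc.stop()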
Thus, we can see how useful the PySpark library is. It helps us perform the most critical data-oriented tasks for modern businesses, like classification, regression, predictive modeling, and pattern analysis, along with other typical computations used in making critical decisions. It can serve both real-time and batch processing, handling workloads efficiently. And the best part is that Spark can run up to 100 times faster than Hadoop MapReduce because it processes data in main memory, which removes the extra effort of disk-oriented input-output tasks.