PySpark: Installation, RDDs, MLib, Broadcast and Accumulator


We know that Apache Spark, the well-known parallel computing framework for processing massive data sets, is written in the Scala programming language. The Apache Software Foundation offers a tool named PySpark to support Python in Spark. PySpark lets us work with RDDs from Python through a library called Py4j. This article provides an introduction to PySpark, RDDs, MLlib, broadcast variables and accumulators.

The PySpark shell connects the Python API to the Spark core by initiating a SparkContext, which is what makes it so powerful. Many data scientists and programmers prefer Python for its huge ecosystem of libraries, so this integration is a natural fit: their work revolves around millions of unstructured records that need analysis for prediction, pattern mining and other tasks.

Technology now connects devices that produce billions of data points every day. This data drives innovation in business, healthcare, retail, FinTech, scientific research and thousands of other real-world scenarios, from critical decision making to accurate predictions. It is the main reason behind the increased demand for data scientists and engineers.

Apache Spark certification courses, blogs, videos, tutorials and various other sources are accessible over the web and will help you explore this technology in depth so that you can uncover the hidden potential of data.

Now let’s explore PySpark installation, RDDs, machine learning library, and broadcast and accumulators in brief:

PySpark Set Up Installation:

1). First, install Java and Scala on your system.

2). Depending upon your Hadoop version, download the matching Spark release from the Apache Spark downloads page.

3). Extract the tar file.

> tar -xvf Downloads/spark-2.4.0-bin-hadoop2.7.tgz

4). A directory named spark-2.4.0-bin-hadoop2.7 will be created. Set the Spark and Py4j paths with the commands below:

> export SPARK_HOME=/home/hadoop/spark-2.4.0-bin-hadoop2.7

> export PATH=$PATH:/home/hadoop/spark-2.4.0-bin-hadoop2.7/bin

> export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.7-src.zip:$PYTHONPATH

> export PATH=$SPARK_HOME/python:$PATH

You can also set these environment variables globally: keep them in your .bashrc file and execute the command below:

> source .bashrc

5). With the environment set up, it's time to invoke the PySpark shell from the Spark directory. Write the command:

> ./bin/pyspark

It will launch the PySpark shell. Now you are good to go for trying out the APIs.

 

Public classes in PySpark:

Before going further let’s understand the public classes in PySpark:

 

  • SparkContext: the main entry point for Spark functionality.
  • StorageLevel: flags for controlling the storage (persistence) level of an RDD.
  • Broadcast: a read-only variable reused across multiple tasks.
  • Accumulator: an add-only shared variable.
  • SparkFiles: access files shipped with a job.
  • TaskContext: status and other information about the running task.
  • SparkConf: configuration for a Spark application.
  • RDD: the basic abstraction in Spark.
  • RDDBarrier: wraps an RDD in a barrier stage for barrier execution.
  • BarrierTaskInfo: information about a barrier task.
  • BarrierTaskContext: extra contextual information and tooling for barrier execution.

 

Resilient Distributed Dataset (RDD):

RDDs are immutable collections of elements that are partitioned and operated on across the nodes of a cluster to enable parallel processing. Being immutable, they cannot be altered once created. They are also fault tolerant: in case of failure, lost data is recomputed automatically. You can perform various operations on an RDD to accomplish a task, and these operations come in two kinds: transformations and actions.

Transformation:

These operations create a new RDD from an existing one, such as map, filter and groupBy.

Action:

These operations instruct the Spark engine to carry out the computation and send the result back to the driver, such as collect, count and reduce.

Broadcast and Accumulator:

Apache Spark uses shared variables to complete parallel processing tasks. When the driver hands out a task for execution, shared variables are distributed to each node of the cluster. Spark provides two kinds of shared variables: broadcast variables and accumulators. Let's explore them:

Broadcast:

The role of a broadcast variable is to copy and cache data on all nodes of the cluster, so that every task reads a local copy instead of shipping the data with each task. A script using one is submitted like any other Spark job:

$SPARK_HOME/bin/spark-submit broadcast.py 

Accumulator:

Accumulators aggregate data through commutative and associative operations. The class behind them is:

class pyspark.Accumulator(aid, value, accum_param) 

Here, the accumulator variable has attributes aid, value, and accum_param for identifying the accumulator, storing its data, and defining how values are added. Its value is only readable in the driver program. The command for running an accumulator script is:

$SPARK_HOME/bin/spark-submit accumulator.py

MLlib:

MLlib is the machine learning API offered by Spark. It is available to Python through PySpark, letting users leverage various kinds of algorithms, grouped into modules such as:

mllib.classification:

It provides methods for binary classification, multiclass classification, and regression analysis. It covers common classification algorithms such as Random Forest, Decision Tree and Naive Bayes.

mllib.fpm:

FPM stands for frequent pattern mining: mining frequent items, substructures or subsequences. It is often the first step in the analysis of a massive data set.

spark.mllib:

spark.mllib supports model-based collaborative filtering. Here, users and items are described by a small set of latent factors, which assist in predicting missing entries. It is based on the Alternating Least Squares (ALS) algorithm for learning the latent factors.

mllib.linalg:

It is the utility module for dealing with linear algebra problems.

mllib.recommendation:

Recommendation engines leverage collaborative filtering to deliver results. The strategy aims to fill in the missing entries of a user-item association matrix.

mllib.clustering:

Clustering falls under unsupervised learning. Here, subsets of entries are grouped together on the basis of some notion of similarity.

mllib.regression:

The main objective of regression is to identify dependencies and relations among variables. A similar interface is used for logistic and linear regression.

Thus, we can see how useful the PySpark library is. It helps us perform the most critical data-oriented tasks, such as classification, regression, predictive modeling and pattern analysis, that modern businesses use in making critical decisions. It can serve both real-time and batch workloads efficiently. And the best part: by processing data in main memory, Spark can run up to 100 times faster than Hadoop MapReduce, cutting out the extra effort of disk-oriented input-output.



