PySpark: Installation, RDDs, MLlib, Broadcast and Accumulator


Apache Spark, the well-known parallel computing engine for processing massive data sets, is written in the Scala programming language. The Apache foundation offers a tool named PySpark to support Python in Spark. PySpark allows us to use RDDs in the Python programming language through a library called Py4j. This article provides an introduction to PySpark, RDDs, MLlib, Broadcast and Accumulator.

The PySpark shell connects the Python API to the Spark core and initializes the SparkContext, which is what makes it so powerful. Most data scientists and programmers prefer Python for its huge set of libraries, so this integration is a ready-made solution for them: their work revolves around millions of unstructured records that must be analyzed for prediction, pattern analysis, and other tasks.

The pace of technology is connecting devices that produce billions of data points every day. This data drives innovation in business, healthcare, retail, FinTech, scientific research, and thousands of other real-world scenarios, from critical decision making to accurate prediction. This is the main reason behind the increased demand for data scientists and engineers.

Apache Spark certification training courses, blogs, videos, tutorials, and various other resources are accessible over the web and will help you explore this technology in depth so that you can uncover the hidden potential of data.

Now let’s explore PySpark installation, RDDs, the machine learning library (MLlib), and broadcast and accumulator variables in brief:

PySpark Setup and Installation:

1). First, install Java and Scala on your system.

2). Download Spark (which bundles PySpark) for your Hadoop version from the Apache Spark download page.

3). Extract the tar file.

> tar -xvf Downloads/spark-2.4.0-bin-hadoop2.7.tgz

4). A directory named spark-2.4.0-bin-hadoop2.7 will be created. Set the Spark and Py4j paths with the commands below:

> export SPARK_HOME=/home/hadoop/spark-2.4.0-bin-hadoop2.7

> export PATH=$PATH:/home/hadoop/spark-2.4.0-bin-hadoop2.7/bin

> export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.3-src.zip:$PYTHONPATH

> export PATH=$SPARK_HOME/python:$PATH

(Note: there must be no spaces around the = sign, and the py4j version in PYTHONPATH should match the one actually shipped in $SPARK_HOME/python/lib.)

You can also set these environment variables globally: keep them in the .bashrc file and execute the command below:

# source .bashrc 

5). Since we are done with the environment setup, it’s time to invoke the PySpark shell from the Spark directory. Run the command:

# ./bin/pyspark 

It will initiate the PySpark shell. Now you are good to go and can start working with the APIs.
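
For example, once the shell is up you can run a quick computation with the preconfigured SparkContext, which is available as sc:

>>> nums = sc.parallelize([1, 2, 3, 4])
>>> nums.map(lambda x: x * x).collect()
[1, 4, 9, 16]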

 

Public classes in PySpark:

Before going further, let’s understand the public classes in PySpark:

 

  • SparkContext: the key entry point for Spark functionality.
  • StorageLevel: flags controlling RDD storage (memory, disk, replication).
  • Broadcast: a read-only variable reused across multiple tasks.
  • Accumulator: an add-only shared variable.
  • SparkFiles: access to files shipped with the job.
  • TaskContext: status and other information about the running task.
  • SparkConf: configuration of a Spark application.
  • RDD: the basic abstraction in Spark.
  • RDDBarrier: wraps an RDD under a barrier stage for execution.
  • BarrierTaskInfo: information about a barrier task.
  • BarrierTaskContext: extra information and tooling for barrier execution.
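
As a minimal sketch of how two of these classes fit together, a standalone script typically builds a SparkConf and passes it to a SparkContext (the application name and master URL below are placeholders):

from pyspark import SparkConf, SparkContext

# Configure the application, then create the single entry point to Spark
conf = SparkConf().setAppName("DemoApp").setMaster("local[2]")
sc = SparkContext(conf=conf)
print(sc.version)
sc.stop()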

 

Resilient Distributed Dataset (RDD):

RDDs are immutable elements that are executed and operated on across the various nodes of a cluster to perform parallel processing. Being immutable, they cannot be altered once created. They are also fault tolerant: in case of failure, they recover automatically. You can perform various operations on an RDD to accomplish a task. These operations fall into two categories: transformations and actions.

Transformation:

These are operations that create a new RDD from an existing one, such as map, groupBy, and filter.

Action:

These operations instruct the Spark engine to perform the computation and send the result back to the driver.
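
A minimal sketch makes the distinction concrete; the transformation is lazy, so nothing actually runs until the action is called:

from pyspark import SparkContext

sc = SparkContext("local", "RDDDemo")
# Transformation: defines a new RDD, but no computation happens yet
words = sc.parallelize(["spark", "python", "pyspark", "scala"])
long_words = words.filter(lambda w: len(w) > 5)
# Action: triggers the computation and returns the result to the driver
print(long_words.collect())   # ['python', 'pyspark']
sc.stop()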

Broadcast and Accumulator:

Apache Spark uses shared variables to complete parallel processing tasks. When the driver hands a task over for execution, the shared variables are distributed to each node of the cluster. Spark supports two kinds of shared variables: broadcast variables and accumulators. Let’s explore them:

Broadcast:

The role of a broadcast variable is to copy data to and cache it on every node of the cluster, instead of shipping it with each task. A script that uses a broadcast variable is submitted with the command:

$SPARK_HOME/bin/spark-submit broadcast.py 
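
The article does not list broadcast.py itself; a minimal sketch of what such a script could contain:

from pyspark import SparkContext

sc = SparkContext("local", "BroadcastDemo")
# The broadcast variable is cached once per node instead of being
# shipped with every task; code reads it through .value
words_new = sc.broadcast(["scala", "java", "hadoop", "spark"])
print("Stored data -> %s" % (words_new.value))
sc.stop()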

Accumulator:

Accumulators aggregate data through commutative and associative operations. The class is:

class pyspark.Accumulator(aid, value, accum_param) 

Here, aid, value, and accum_param identify the accumulator, store its data, and define how values are added. The accumulator’s value is only readable in the driver program; tasks can only add to it. The command for running a script with an accumulator variable is:

$SPARK_HOME/bin/spark-submit accumulator.py
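
As with broadcast.py, the script itself is not shown here; a minimal sketch could look like this:

from pyspark import SparkContext

sc = SparkContext("local", "AccumulatorDemo")
# Tasks running on the workers may only add to the accumulator;
# the driver reads the result through .value
num = sc.accumulator(0)
sc.parallelize([1, 2, 3, 4]).foreach(lambda x: num.add(x))
print("Accumulated value -> %d" % num.value)   # 10
sc.stop()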

MLlib:

MLlib is the machine learning API offered by Spark. The API is available in Python through PySpark, letting users leverage various kinds of algorithms through the following modules:

mllib.classification:

It provides methods for binary classification, multiclass classification, and regression analysis. It supports common classification algorithms such as Random Forest, Decision Tree, and Naive Bayes.
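
For instance, a Naive Bayes model can be trained on labeled points (the tiny data set here is made up purely for illustration):

from pyspark import SparkContext
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext("local", "ClassificationDemo")
# Each LabeledPoint pairs a class label with a feature vector
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])
model = NaiveBayes.train(data)
print(model.predict([1.0, 0.0]))   # 1.0
sc.stop()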

mllib.fpm:

FPM stands for frequent pattern mining: finding frequent items, substructures, or subsequences. It is often a primary step in analyzing a massive data set.
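
A short sketch using the FP-growth algorithm from this module, with made-up transactions:

from pyspark import SparkContext
from pyspark.mllib.fpm import FPGrowth

sc = SparkContext("local", "FPMDemo")
transactions = sc.parallelize([
    ["bread", "milk"],
    ["bread", "butter"],
    ["bread", "milk", "butter"],
])
# Keep itemsets that appear in at least half of the transactions
model = FPGrowth.train(transactions, minSupport=0.5, numPartitions=1)
for itemset in model.freqItemsets().collect():
    print(itemset)
sc.stop()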

spark.mllib:

spark.mllib supports model-based collaborative filtering, in which users and items are described by a small set of latent factors that can be used to predict missing entries. It uses the Alternating Least Squares (ALS) algorithm to learn these latent factors.

mllib.linalg:

It provides utilities for dealing with linear algebra, such as dense and sparse vectors and matrices.
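
For example, it can build dense and sparse vectors and compute with them:

from pyspark.mllib.linalg import Vectors

# A dense vector stores every value; a sparse one stores (index, value) pairs
dense = Vectors.dense([1.0, 0.0, 3.0])
sparse = Vectors.sparse(3, [0, 2], [1.0, 3.0])
print(dense.dot(sparse))   # 10.0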

mllib.recommendation:

Recommendation engines use collaborative filtering to deliver their results. The strategy aims at filling in the missing entries of a user-item association matrix.
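
A minimal sketch with a few made-up (user, product, rating) triples:

from pyspark import SparkContext
from pyspark.mllib.recommendation import ALS, Rating

sc = SparkContext("local", "RecommendationDemo")
ratings = sc.parallelize([
    Rating(1, 1, 5.0),
    Rating(1, 2, 1.0),
    Rating(2, 1, 4.0),
])
model = ALS.train(ratings, rank=10, iterations=5)
# Predict the missing entry for user 2 and product 2
print(model.predict(2, 2))
sc.stop()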

mllib.clustering:

Clustering is an unsupervised learning problem in which subsets of entries are grouped together on the basis of some notion of similarity.
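
For example, k-means can group some made-up 2-D points into two clusters:

from pyspark import SparkContext
from pyspark.mllib.clustering import KMeans

sc = SparkContext("local", "ClusteringDemo")
points = sc.parallelize([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
model = KMeans.train(points, k=2, maxIterations=10)
print(model.clusterCenters)   # one center per group of points
sc.stop()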

mllib.regression:

The main objective of regression is to identify the dependencies and relationships among variables. A similar interface is used for both logistic and linear regression.
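
A minimal linear regression sketch on made-up data where y is roughly 2x:

from pyspark import SparkContext
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

sc = SparkContext("local", "RegressionDemo")
data = sc.parallelize([
    LabeledPoint(0.0, [0.0]),
    LabeledPoint(2.0, [1.0]),
    LabeledPoint(4.0, [2.0]),
])
model = LinearRegressionWithSGD.train(data, iterations=100, step=0.1)
print(model.predict([3.0]))   # close to 6.0
sc.stop()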

Thus, we can see how useful the PySpark library is. It helps us perform the most critical data-oriented tasks, such as classification, regression, predictive modeling, and pattern analysis, for modern businesses, along with other computations used in making critical decisions. It can serve both real-time and batch workloads efficiently. And, best of all, Spark can run up to 100 times faster than Hadoop MapReduce because it processes data in main memory, avoiding the extra effort of disk-oriented input-output.




   
