PySpark is the Python interface to Apache Spark, developed by the Apache Spark community to let Python work with Spark. It interacts with Apache Spark through APIs written in Python and supports features like Spark SQL, Spark DataFrame, Spark Streaming, Spark Core, Spark MLlib, etc. It provides an interactive PySpark shell to analyze structured and semi-structured data in a distributed environment and process it through optimized APIs that can read data from various data sources. Under the hood, PySpark relies on the Py4J library, which lets users work with RDDs (Resilient Distributed Datasets) from the Python programming language. Python also offers many libraries that support big data processing and machine learning.
You can install PySpark from PyPI using the following command:
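```
pip install pyspark
```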
Following are the four main characteristics of PySpark:
In PySpark, RDD is an acronym that stands for Resilient Distributed Dataset. It is the core data structure of PySpark: a low-level object that is highly efficient at performing distributed tasks.
PySpark RDDs are elements that can run and operate on multiple nodes to perform parallel processing on a cluster. They are immutable elements, meaning that once you create an RDD, you cannot change it. RDDs are also fault-tolerant: in the case of any failure, they recover automatically. We can apply multiple operations on RDDs to achieve a certain task.
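A minimal sketch of creating an RDD and applying operations to it (the data and app name are illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "RDDExample")  # illustrative app name

# Create an RDD from a local Python list
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy; each one returns a new, immutable RDD
squares = numbers.map(lambda x: x * x)

# Actions trigger the actual computation
print(squares.collect())  # [1, 4, 9, 16, 25]

sc.stop()
```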
Following is a list of key advantages and disadvantages of PySpark:
Advantages of PySpark
Disadvantages of PySpark
PySpark is easy to learn and implement. It doesn't require expertise in many programming languages or databases; you can learn it easily if you already know a programming language and a framework. Before learning PySpark, you should gain some knowledge of Apache Spark and Python, as it will be very helpful for picking up the advanced concepts of PySpark.
In PySpark, every transformation generates a new RDD, whose data is split into partitions. Partitions use the HDFS API so that they are immutable, distributed, and fault-tolerant. Partitions are also aware of data locality.
Following are the key differences between an RDD, a DataFrame, and a DataSet:
RDD
DataFrame
DataSet
SparkContext acts as the entry point to any Spark functionality. When a Spark application runs, it starts the driver program, and the main function and SparkContext get initiated. After that, the driver program runs the operations inside the executors on worker nodes. In PySpark, SparkContext is known as PySpark SparkContext. It uses the Py4J library to launch a JVM and then creates a JavaSparkContext. In the PySpark shell, SparkContext is available by default as 'sc', so you don't need to create a new one.
The PySpark StorageLevel is used to control the storage of an RDD: it controls how and where the RDD is stored. PySpark StorageLevel decides whether the RDD is stored in memory, on disk, or both. It also specifies whether we need to replicate the RDD partitions or serialize the RDD.
Following is the code for PySpark StorageLevel:
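A minimal sketch, assuming an illustrative app name; the constructor arguments follow the order useDisk, useMemory, useOffHeap, deserialized, replication:

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "StorageLevelExample")  # illustrative app name
rdd = sc.parallelize(range(100))

# StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication=1)
rdd.persist(StorageLevel(True, True, False, False, 1))

# Predefined levels such as StorageLevel.MEMORY_ONLY or StorageLevel.DISK_ONLY can also be used
print(rdd.getStorageLevel())

sc.stop()
```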
Data cleaning is the process of preparing data by analyzing the data and removing or modifying data if it is incorrect, incomplete, irrelevant, duplicated, or improperly formatted.
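A minimal sketch of common cleaning steps on a PySpark DataFrame, using illustrative data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CleaningExample").getOrCreate()

df = spark.createDataFrame(
    [(1, "Alice", 30), (1, "Alice", 30), (2, None, -5)],
    ["id", "name", "age"],
)

cleaned = (
    df.dropDuplicates()            # remove duplicated rows
      .dropna(subset=["name"])     # drop rows with missing names
      .filter("age >= 0")          # drop rows with an invalid age
)
cleaned.show()
```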
PySpark SparkConf is mainly used when we have to set a few configurations and parameters to run a Spark application on the local machine or a cluster. In other words, PySpark SparkConf is used to provide the configuration needed to run a Spark application.
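A minimal sketch of building a configuration and passing it to a SparkContext (the app name, master, and memory setting are illustrative):

```python
from pyspark import SparkConf, SparkContext

# Illustrative configuration values
conf = SparkConf().setAppName("MyApp").setMaster("local[2]")
conf.set("spark.executor.memory", "1g")

sc = SparkContext(conf=conf)
print(sc.appName)

sc.stop()
```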
Different types of algorithms supported in PySpark are:
SparkCore is the general execution engine for the Spark platform and includes all of its functionality. It offers in-memory computing capabilities to deliver speed, a generalized execution model to support various applications, and Java, Scala, and Python APIs that make development easy.
The main responsibility of SparkCore is to perform all the basic I/O functions, scheduling, monitoring, etc. It is also responsible for fault recovery and effective memory management.
The key functions of SparkCore are:
PySpark lets users distribute their files to the cluster using sc.addFile, where sc is our default SparkContext. SparkFiles provides class methods, such as get(filename) and getRootDirectory(), to resolve the paths of files added through SparkContext.addFile(), as the sketch below shows.
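A minimal sketch, assuming an illustrative file name:

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext("local[*]", "SparkFilesExample")

# Distribute a local file to every node ("data.txt" is an illustrative name)
sc.addFile("data.txt")

# Resolve the absolute path of the distributed file on any node
print(SparkFiles.get("data.txt"))

# Root directory that holds all files added through addFile()
print(SparkFiles.getRootDirectory())

sc.stop()
```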
In PySpark, serialization is a process used for performance tuning on Spark. PySpark needs serializers because data is continuously sent or received over the network, to disk, or to memory. PySpark supports two types of serializers. They are as follows:
PySpark ArrayType is a collection data type that extends PySpark's DataType class, which is the superclass of all PySpark data types. An ArrayType contains only items of the same type. An instance of ArrayType can also be constructed with the ArrayType() method.
It accepts two arguments:
Example:
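A minimal sketch, assuming an array of strings whose elements may not be null:

```python
from pyspark.sql.types import ArrayType, StringType

# ArrayType(elementType, containsNull) - element type plus a nullability flag
array_of_strings = ArrayType(StringType(), False)

print(array_of_strings.elementType)   # the element type (StringType)
print(array_of_strings.containsNull)  # False
```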
The most frequently used Spark ecosystems are:
Just like Apache Spark, PySpark also provides a machine learning API known as MLlib. MLlib supports the following types of machine learning algorithms:
PySpark partitioning is a method of splitting a large dataset into smaller datasets based on one or more partition keys. It enhances execution speed because transformations on partitioned data run quicker: each partition's transformations are executed in parallel. PySpark supports both partitioning in memory (DataFrame) and partitioning on disk (file system). When we make a DataFrame from a file or table, PySpark creates the DataFrame in memory with a specific number of partitions based on specified criteria.
It also lets us partition on multiple columns using partitionBy() by passing the columns we want to partition by as arguments to this method.
Syntax:
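A minimal sketch of partitionBy() on write, using illustrative column names and an illustrative output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionExample").getOrCreate()

# Illustrative data
df = spark.createDataFrame(
    [("US", "CA", 1), ("US", "NY", 2), ("IN", "KA", 3)],
    ["country", "state", "value"],
)

# DataFrame.write.partitionBy(*cols) creates one sub-directory per distinct key combination
df.write.mode("overwrite").partitionBy("country", "state").parquet("/tmp/partitioned_output")
```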
In PySpark, it is recommended to have roughly 4x as many partitions as the number of cores available to the application in the cluster.
PySpark DataFrames are distributed collections of well-organized data. They are the equivalent of relational database tables and are organized into named columns. PySpark DataFrames are better optimized than plain R or Python data structures, and they can be created from different sources like Hive tables, structured data files, existing RDDs, external databases, etc.
The biggest advantage of the PySpark DataFrame is that its data is distributed across different machines in the cluster, and the operations performed on it run in parallel on all the machines. This makes it possible to handle large collections of structured or semi-structured data, ranging up to petabytes.
In PySpark, joins merge two DataFrames together; they let us combine two or more DataFrames.
INNER Join, LEFT OUTER Join, RIGHT OUTER Join, LEFT ANTI Join, LEFT SEMI Join, CROSS Join, and SELF Join are among the SQL join types PySpark supports. Following is the syntax of PySpark Join.
Syntax:
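The join() signature in the PySpark DataFrame API is:

```python
DataFrame.join(other, on=None, how=None)
```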
Parameter Explanation:
The join() procedure accepts the following parameters and returns a DataFrame:
Types of Join in PySpark DataFrame
| Join String | Equivalent SQL Join |
|---|---|
| inner | INNER JOIN |
| outer, full, fullouter, full_outer | FULL OUTER JOIN |
| left, leftouter, left_outer | LEFT JOIN |
| right, rightouter, right_outer | RIGHT JOIN |
| cross | |
| anti, leftanti, left_anti | |
| semi, leftsemi, left_semi | |
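A minimal sketch of an inner join, using illustrative data:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("JoinExample").getOrCreate()

emp = spark.createDataFrame([(1, "Alice", 10), (2, "Bob", 20)], ["emp_id", "name", "dept_id"])
dept = spark.createDataFrame([(10, "Sales"), (20, "HR")], ["dept_id", "dept_name"])

# Inner join on the common column "dept_id"
emp.join(dept, on="dept_id", how="inner").show()
```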
In PySpark, Parquet is a columnar file format supported by several data processing systems. Using Parquet files, Spark SQL can perform both read and write operations.
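A minimal sketch of writing and reading a Parquet file, using an illustrative path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ParquetExample").getOrCreate()

df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Write the DataFrame as a Parquet file ("/tmp/people.parquet" is an illustrative path)
df.write.mode("overwrite").parquet("/tmp/people.parquet")

# Read it back
spark.read.parquet("/tmp/people.parquet").show()
```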
Parquet's columnar storage provides the following advantages:
In PySpark, a cluster manager is a cluster-mode platform that helps Spark run by providing all the resources that the worker nodes need.
A Spark cluster manager ecosystem contains a master node and multiple worker nodes. With the help of the cluster manager, the master node provides the worker nodes with resources like memory and processor allocation, according to the nodes' requirements.
PySpark supports the following cluster manager types:
PySpark is faster than pandas because it supports the parallel execution of statements in a distributed environment. For example, PySpark computations can be executed across multiple cores and machines, which is not possible with pandas. This is the main reason why PySpark is faster than pandas.
The main difference between get(filename) and getRootDirectory() is that get(filename) is used to retrieve the correct path of a file that was added through SparkContext.addFile(), whereas getRootDirectory() is used to get the root directory that contains the files added through SparkContext.addFile().
In PySpark, SparkSession is the entry point to the application. In early versions of PySpark, SparkContext was used as the entry point; SparkSession has replaced it since PySpark version 2.0. From version 2.0 onwards, SparkSession acts as the starting point for accessing all of the PySpark functionality related to RDDs, DataFrames, Datasets, etc. It is also a unified API that replaces SQLContext, StreamingContext, HiveContext, and all the other contexts in PySpark.
SparkSession internally creates a SparkContext and a SparkConf according to the details provided to it. You can create a SparkSession by using the builder pattern.
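A minimal sketch of the builder pattern (the app name is illustrative):

```python
from pyspark.sql import SparkSession

# getOrCreate() returns an existing session if one is already running
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("SessionExample")   # illustrative app name
    .getOrCreate()
)

print(spark.sparkContext.appName)
spark.stop()
```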
Following is the list of key advantages of PySpark RDD:
Immutability: PySpark RDDs are immutable. Once you create them, you cannot modify them later; applying any transformation operation to an RDD produces a new RDD.
Fault Tolerance: PySpark RDDs provide fault tolerance. Whenever an operation fails, the lost data is automatically recovered from the other available partitions, which gives a seamless execution experience for PySpark applications.
Partitioning: When we create an RDD from any data, its elements are partitioned across the available cores by default.
Lazy Evaluation: PySpark RDDs follow lazy evaluation. Transformation operations are not performed as soon as they are encountered; they are stored in the DAG and evaluated once the first RDD action is found.
In-Memory Processing: PySpark RDDs help load data from disk into memory, and you can persist RDDs in memory to reuse the computations, as the sketch after this list shows.
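A minimal sketch of caching an RDD so that repeated actions reuse the in-memory data (the file path is illustrative):

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "CacheExample")

lines = sc.textFile("/tmp/big_file.txt")  # illustrative path
words = lines.flatMap(lambda line: line.split())

# Keep the RDD in memory after the first action computes it
words.cache()

print(words.count())             # triggers computation and caches the result
print(words.distinct().count())  # reuses the cached data

sc.stop()
```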
The common workflow of a Spark program can be described in the following steps:
We can implement machine learning in Spark by using MLlib. Spark provides a scalable machine learning library called MLlib. It is mainly used to make machine learning scalable and straightforward, with common learning algorithms and use cases like clustering, collaborative filtering, dimensionality reduction, etc.
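A minimal sketch using the DataFrame-based pyspark.ml API, with KMeans clustering shown as one example and illustrative data:

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("MLlibExample").getOrCreate()

# Illustrative two-dimensional points
df = spark.createDataFrame([(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)], ["x", "y"])

# Assemble the raw columns into a single feature vector column
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

# Cluster the points into two groups
model = KMeans(k=2, seed=1).fit(features)
model.transform(features).select("x", "y", "prediction").show()
```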
PySpark supports custom profilers. Custom profilers are used for building predictive models and for reviewing data to ensure that it is valid and fit for consumption. When we require a custom profiler, it has to define some of the following methods:
The Spark driver is the program that runs on the master node of a machine. It is mainly used to declare actions and transformations on RDDs of data.
The PySpark SparkJobInfo is used to get information about the Spark jobs that are in execution.
Following is the code for using the SparkJobInfo:
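The usual way to obtain SparkJobInfo objects is through the SparkContext's status tracker; a minimal sketch:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "JobInfoExample")
sc.parallelize(range(1000)).count()  # run a job so there is something to inspect

tracker = sc.statusTracker()
for job_id in tracker.getJobIdsForGroup():   # jobs in the default job group
    info = tracker.getJobInfo(job_id)        # a SparkJobInfo namedtuple (or None)
    if info is not None:
        print(info.jobId, info.stageIds, info.status)

sc.stop()
```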
The main task of Spark Core is to implement several vital functions such as memory management, fault tolerance, job monitoring, job scheduling, and communication with storage systems. It also contains additional libraries, built on top of the core, that are used for diverse workloads such as streaming, machine learning, and SQL.
The Spark Core is mainly used for the following tasks:
The PySpark SparkStageInfo is used to get information about the SparkStages available at that time. Following is the code used for SparkStageInfo:
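As with SparkJobInfo, stage information can be read from the status tracker; a minimal sketch:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "StageInfoExample")
sc.parallelize(range(1000)).map(lambda x: x * 2).count()

tracker = sc.statusTracker()
for job_id in tracker.getJobIdsForGroup():
    job = tracker.getJobInfo(job_id)
    if job is None:
        continue
    for stage_id in job.stageIds:
        stage = tracker.getStageInfo(stage_id)   # a SparkStageInfo namedtuple (or None)
        if stage is not None:
            print(stage.stageId, stage.name, stage.numTasks, stage.numCompletedTasks)

sc.stop()
```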
The Apache Spark execution engine is a graph (DAG) execution engine that lets users analyze massive data sets with high performance. You need to retain the data in Spark's memory to improve performance dramatically if the data is to be manipulated through multiple stages of processing.
Akka is used in PySpark for scheduling. When a worker requests a task from the master after registering, the master assigns it a task. In this case, Akka sends and receives messages between the workers and the master.
The startswith() and endswith() methods in PySpark belong to the Column class and are used to search DataFrame rows by checking whether a column value starts or ends with a given value. Both are used for filtering data in applications.
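A minimal sketch of filtering with these Column methods, using illustrative data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("StartsWithExample").getOrCreate()

df = spark.createDataFrame([("Alice",), ("Bob",), ("Amanda",)], ["name"])

# Keep rows whose name starts with "A"
df.filter(col("name").startswith("A")).show()

# Keep rows whose name ends with "b"
df.filter(col("name").endswith("b")).show()
```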
RDD lineage is the procedure used to reconstruct lost data partitions. Spark does not support data replication in memory, so if any data is lost, it has to be rebuilt using RDD lineage. This works because an RDD always remembers how it was constructed from other datasets.
Yes, we can create a PySpark DataFrame from external data sources. Real-time applications use external storage systems like the local file system, HDFS, HBase, MySQL tables, Amazon S3, Azure, etc. The following example shows how to create a DataFrame by reading data from a CSV file present on the local system:
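A minimal sketch, assuming an illustrative local path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CsvExample").getOrCreate()

# "/tmp/data.csv" is an illustrative path; header and schema inference are optional
df = spark.read.csv("/tmp/data.csv", header=True, inferSchema=True)

df.printSchema()
df.show(5)
```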
PySpark supports CSV, text, Avro, Parquet, TSV, and many other file formats.
Following is the list of main attributes used in SparkConf:
We can use the following steps to associate Spark with Mesos:
Spark supports the following three file systems:
We can trigger automatic cleanups in Spark by setting the parameter 'spark.cleaner.ttl', or by separating long-running jobs into different batches and writing the intermediate results to disk.
We can limit data movement when working with Spark in the following ways:
Hive uses HQL (Hive Query Language), and Spark SQL uses SQL (Structured Query Language) for processing and querying data. We can easily join SQL tables and HQL tables in Spark SQL. Spark SQL works as a distinct module on top of the Spark Core engine and supports both SQL and Hive Query Language without changing any syntax.
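A minimal sketch of enabling Hive support so that Spark SQL can query Hive tables (the table name is illustrative):

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark SQL read Hive metastore tables and run HQL-style queries
spark = (
    SparkSession.builder
    .appName("HiveExample")
    .enableHiveSupport()
    .getOrCreate()
)

# "employees" is an illustrative Hive table name
spark.sql("SELECT * FROM employees LIMIT 10").show()
```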
In PySpark, DStream stands for Discretized Stream. It is a continuous stream of data represented as a group of RDDs divided into small batches. Also known as the Apache Spark Discretized Stream, it is built on Spark RDDs and enables Spark Streaming to integrate seamlessly with other Apache Spark components like Spark MLlib and Spark SQL.
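A minimal sketch of a DStream reading from a socket, assuming an illustrative host and port:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "DStreamExample")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

# Each micro-batch of lines from the socket becomes an RDD inside the DStream
lines = ssc.socketTextStream("localhost", 9999)  # illustrative host and port
word_counts = (
    lines.flatMap(lambda line: line.split())
         .map(lambda word: (word, 1))
         .reduceByKey(lambda a, b: a + b)
)
word_counts.pprint()

ssc.start()
ssc.awaitTermination()
```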