Apache Spark Basics

Apache® Spark™ is a fast, general-purpose engine for large-scale data processing.

Every Spark application consists of a driver program that manages the execution of your application on a cluster. The workers on a Spark-enabled cluster are referred to as executors. The driver process runs the user code on these executors.

In a typical Spark application, your code will establish a SparkContext, create a Resilient Distributed Dataset (RDD) from external data, and then execute methods known as transformations and actions on that RDD to arrive at the outcome of an analysis.

An RDD is the main programming abstraction in Spark and represents an immutable collection of elements partitioned across the nodes of a cluster that can be operated on in parallel. A Spark application can run locally on a single machine or on a cluster.

Spark is mainly written in Scala® and has APIs in other programming languages, including MATLAB®. The MATLAB API for Spark exposes the Spark programming model to MATLAB and enables MATLAB implementations of numerous Spark functions. Many of these MATLAB implementations of Spark functions accept function handles or anonymous functions as inputs to perform various types of analyses.
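For illustration, a minimal sketch of this flow using the MATLAB API for Spark might look like the following; the application name, master URL, and data are placeholders:

    % Configure the application (name and master URL are illustrative)
    conf = matlab.compiler.mlspark.SparkConf( ...
        'AppName', 'myApp', ...
        'Master', 'local[1]');

    % Establish a connection to Spark
    sc = matlab.compiler.mlspark.SparkContext(conf);

    % Create an RDD from an in-memory collection
    numbers = sc.parallelize({1, 2, 3, 4, 5});

    % Transformation: square each element using an anonymous function
    squared = numbers.map(@(x) x^2);

    % Action: return the results to the driver
    result = squared.collect();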

Running against Spark

To run against Spark means executing an application against a Spark-enabled cluster using a supported cluster manager. A cluster can be local or on a network. You can run against Spark in two ways:

  • Execute commands in an interactive shell that is connected to Spark.

  • Create and execute a standalone application against a Spark cluster.

When you use an interactive shell, Spark lets you interact with data that is distributed on disk or in memory across many machines and perform ad hoc analysis. Spark takes care of the underlying distribution of work across the machines. Interactive shells are available only in Python® and Scala.

The MATLAB API for Spark in MATLAB Compiler™ provides an interactive shell similar to a Spark shell that allows you to debug your application prior to deploying it. The interactive shell only runs against a local cluster.

To create and execute standalone applications against Spark, you first package or compile the applications and then execute them against a Spark-enabled cluster. You can author standalone applications in Scala, Java®, Python, and MATLAB.

The MATLAB API for Spark in MATLAB Compiler lets you create standalone applications that can run against Spark.

Cluster Managers Supported by Spark

Local

A local cluster manager represents a pseudo-cluster and works in a nondistributed mode on a single machine. You can configure it to use one worker thread, or on a multicore machine, multiple worker threads. In applications, it is denoted by the word local.
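For example, a sketch of a configuration that uses this pseudo-cluster, where local[2] requests two worker threads (the application name is a placeholder):

    % 'local' or 'local[K]' selects the local cluster manager
    conf = matlab.compiler.mlspark.SparkConf( ...
        'AppName', 'myLocalApp', ...
        'Master', 'local[2]');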

Note

The MATLAB API for Spark, which allows you to interactively debug your applications, works only with a local cluster manager.

Standalone

A Standalone cluster manager ships with Spark. It consists of a master and multiple workers. To use a Standalone cluster manager, place a compiled version of Spark on each cluster node. A Standalone cluster manager can be started using scripts provided by Spark. In applications, it is denoted as: spark://host:port. The default port number is 7077.

Note

The Standalone cluster manager that ships with Spark is not to be confused with the standalone application that can run against Spark. MATLAB Compiler does not support the Standalone cluster manager.

YARN

The YARN cluster manager was introduced in Hadoop® 2.0. It is typically installed on the same nodes as HDFS™, so running Spark on YARN lets Spark access HDFS data easily. In applications, it is denoted using the term yarn. Two modes are available when starting applications on YARN:

  • In yarn-client mode, the driver runs in the client process, and the application master is used only for requesting resources from YARN.

  • In yarn-cluster mode, the Spark driver runs inside an application master process that is managed by YARN on the cluster, and the client can retire after initiating the application.
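For example, a sketch of a configuration that starts an application in yarn-client mode (the application name is a placeholder):

    % 'yarn-client' runs the driver in the client process
    conf = matlab.compiler.mlspark.SparkConf( ...
        'AppName', 'myYarnApp', ...
        'Master', 'yarn-client');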

Note

MATLAB Compiler supports the YARN cluster manager only in yarn-client mode.

Mesos

Mesos is an open-source cluster manager developed by Apache. In applications, it is usually denoted as: mesos://host:port. The default port number is 5050.

Note

MATLAB Compiler does not support a Mesos cluster manager.

You can use the following table to see which MATLAB Compiler deployment option is supported by each cluster manager.

Deploy Against Spark Option                                               Local Cluster (local)    Hadoop Cluster (yarn-client)

Deploy standalone applications containing tall arrays.                    Not supported            Supported

Deploy standalone applications created using the MATLAB API for Spark.    Supported                Supported

Interactively debug your applications using the MATLAB API for Spark.     Supported                Not supported

Relationship Between Spark and Hadoop

The relationship between Spark and Hadoop comes into play only if you want to run Spark on a cluster that has Hadoop installed. Otherwise, you do not need Hadoop to run Spark.

To run Spark on a cluster, you need a shared file system. A Hadoop cluster provides access to a distributed file system via HDFS and a cluster manager in the form of YARN. Spark can use YARN as a cluster manager for distributing work and use HDFS to access data. Some Spark applications can also use Hadoop’s MapReduce programming model, but MapReduce is not the core programming model in Spark.

Hadoop is not required to run Spark on a cluster. You can also use other options, such as Mesos.

Note

The deployment options in MATLAB Compiler currently support deploying only against a Spark-enabled Hadoop cluster.

Driver

Every Spark application consists of a driver program that initiates various operations on a cluster. The driver is a process in which the main() method of a program runs. The driver process runs user code that creates a SparkContext, creates RDDs, and performs transformations and actions. When a Spark driver executes, it performs two duties:

  • Convert a user program into tasks.

    The Spark driver application is responsible for converting a user program into units of physical execution called tasks. Tasks are the smallest unit of work in Spark.

  • Schedule tasks on executors.

    The Spark driver tries to schedule each task in an appropriate location, based on data placement. It also tracks the location of cached data, and uses it to schedule future tasks that access that data.

Once the driver terminates, the application is finished.

Note

When using the MATLAB API for Spark in MATLAB Compiler, MATLAB application code becomes the Spark driver program.

Executor

A Spark executor is a worker process responsible for running the individual tasks in a given Spark job. Executors are started at the beginning of a Spark application and persist for the entire lifetime of an application. Executors perform two roles:

  • Run the tasks that make up the application, and return the results to the driver.

  • Provide in-memory storage for RDDs that are cached by user programs.
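For example, caching an RDD asks the executors to keep its partitions in memory so that later actions can reuse them. A minimal sketch, assuming an existing RDD named rdd:

    % Mark the RDD for in-memory storage on the executors
    rdd.cache();

    % Subsequent actions can reuse the cached partitions
    n = rdd.count();
    data = rdd.collect();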

RDD

A Resilient Distributed Dataset, or RDD, is a programming abstraction in Spark. It represents a collection of elements distributed across many nodes that can be operated on in parallel. RDDs are fault-tolerant: Spark can recompute lost partitions from the lineage of transformations used to create them. You can create RDDs in two ways:

  • By loading an external dataset.

  • By parallelizing a collection of objects in the driver program.
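As a sketch of both approaches, assuming an existing SparkContext named sc and a hypothetical HDFS path:

    % 1. Load an external dataset (the path is hypothetical)
    lines = sc.textFile('hdfs:///data/input.txt');

    % 2. Parallelize a collection of objects in the driver program
    values = sc.parallelize({10, 20, 30, 40});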

After creation, you can perform two types of operations using RDDs: transformations and actions.

Transformations

Transformations are operations on an existing RDD that return a new RDD. Many, but not all, transformations are element-wise operations.

Actions

Actions compute a final result based on an RDD and either return that result to the driver program or save it to an external storage system such as HDFS.

Distinguishing Between Transformations and Actions

To distinguish between the two, check the return data type: transformations return RDDs, whereas actions return other data types.
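For example, assuming an existing RDD named lines, filter is a transformation because it returns another RDD, whereas count is an action because it returns a number to the driver:

    % Transformation: returns a new RDD; nothing is computed yet
    errors = lines.filter(@(line) contains(line, 'ERROR'));

    % Action: triggers computation and returns a numeric result
    numErrors = errors.count();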

SparkConf

SparkConf stores the configuration parameters of the application being deployed to Spark. Every application must be configured prior to being deployed on a Spark cluster. Some of the configuration parameters define properties of the application and some are used by Spark to allocate resources on the cluster. The configuration parameters are passed to a Spark cluster through a SparkContext.

SparkContext

SparkContext represents a connection to a Spark cluster. It is the entry point to Spark and sets up the internal services necessary to establish a connection to the Spark execution environment.
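A sketch of this handoff, with illustrative property values and application name:

    % Additional Spark properties can be supplied as a containers.Map
    sparkProp = containers.Map( ...
        {'spark.executor.cores', 'spark.executor.memory'}, ...
        {'1', '2g'});

    % SparkConf collects the configuration parameters
    conf = matlab.compiler.mlspark.SparkConf( ...
        'AppName', 'myApp', ...
        'Master', 'yarn-client', ...
        'SparkProperties', sparkProp);

    % SparkContext passes the configuration to the Spark cluster
    sc = matlab.compiler.mlspark.SparkContext(conf);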