parallel.cluster.Spark
Description
A parallel.cluster.Spark object represents and provides access to a Spark™ cluster. Use the parallel.cluster.Spark object as input to the mapreduce and mapreducer functions to specify the Spark cluster as the parallel execution environment for tall arrays and mapreduce.
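For example, here is a minimal sketch of that workflow. The profile name "SparkProfile" and the use of the airlinesmall.csv sample data set are assumptions for illustration; on a real Spark cluster the data would typically be read from HDFS.

```matlab
% Create a cluster object from a Spark cluster profile
% ("SparkProfile" is an assumed profile name).
sparkCluster = parcluster("SparkProfile");

% Make the Spark cluster the execution environment for
% mapreduce and tall arrays.
mapreducer(sparkCluster);

% Operations on the tall array now run on the Spark cluster.
ds = datastore("airlinesmall.csv", "TreatAsMissing", "NA");
tt = tall(ds);
meanDelay = gather(mean(tt.ArrDelay, "omitnan"));
```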
Creation
Use the parcluster
function to create a
parallel.cluster.Spark
cluster object from a Spark cluster profile. Alternatively, use the
parallel.cluster.Spark
function (described here) to create a
Spark cluster object.
Description
sparkCluster = parallel.cluster.Spark creates a parallel.cluster.Spark object representing the Spark cluster.

sparkCluster = parallel.cluster.Spark(Name,Value) sets the optional properties using one or more name-value arguments on the parallel.cluster.Spark object. For example, to change the Spark install folder, use SparkInstallFolder="/share/spark/spark-3.3.0". For a list of valid properties, see Properties.
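For example, a sketch of creating the cluster object directly and setting the install folder (the path shown is the example value from above; adjust it to your installation):

```matlab
% Create a Spark cluster object, pointing at the Spark installation
% (the folder path is an example; use your cluster's install folder).
sparkCluster = parallel.cluster.Spark( ...
    SparkInstallFolder="/share/spark/spark-3.3.0");
```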
Properties
Object Functions
Function | Description |
---|---|
mapreduce | Programming technique for analyzing data sets that do not fit in memory |
mapreducer | Define parallel execution environment for mapreduce and tall arrays |
saveAsProfile | Save cluster properties to specified profile |
saveProfile | Save modified cluster properties to its current profile |
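For example, a sketch of saving modified cluster properties to a new profile (both profile names are assumptions):

```matlab
% Modify a cluster property, then store the configuration
% under a new profile name for reuse.
sparkCluster = parcluster("SparkProfile");
sparkCluster.SparkProperties('spark.executor.memory') = '4096m';
saveAsProfile(sparkCluster, "MySparkProfile");
```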
Limitations
Spark cluster profiles do not support being set as the default profile.
Spark clusters do not support parallel pools or batch jobs.
Tips
Spark clusters place limits on how much memory is available. You must adjust the amount of data you gather to the client to support your workflow.
The amount of data gathered to the client is limited by the Spark properties:
spark.driver.memory
spark.executor.memory
The default value of the spark.executor.memory property of a Spark job submitted from MATLAB is 2560 MB.
The amount of data gathered from a single Spark task must fit within these properties. A single Spark task processes one block of data from HDFS, which is 128 MB by default. If you gather a tall array containing most of the original data, these properties must be set large enough to fit the data.
If these properties are set too small, you see an error like the following:

Error using tall/gather (line 50)
Out of memory; unable to gather a partition of size 300m from Spark.
Adjust the values of the Spark properties spark.driver.memory and spark.executor.memory to fit this partition.
Adjust the properties either in the default settings of the cluster or directly in MATLAB. To adjust the properties in MATLAB, you can add these Spark properties to the SparkProperties table of the Spark cluster profile.
Name | Value | Type |
---|---|---|
spark.driver.memory | 2048m | String |
spark.executor.memory | 2048m | String |
You can also edit the Spark cluster object.
cluster = parcluster("SparkProfile");
cluster.SparkProperties('spark.driver.memory') = '2048m';
cluster.SparkProperties('spark.executor.memory') = '2048m';
mapreducer(cluster);