parallel.cluster.Spark

Spark cluster for mapreducer, mapreduce and tall arrays

Since R2022b

    Description

    A parallel.cluster.Spark object represents and provides access to a Spark™ cluster. Use the parallel.cluster.Spark object as input to the mapreduce and mapreducer functions to specify the Spark cluster as the parallel execution environment for tall arrays and mapreduce.

    Creation

    Use the parcluster function to create a parallel.cluster.Spark cluster object from a Spark cluster profile. Alternatively, use the parallel.cluster.Spark function (described here) to create a Spark cluster object.

    Description

    sparkCluster = parallel.cluster.Spark creates a parallel.cluster.Spark object representing the Spark cluster.

    sparkCluster = parallel.cluster.Spark(Name,Value) sets the optional properties using one or more name-value arguments on the parallel.cluster.Spark object. For example, to change the Spark install folder, use SparkInstallFolder="/share/spark/spark-3.3.0". For a list of valid properties, see Properties.

    Properties

    AdditionalPaths

    Folders to add to the MATLAB search path of workers, specified as a character vector, string or string array, or cell array of character vectors.

    When you offload computations to workers, any files that the client needs for computations must also be available on workers. By default, the client attempts to detect and attach these files. To turn off automatic detection, set the AutoAttachFiles property to false. If the software cannot find all the files, or if sending files from client to worker is slow, use one of these options.

    • If the files are in a folder that is not accessible on the workers, set the AttachedFiles property. The cluster copies each file you specify from the client to the workers.

    • If the files are in a folder that is accessible on the workers, you can set the AdditionalPaths property instead. Use the AdditionalPaths property to add paths to the MATLAB search path for each worker and avoid copying files unnecessarily from the client to the workers.
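As a minimal sketch of the two options above, you might set these properties on the cluster object before calling mapreducer. The profile name "SparkProfile" matches the examples on this page; the file and folder names are hypothetical placeholders. Both options appear together only for illustration, and in practice you would typically pick the one that matches where your files live.

```matlab
% Sketch only: the file and folder names below are hypothetical.
sparkCluster = parcluster("SparkProfile");

% Option 1: the code is not accessible on the workers,
% so copy the files from the client to the workers.
sparkCluster.AttachedFiles = ["myMapper.m", "myReducer.m"];

% Option 2: the code is already accessible on the workers (for example,
% on a network share), so add the folder to the worker search path.
sparkCluster.AdditionalPaths = "/shared/project/code";

% Optionally turn off automatic detection once files are listed explicitly.
sparkCluster.AutoAttachFiles = false;

mr = mapreducer(sparkCluster);
```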

    AttachedFiles

    Files and folders sent to workers during a mapreduce call, specified as a character vector, string or string array, or cell array of character vectors.

    AutoAttachFiles

    Specify whether to automatically detect and attach files on the client.

    Data Types: logical

    ClusterMatlabRoot

    Path to MATLAB for workers, specified as the comma-separated pair consisting of 'ClusterMatlabRoot' and a character vector. This path points to the installation of MATLAB Parallel Server™ for the workers, whether local to each machine or on a network share.

    Data Types: string

    Jobs

    Since R2024b

    This property is read-only.

    Jobs contained within the cluster, returned as a parallel.Job object or an array of parallel.Job objects. When the cluster contains more than one job, MATLAB sorts the jobs in the array by their ID property. This sorting is consistent with the order in which you create the jobs, regardless of the values of the State property of each job.

    LicenseNumber

    License number to use with online licensing.

    Modified

    Since R2024a

    This property is read-only.

    Flag indicating whether any properties of this cluster differ from the cluster profile, returned as logical true (1) if you have modified the cluster properties and logical false (0) otherwise.

    Data Types: logical

    NumThreads

    Since R2024a

    Number of computational threads for workers, specified as a nonnegative integer.

    OperatingSystem

    Since R2024a

    Operating system of the cluster worker machines, specified as one of these values:

    • "windows"

    • "unix"

    • "mixed"

    Profile

    Since R2024a

    Name of the profile used to create the cluster object, specified as a character vector.

    Data Types: char

    RequiresOnlineLicensing

    Specify whether the Spark cluster uses online licensing.

    Data Types: logical

    SparkInstallFolder

    Path to the Spark installation on the client machine, specified as the comma-separated pair consisting of SparkInstallFolder and a character vector or string array. If this property is not set, the default is the value of the environment variable SPARK_PREFIX or, if that variable is not set, the value of SPARK_HOME.

    Data Types: char

    SparkProperties

    Map of Spark name-value property pairs to be given to the Spark cluster.

    SparkProperties allows you to override configuration properties for Spark. See the list of properties in the Spark documentation.

    Type

    Since R2024a

    This property is read-only.

    Type of this cluster, returned as 'Spark'.

    UserData

    Since R2024a

    Data associated with the cluster object in the current session, specified as any MATLAB data type.

    Object Functions

    mapreduce        Programming technique for analyzing data sets that do not fit in memory
    mapreducer       Define parallel execution environment for mapreduce and tall arrays
    saveAsProfile    Save cluster properties to specified profile
    saveProfile      Save modified cluster properties to its current profile
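The two profile functions can be sketched as follows. This is an illustrative sketch, not a prescribed workflow: the profile name "SparkProfileTwoThreads" is hypothetical, and the NumThreads value is an arbitrary choice for demonstration.

```matlab
% Sketch only: "SparkProfileTwoThreads" is a hypothetical profile name.
sparkCluster = parcluster("SparkProfile");

% Change a property on the cluster object. The object now differs
% from the profile it was created from.
sparkCluster.NumThreads = 2;

% Save the current property values to a new profile ...
saveAsProfile(sparkCluster, "SparkProfileTwoThreads");

% ... or write the changes back to the profile the object is
% currently associated with:
% saveProfile(sparkCluster);
```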

    Examples

    Since R2024a

    Create and use a parallel.cluster.Spark object from a Spark cluster profile.

    To learn how to create a profile for your Spark cluster, see Client Configuration (MATLAB Parallel Server).

    sparkCluster = parcluster("SparkProfile")
    mr = mapreducer(sparkCluster);
    sparkCluster = 
    
     Spark Cluster
    
        Properties: 
    
                          Type: Spark
                       Profile: SparkProfile
                      Modified: false
                    NumThreads: 1
       RequiresOnlineLicensing: false
             ClusterMatlabRoot: /network/installs/MATLAB/R2024a/matlab
    
            SparkInstallFolder: /network/installs/spark/3.0.2-3.2
               SparkProperties: [1x1 parallel.cluster.SparkProperties]
    

    Manually create and use a parallel.cluster.Spark object.

    Create the cluster object by specifying the Spark installation on your machine, and set the Spark cluster as the mapreduce parallel execution environment.

    sparkCluster = parallel.cluster.Spark(SparkInstallFolder="/host/spark-install");
    mr = mapreducer(sparkCluster)
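Once the Spark cluster is the current mapreducer, tall array calculations run on it. The following is a minimal sketch under stated assumptions: the HDFS location, the file, and the variable name ArrDelay are hypothetical placeholders for your own data, and the install folder is the one used in the example above.

```matlab
% Sketch only: the data location and variable name are hypothetical.
sparkCluster = parallel.cluster.Spark(SparkInstallFolder="/host/spark-install");
mapreducer(sparkCluster);

% Create a datastore over data stored where the cluster can read it.
ds = datastore("hdfs:///data/airlinesmall.csv", ...
    TreatAsMissing="NA", SelectedVariableNames="ArrDelay");

% Tall arrays are evaluated lazily; gather triggers execution
% in the current mapreducer environment.
tt = tall(ds);
meanDelay = gather(mean(tt.ArrDelay, "omitnan"));
```

Gathering a reduced result such as a mean keeps the data returned to the client small, which matters given the memory limits described in Tips.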

    Limitations

    • Spark cluster profiles do not support being set as the default profile.

    • Spark clusters do not support parallel pools and batch jobs.

    Tips

    Spark clusters place limits on how much memory is available. You must adjust the size of the data to gather to support your workflow.

    The amount of data gathered to the client is limited by the Spark properties:

    • spark.driver.memory

    • spark.executor.memory

    The default value of the spark.executor.memory property of a Spark job submitted from MATLAB is 2560 MB.

    The data gathered from a single Spark task must fit within the memory limits that these properties set. A single Spark task processes one block of data from HDFS, which is 128 MB of data by default. If you gather a tall array containing most of the original data, you must ensure these properties are set large enough to hold it.

    If these properties are set too small, you see an error like the following.

    Error using tall/gather (line 50)
    Out of memory; unable to gather a partition of size 300m from Spark.
    Adjust the values of the Spark properties spark.driver.memory and 
    spark.executor.memory to fit this partition.

    The error message also specifies the property settings you need.

    Adjust the properties either in the default settings of the cluster or directly in MATLAB. To adjust the properties in MATLAB, you can add these Spark properties to the SparkProperties table of the Spark cluster profile.

    Name                     Value    Type
    spark.driver.memory      2048m    String
    spark.executor.memory    2048m    String

    You can also edit the Spark cluster object.

    cluster = parcluster("SparkProfile");
    cluster.SparkProperties('spark.driver.memory') = '2048m';
    cluster.SparkProperties('spark.executor.memory') = '2048m';
    mapreducer(cluster);

    Version History

    Introduced in R2022b