Deferred Evaluation of Tall Arrays

One of the differences between tall arrays and in-memory MATLAB® arrays is that tall arrays typically remain unevaluated until you request that calculations be performed. (The exceptions to this rule include plotting functions like plot and histogram and some statistical fitting functions like fitlm, which automatically evaluate tall array inputs.) While a tall array is in an unevaluated state, MATLAB might not know its size, its data type, or the specific values it contains. However, you can still use unevaluated arrays in your calculations as if the values were known. This allows you to work quickly with large data sets instead of waiting for each command to execute. For this reason, use gather only when you require output.

MATLAB keeps track of all the operations you perform on unevaluated tall arrays as you enter them. When you eventually call gather to evaluate the queued operations, MATLAB uses the history of unevaluated commands to optimize the calculation by minimizing the number of passes through the data. Used properly, this optimization can save substantial execution time by eliminating unnecessary passes through large data sets.
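
As a minimal sketch of this deferred behavior, the following assumes a small in-memory matrix converted with tall purely for illustration (the variable names A, B, colSum, and result are hypothetical). The intermediate statements are queued, and the values are computed only when gather is called.

```matlab
% Illustrative data only: a small in-memory matrix converted to a tall array.
A = tall(magic(4));

% These statements are queued rather than executed immediately.
B = A .* 2;
colSum = sum(B, 1);

% gather triggers evaluation and returns an ordinary in-memory array.
result = gather(colSum);
```

With real data sets the tall array would normally come from a datastore, and the same pattern applies: build up the expression first, then gather once.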

Display of Unevaluated Tall Arrays

The display of unevaluated tall arrays varies depending on how much MATLAB knows about the array and its values. There are three pieces of information reflected in the display:

  • Array size — Unknown dimension sizes are represented by the variables M or N in the display. If no dimension sizes are known, then the size appears as M×N×....

  • Array data type — If the array has an unknown underlying data type, then its type appears as tall array. If the type is known, it is listed as, for example, tall double array.

  • Array values — If the array values are unknown, then they appear as ?. Known values are displayed.

MATLAB might know all, some, or none of these pieces of information about a given tall array, depending on the nature of the calculation.

For example, if the array has a known data type but unknown size and values, then the unevaluated tall array might look like this:

M×N×... tall double array

    ?    ?    ?    ...
    ?    ?    ?    ...
    ?    ?    ?    ...
    :    :    :
    :    :    :

If the type and some of the dimension sizes are known, then the display could be:

 1×N tall char array

    ?    ?    ?   ...

If some of the data is known, then MATLAB displays the known values:

  100×3 tall double matrix

    0.8147    0.1622    0.6443
    0.9058    0.7943    0.3786
    0.1270    0.3112    0.8116
    0.9134    0.5285    0.5328
    0.6324    0.1656    0.3507
    0.0975    0.6020    0.9390
    0.2785    0.2630    0.8759
    0.5469    0.6541    0.5502
      :         :         :
      :         :         :
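
If you want to check these pieces of information programmatically rather than read them from the display, one possible approach is sketched below. It assumes a tall array built from an in-memory random matrix for illustration only (the names t, cls, and sz are hypothetical).

```matlab
% Illustrative tall array created from in-memory data.
t = tall(rand(100, 3));

% classUnderlying and size return tall results, so gather them
% to obtain in-memory values.
cls = gather(classUnderlying(t));   % underlying data type as text
sz  = gather(size(t));              % dimension sizes as a row vector
```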

Evaluation with gather

The gather function is used to evaluate tall arrays. gather accepts tall arrays as inputs and returns in-memory arrays as outputs. For this reason, you can think of this function as a bridge between tall arrays and in-memory arrays. For example, you cannot control if or while loop statements using a tall logical array, but once the array is evaluated with gather it becomes an in-memory logical value that you can use in these contexts.
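
The following sketch shows the bridging role of gather in an if statement. The data and the names t, anyLarge, and msg are assumptions made for illustration.

```matlab
% Illustrative data only.
t = tall([3; 7; 12]);

% anyLarge is a tall logical and is still unevaluated.
anyLarge = any(t > 10);

% An if statement needs an in-memory logical, so gather the value first.
if gather(anyLarge)
    msg = 'At least one value exceeds 10';
else
    msg = 'No value exceeds 10';
end
```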

gather performs all queued operations on a tall array and returns the entire result in memory. Since gather returns results as in-memory MATLAB arrays, standard memory considerations apply. MATLAB might run out of memory if the result returned by gather is too large.

Most of the time you can use gather to see the entire result of a calculation, particularly if the calculation includes a reduction operation such as sum or mean. However, if the result is too large to fit in memory, then you can use gather(head(X)) or gather(tail(X)) to perform the calculation and look at only the first or last few rows of the result.
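
A brief sketch of this pattern follows, assuming a tall column vector created from in-memory data for illustration (X, Y, firstRows, and lastRows are hypothetical names). The second argument to head and tail sets how many rows to return.

```matlab
% Illustrative data only.
X = tall((1:1000)');
Y = X.^2;                          % unevaluated tall result

% Bring only a few rows into memory instead of the whole array.
firstRows = gather(head(Y, 5));    % first 5 rows
lastRows  = gather(tail(Y, 5));    % last 5 rows
```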

Resolve Errors with gather

If you enter an erroneous command and gather fails to evaluate a tall array variable, then you must delete the variable from your workspace and recreate the tall array using only valid commands. This is because MATLAB keeps track of all the operations you perform on unevaluated tall arrays as you enter them. The only way to make MATLAB “forget” about an erroneous statement is to reconstruct the tall array from scratch.
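
As a rough sketch of that recovery step, suppose an invalid operation was applied to a tall table named tt and a later gather failed. The datastore settings below mirror the examples in this topic and are otherwise assumptions.

```matlab
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', {'ArrDelay', 'DepDelay'});
tt = tall(ds);

% ... suppose an erroneous command was applied to tt here and gather failed ...

% Delete the affected variable and recreate it from the datastore,
% then reapply only the valid commands.
clear tt
tt = tall(ds);
```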

Example: Calculate Size of Tall Array

This example shows what an unevaluated tall array looks like, and how to evaluate the array.

Create a datastore for the data set airlinesmall.csv. Convert the datastore into a tall table and then calculate the size.

varnames = {'ArrDelay', 'DepDelay', 'Origin', 'Dest'};
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', varnames);
tt = tall(ds)
tt =

  M×4 tall table

    ArrDelay    DepDelay    Origin    Dest 
    ________    ________    ______    _____

        8          12       'LAX'     'SJC'
        8           1       'SJC'     'BUR'
       21          20       'SAN'     'SMF'
       13          12       'BUR'     'SJC'
        4          -1       'SMF'     'LAX'
       59          63       'LAX'     'SJC'
        3          -2       'SAN'     'SFO'
       11          -1       'SEA'     'LAX'
       :           :          :         :
       :           :          :         :
s = size(tt)
s =

  1×2 tall double row vector

    ?    ?

Preview deferred. Learn more.

Calculating the size of a tall array returns a small answer (a 1-by-2 vector), but the display indicates that an entire pass through the data is still required to calculate the size of tt.

Use the gather function to fully evaluate the tall array and bring the results into memory. As the command executes, there is a dynamic progress display in the command window that is particularly helpful with long calculations.

Note

Always ensure that the result returned by gather will be able to fit in memory. If you use gather directly on a tall array without reducing its size using a function such as mean, then MATLAB might run out of memory.

tableSize = gather(s)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: Completed in 0.42 sec
Evaluation completed in 0.48 sec

tableSize =

      123523           4

Example: Multipass Calculations with Tall Arrays

This example shows how several calculations can be combined to minimize the total number of passes through the data.

Create a datastore for the data set airlinesmall.csv. Convert the datastore into a tall table.

varnames = {'ArrDelay', 'DepDelay', 'Origin', 'Dest'};
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', varnames);
tt = tall(ds)
tt =

  M×4 tall table

    ArrDelay    DepDelay    Origin    Dest 
    ________    ________    ______    _____

        8          12       'LAX'     'SJC'
        8           1       'SJC'     'BUR'
       21          20       'SAN'     'SMF'
       13          12       'BUR'     'SJC'
        4          -1       'SMF'     'LAX'
       59          63       'LAX'     'SJC'
        3          -2       'SAN'     'SFO'
       11          -1       'SEA'     'LAX'
       :           :          :         :
       :           :          :         :

Subtract the mean value of DepDelay from ArrDelay to create a new variable AdjArrDelay. Then calculate the mean value of AdjArrDelay and subtract this mean value from AdjArrDelay. If these calculations were all evaluated separately, then MATLAB would require four passes through the data.

AdjArrDelay = tt.ArrDelay - mean(tt.DepDelay,'omitnan');
AdjArrDelay = AdjArrDelay - mean(AdjArrDelay,'omitnan')
AdjArrDelay =

  M×1 tall double column vector

    ?
    ?
    ?
    :
    :

Preview deferred. Learn more.

Evaluate AdjArrDelay and view the first few rows. Because some calculations can be combined, only three passes through the data are required.

gather(head(AdjArrDelay))
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 3: Completed in 0.4 sec
- Pass 2 of 3: Completed in 0.39 sec
- Pass 3 of 3: Completed in 0.23 sec
Evaluation completed in 1.2 sec

ans =

    0.8799
    0.8799
   13.8799
    5.8799
   -3.1201
   51.8799
   -4.1201
    3.8799
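
A related way to let MATLAB combine work is to pass several unevaluated results to a single gather call, which accepts multiple inputs and returns the corresponding outputs. The sketch below assumes the same airlinesmall.csv data set; the names meanArr and meanDep are hypothetical, and no particular number of passes is implied.

```matlab
varnames = {'ArrDelay', 'DepDelay'};
ds = datastore('airlinesmall.csv', 'TreatAsMissing', 'NA', ...
    'SelectedVariableNames', varnames);
tt = tall(ds);

% Queue two independent reductions without evaluating them.
meanArr = mean(tt.ArrDelay, 'omitnan');
meanDep = mean(tt.DepDelay, 'omitnan');

% Evaluate both in one gather call so they can be computed together.
[meanArr, meanDep] = gather(meanArr, meanDep);
```

Calling gather separately on each result would evaluate them one at a time, so MATLAB would have less opportunity to share passes through the data.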

Summary of Behavior and Recommendations

  1. Tall arrays remain unevaluated until you request output using gather.

  2. Use gather in most cases to evaluate tall array calculations. If you believe the result of the calculations might not fit in memory, then use gather(head(X)) or gather(tail(X)) instead.

  3. Work primarily with unevaluated tall arrays and request output only when necessary. The more calculations that are queued, the more MATLAB can optimize them to minimize the number of passes through the data.

  4. If you enter an erroneous tall array command and gather fails to evaluate a tall array variable, then you must delete the variable from your workspace and recreate the tall array using only valid commands.
