Tall Skinny QR (TSQR) Matrix Factorization Using MapReduce

Open Live Script

This example shows how to compute a tall skinny QR (TSQR) factorization using mapreduce. It demonstrates how to chain mapreduce calls to perform multiple iterations of factorizations, and uses the info argument of the map function to compute numeric keys.

Prepare Data

Create a datastore using the airlinesmall.csv data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. In this example, the variables of interest are ArrDelay (flight arrival delay), DepDelay (flight departure delay) and Distance (total flight distance).

ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA');
ds.ReadSize = 1000;
ds.SelectedVariableNames = {'ArrDelay', 'DepDelay', 'Distance'};

The datastore treats 'NA' values as missing and replaces the missing values with NaN values by default. The ReadSize property lets you specify how to partition the data into blocks. Additionally, the SelectedVariableNames property allows you to work with only the specified variables of interest, which you can verify using preview.

preview(ds)

ans=8×3 table
    ArrDelay    DepDelay    Distance
    ________    ________    ________

        8          12         308   
        8           1         296   
       21          20         480   
       13          12         296   
        4          -1         373   
       59          63         308   
        3          -2         447   
       11          -1         954

Chain MapReduce Calls

The implementation of the multi-iteration TSQR algorithm needs to chain consecutive mapreduce calls. To demonstrate the general chaining design pattern, this example uses two mapreduce iterations. The output from the map function calls is passed into a large set of reducers, and then the output of these reducers becomes the input for the next mapreduce iteration.

First MapReduce Iteration

In the first iteration, the map function, tsqrMapper, receives one block (the ith) of data, which is a table of size $N_{i} \times 3$ . The mapper computes the $R$ matrix of this block of data and stores it as an intermediate result. Then, mapreduce aggregates the intermediate results by unique key before sending them to the reduce function. Thus, mapreduce sends all intermediate $R$ matrices with the same key to the same reducer.

Since the reducer uses qr, which is an in-memory MATLAB® function, it's best to first make sure that the $R$ matrices fit in memory. This example divides the dataset into eight partitions. The mapreduce function reads the data in blocks and passes the data along with some meta information to the map function. The info input argument is the second input to the map function and it contains the read offset and file size information that are necessary to generate the key,

  key = ceil(offset/fileSize/numPartitions).

Display the map function file.

function tsqrMapper(data, info, intermKVStore)
  x = data{:,:};
  x(any(isnan(x),2),:) = [];% Remove missing values
  [~, r] = qr(x,0);

  % intermKey = randi(4); % random integer key for partitioning intermediate results
  intermKey = computeKey(info, 8);
  add(intermKVStore,intermKey, r);

  function key = computeKey(info, numPartitions)
    fileSize = info.FileSize; % total size of the underlying data file
    partitionSize = fileSize/numPartitions; % size in bytes of each partition
    offset = info.Offset; % offset in bytes of the current read
    key = ceil(offset/partitionSize);
  end
end

The reduce function receives a list of the intermediate $R$ matrices, vertically concatenates them, and computes the $R$ matrix of the concatenated matrix.

Display the reduce function file.

function tsqrReducer(intermKey, intermValIter, outKVStore)
  x = [];

  while (intermValIter.hasnext)
    x = [x;intermValIter.getnext];
  end
  % Note that this approach assumes the concatenated intermediate values fit
  % in memory. Consider increasing the number of reduce tasks (increasing the
  % number of partitions in the tsqrMapper) and adding more iterations if it
  % does not fit in memory.

  [~, r] =qr(x,0);

  add(outKVStore,intermKey,r);
end

Use mapreduce to apply the map and reduce functions to the datastore, ds.

outds1 = mapreduce(ds, @tsqrMapper, @tsqrReducer);

********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  10% Reduce   0%
Map  20% Reduce   0%
Map  30% Reduce   0%
Map  40% Reduce   0%
Map  50% Reduce   0%
Map  60% Reduce   0%
Map  70% Reduce   0%
Map  80% Reduce   0%
Map  90% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce  11%
Map 100% Reduce  22%
Map 100% Reduce  33%
Map 100% Reduce  44%
Map 100% Reduce  56%
Map 100% Reduce  67%
Map 100% Reduce  78%
Map 100% Reduce  89%
Map 100% Reduce 100%

mapreduce returns an output datastore, outds1, with files in the current folder.

Second MapReduce Iteration

The second iteration uses the output of the first iteration, outds1, as its input. This iteration uses an identity mapper, identityMapper, which simply copies over the data using a single key, 'Identity'.

Display the identity mapper file.

function identityMapper(data, info, intermKVStore)
  % This mapper function simply copies the data and add them to the
  % intermKVStore as intermediate values.
  x = data.Value{:,:};
  add(intermKVStore,'Identity', x);
end

The reducer function is the same in both iterations. The use of a single key by the map function means that mapreduce only calls the reduce function once in the second iteration.

Use mapreduce to apply the identity mapper and the same reducer to the output from the first mapreduce call.

outds2 = mapreduce(outds1, @identityMapper, @tsqrReducer);

********************************
*      MAPREDUCE PROGRESS      *
********************************
Map   0% Reduce   0%
Map  11% Reduce   0%
Map  22% Reduce   0%
Map  33% Reduce   0%
Map  44% Reduce   0%
Map  55% Reduce   0%
Map  66% Reduce   0%
Map  77% Reduce   0%
Map  88% Reduce   0%
Map 100% Reduce   0%
Map 100% Reduce 100%

View Results

Read the final results from the output datastore.

r = readall(outds2);
r.Value{:}

ans = 3×3
10⁵ ×

    0.1091    0.0893    0.5564
         0   -0.0478   -0.4890
         0         0    3.0130

References

Constantine, Paul G., and David F. Gleich. “Tall and Skinny QR Factorizations in MapReduce Architectures.” Proceedings of the Second International Workshop on MapReduce and Its Applications, ACM, 2011, pp. 43–50. https://dl.acm.org/doi/10.1145/1996092.1996103