MATLAB Answers


Mapreduce on parallel cluster - Database or disk full - How to control storage of intermediate files?

Asked by Christian on 9 Jul 2019
Latest activity Answered by Christian on 10 Jul 2019
I wish to calculate several statistics (spectra, correlation functions, etc.) of ~400 files with 6e6 doubles per file, and afterwards average over all files to get average spectra, correlation functions, etc. To speed things up, I am trying to use mapreduce on a parallel cluster. This works like a charm as long as there are relatively few files (~100), but with a larger number of files I get this error message:
Error using parallel.mapreduce.KeyValueOutputStore/addmulti (line 63)
Error in adding keys and values.
Error in Analysis20190708>Analysis (line 115)
addmulti(intermKVStore, {'StatNames'}, {Stats});
Error in parallel.internal.pool.deserialize>@(data,info,intermKVStore)Analysis(data,Parameters,info,intermKVStore)
Error in mapreduce (line 116)
outds = execMapReduce(mrcer, ds, mapfun, reducefun, parsedStruct);
Error in Analysis20190708 (line 72)
outDS = mapreduce(ds, mapper, @reduceAnalysis,inpool);
Caused by:
The database /tmp/filename/TaskOutput7.db is full. (database or disk is full)
The message occurs after around 50% of the map phase is done; it occurs later if I reduce the size of the result vectors (less frequently sampled spectra, for example). I checked with an admin, and /tmp indeed has very little free space.
The question is now: how do I tell MATLAB to store these intermediate(?) files in a different location with more storage?
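For context, the mapper follows the usual pattern of computing per-file statistics and writing them to the intermediate key-value store via addmulti, as in the error trace above. A minimal sketch of that pattern (function and variable names other than addmulti, intermKVStore, and reduceAnalysis are hypothetical placeholders, not the actual code):

```matlab
% Sketch of the map/reduce pattern described above.
% computeSpectrum and the field name 'Signal' are placeholders.
function Analysis(data, info, intermKVStore)
    x = data.Signal{1};                  % one file's ~6e6 doubles
    Stats = computeSpectrum(x);          % per-file statistic (hypothetical)
    addmulti(intermKVStore, {'StatNames'}, {Stats});
end

function reduceAnalysis(key, valueIter, outKVStore)
    total = [];
    n = 0;
    while hasnext(valueIter)             % accumulate results from all files
        v = getnext(valueIter);
        if isempty(total)
            total = v;
        else
            total = total + v;
        end
        n = n + 1;
    end
    add(outKVStore, key, total / n);     % average over all files
end
```

Every intermediate key-value pair written by addmulti is spilled to disk, which is why the volume of intermediate data, and not just the input size, determines how much temporary storage the job needs.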


1 Answer

Answer by Christian on 10 Jul 2019
 Accepted Answer

Figured it out:
The intermediate data is stored in the standard directory for temporary files, as returned by
tempdir
Setting the corresponding environment variable to a directory with more capacity solves the issue, e.g.:
setenv('TMP', 'LargerDirectory')
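For reference, this can be done at the top of the script, before the parallel pool is created, so the setting is in effect when the workers spill intermediate data. The path below is a placeholder; also, depending on the platform, tempdir may consult TMPDIR (Unix) rather than TMP (Windows), so setting both is a safe hedge:

```matlab
% Point MATLAB's temporary directory at a volume with more free space.
% '/scratch/username/mltmp' is a placeholder path; use an existing
% directory on a large filesystem.
setenv('TMP',    '/scratch/username/mltmp');   % Windows-style variable
setenv('TMPDIR', '/scratch/username/mltmp');   % Unix-style variable
disp(tempdir)   % verify that MATLAB now reports the new location
```

If the pool was already running, it may need to be restarted (delete(gcp('nocreate')) and then parpool) so that the workers pick up the new environment.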
