How can I utilize tall arrays for my data structure without it being in a cell array?

I've recently inherited ownership of some legacy software that outputs data in a specific way that can't be modified. It also outputs a lot of that data, between 2 GB and 8 GB depending on the session. I have MATLAB code that reads all of this data into memory and plots it. That code works most of the time, but starts hitting memory issues around the 8 GB mark. I've gotten as far as I can with other memory-optimization tricks in my code (mostly making sure I'm not generating copies of such a large data array during intermediate calculations), and now I want to see if I can implement a datastore/tall array setup so the code works for arbitrarily large datasets in the future.
The data file format is as follows: there are 30 different data sources that we are keeping track of in time, so for every measurement snapshot we take, 31 values are stored (one timestamp and one measurement from each of the 30 sources). Each of these values is converted to a single, and the binary values are stored in sequence in a massive .bin file that holds these snapshots in sequence as well. Once the file gets to about half a gigabyte, a new file is opened and starts being filled up.
So, to read it back in, I would use (slight pseudocode)
data_ReadIntoMemory = zeros(31,numOfExpectedDataPoints,'single'); %numOfExpectedDataPoints can be calculated from the size of all the files; 'single' matches the file data so the array is not stored as double
for i = 1:numfiles
fileID = fopen(filename{i},'r','n');
currData = fread(fileID,inf,'*single',0,'ieee-le');
currData = reshape(currData,[31,length(currData)/31]);
data_ReadIntoMemory(:,startPoint:endPoint) = currData; %start/endpoint can be calculated using current and previously loaded file sizes
fclose(fileID);
end
This lets me take all the files, read them in, and create one massive array when I have the memory for it.
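The bookkeeping mentioned in the comments above (numOfExpectedDataPoints and the start/end points) could be derived from the file sizes roughly as follows. This is only a sketch: the folder name and the two small demo files are made up so that the snippet runs on its own, and it assumes every file holds a whole number of 31-value snapshots.

```matlab
% Hypothetical demo folder; replace with the real data directory.
dataDir = fullfile(tempdir, "bin_size_demo");
if ~isfolder(dataDir), mkdir(dataDir); end

% Write two small demo files (5 and 3 snapshots) so the sketch is runnable.
counts = [5 3];
for k = 1:numel(counts)
    fid = fopen(fullfile(dataDir, sprintf("demo_%d.bin", k)), "w");
    fwrite(fid, rand(31, counts(k), "single"), "single");
    fclose(fid);
end

valuesPerSnapshot = 31;   % 1 timestamp + 30 sources
bytesPerValue = 4;        % single precision

% File sizes in bytes give the number of snapshots in each file.
listing = dir(fullfile(dataDir, "*.bin"));
snapshotsPerFile = [listing.bytes] / (valuesPerSnapshot * bytesPerValue);
numOfExpectedDataPoints = sum(snapshotsPerFile);

% Column range that each file occupies in the preallocated array.
endPoint = cumsum(snapshotsPerFile);
startPoint = endPoint - snapshotsPerFile + 1;
```

With the two demo files, file 1 maps to columns 1 through 5 and file 2 to columns 6 through 8.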
For situations where there are too many files, here is how I'm trying to use the datastore functions:
function data = ReadBinFile(filename)
fileID = fopen(filename,'r','n');
data = fread(fileID,inf,'*single',0,'ieee-le');
data = reshape(data,[31,length(data)/31]);
fclose(fileID);
end
myDataStore = fileDatastore(myDirectory,"ReadFcn",@ReadBinFile,"FileExtensions",'.bin') %Some non-.bin files are in the directory so I need to filter for them
data_Datastore = tall(myDataStore)
Now, here's where I start to get issues. If I were to run gather(data_Datastore) on a dataset that was small enough to fit into memory, I don't get the same full data array that I get for data_ReadIntoMemory. I get a cell array where each cell holds the data from one file, and I would need to use horzcat(data_Datastore{:}) to get it into the same format as data_ReadIntoMemory.
This is an issue because the intermediate calculations that I need to run before plotting throw errors when they are evaluated on data_Datastore, and the reason is this cell format.
For example, running gather(data_Datastore(6,:)) throws an 'exceeding array bounds' error if fewer than 6 files were loaded, and the bound is always the number of files.
All this to frame the question: is there a way to read my data in so that it isn't a cell array and doesn't have to be fully read in and manipulated with horzcat first (since that would just cause the same low-memory issue that started all this)? I know tall arrays can be read in as tables for "tabular" data, but I can't find a way to get this to work with my current data structure.
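To make the symptom concrete, here is a small reproduction sketch. The folder, file sizes, and the helper name readOneFile are invented for the demo; mapreducer(0) is included only so that the tall evaluation stays in the local MATLAB session.

```matlab
mapreducer(0);   % evaluate tall expressions in the local session

% Hypothetical demo folder with three small files of 31-by-20 singles.
dirname = fullfile(tempdir, "tall_cell_demo");
if ~isfolder(dirname), mkdir(dirname); end
for ii = 1:3
    fh = fopen(fullfile(dirname, sprintf("file_%03d.bin", ii)), "w");
    fwrite(fh, rand(31, 20, "single"), "single");
    fclose(fh);
end

% Default fileDatastore behaviour: each read is wrapped in a cell.
ds = fileDatastore(dirname, "ReadFcn", @readOneFile, "FileExtensions", ".bin");
t = tall(ds);

perFile = gather(t);              % 3-by-1 cell array, one cell per file
combined = horzcat(perFile{:});   % 31-by-60, but only viable when it all fits in memory

function data = readOneFile(filename)
    % Same layout as in the question: 31 values per snapshot, one snapshot per column.
    fh = fopen(filename, "r");
    data = reshape(fread(fh, Inf, "*single"), 31, []);
    fclose(fh);
end
```

Because the tall array has one row per file, an index such as t(6,:) addresses the sixth file rather than the sixth data source, which matches the bounds error described above.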

3 Comments

data=zeros(31,numOfExpectedDataPoints); %numOfExpectedDataPoints can be calculated from the size of all the files
for i = 1:numfiles
fileID = fopen(filename{i},'r','n');
data(:,startPoint:endPoint)= reshape(fread(fileID,inf,'*single',0,'ieee-le'),31,[]);
fclose(fileID);
end
will avoid the extra array.
However, wouldn't the data be records of [timestamp, 30 values] stored sequentially, and wouldn't an N-by-31 array be wanted back, with the timestamp as the first column of the data array? It looks backward here to me...
How about writing just a little of a couple of these files and attaching them so folks have something that does match your case precisely to fiddle with if they're so inclined?
I've not yet messed around with the datastore, as it seems quite complex, and I've managed with memmapfile for the large(ish) datasets I've had to deal with recently.
But the datastore documentation specifically limits the output types to either table/timetable or cell. I don't comprehend that yet, either; if the data are all the same type, not being able to return the fundamental data class seems like added complexity to deal with.
I tried writing a .bin file to mimic your description and then creating a TallDatastore from it, but that doesn't work directly, either; one would have to go through the above sequence of reading all the files into memory and then write them back out with the write method of tall to do that. I haven't gotten far enough along to see about the 'file' type.
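For reference, the memmapfile route mentioned above might look something like the following for one file in this format. It is a sketch only: the demo file name and the field name x are made up, and memmapfile reads in the machine's native byte order, so this assumes a little-endian machine reading the little-endian files.

```matlab
% Write a small demo file: 6 snapshots of 31 single values each.
demoFile = fullfile(tempdir, 'memmap_demo.bin');
original = rand(31, 6, 'single');
fid = fopen(demoFile, 'w');
fwrite(fid, original, 'single');
fclose(fid);

% Number of snapshots follows from the file size (4 bytes per single).
info = dir(demoFile);
numSnapshots = info.bytes / (31 * 4);

% Map the file as a 31-by-N single array without loading it all into memory.
m = memmapfile(demoFile, 'Format', {'single', [31 numSnapshots], 'x'});

% Only the requested slice is pulled from disk.
sixthSource = m.Data.x(6, :);
```

This works per file, so spanning several .bin files would still need a loop over one map per file.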
I just noticed <FEX submittal cell2underlying> that might solve the actual question asked. Looking at the code it seems pretty complex, but it has high marks.


 Accepted Answer

You can use UniformRead=true in your fileDatastore to avoid the need for cell2underlying. Like this:
dirname = fullfile(tempdir, "data_files");
if ~isfolder(dirname)
mkdir(dirname);
end
makeSomeFiles(dirname);
% Without UniformRead - get a cell array
t1 = tall(fileDatastore(dirname, ReadFcn=@ReadBinFile, FileExtensions=".bin"));
head(t1)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: 0% complete
- Pass 1 of 1: Completed in 0.51 sec
Evaluation completed in 0.67 sec
    {100×31 single}
    {100×31 single}
    {100×31 single}
    {100×31 single}
    {100×31 single}
    {100×31 single}
    {100×31 single}
    {100×31 single}
% With UniformRead - get numeric array
t2 = tall(fileDatastore(dirname, ReadFcn=@ReadBinFile, ...
FileExtensions=".bin", UniformRead=true));
head(t2)
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 1: 0% complete
- Pass 1 of 1: Completed in 0.072 sec
Evaluation completed in 0.16 sec
  Columns 1 through 18
    0.3289    0.9308    0.4696    0.3568    0.6754    0.5802    0.4179    0.6907    0.3941    0.6753    0.6750    0.1697    0.6579    0.7239    0.1972    0.3395    0.5684    0.5779
    0.0016    0.1957    0.1327    0.4781    0.1755    0.6272    0.1420    0.1352    0.0553    0.6035    0.4224    0.7938    0.8141    0.2770    0.0373    0.7989    0.3175    0.3785
    0.4750    0.1540    0.2161    0.7176    0.2297    0.3621    0.8475    0.2358    0.6638    0.1494    0.9366    0.7775    0.3787    0.1691    0.3113    0.2292    0.9661    0.5620
    0.4041    0.6139    0.4539    0.8139    0.1485    0.8899    0.7140    0.9949    0.7610    0.1544    0.9431    0.8052    0.2934    0.3014    0.6858    0.7874    0.7229    0.9528
    0.2361    0.6919    0.2924    0.5637    0.1011    0.0381    0.3046    0.4321    0.2706    0.9510    0.6782    0.1797    0.9313    0.6921    0.2593    0.9228    0.8298    0.9948
    0.9111    0.8507    0.8993    0.2347    0.2011    0.8509    0.6860    0.4809    0.7143    0.5698    0.6604    0.5687    0.8197    0.8609    0.8439    0.3772    0.1328    0.8385
    0.3509    0.2323    0.2379    0.9308    0.5911    0.3410    0.8434    0.3656    0.2713    0.1418    0.6095    0.7457    0.9086    0.1395    0.3453    0.9494    0.2587    0.0088
    0.3134    0.6405    0.2133    0.6634    0.3789    0.6891    0.4150    0.6345    0.8739    0.6589    0.0599    0.3294    0.8263    0.9148    0.9762    0.6857    0.2569    0.6316
  Columns 19 through 31
    0.8879    0.9345    0.1042    0.6832    0.3236    0.6281    0.9984    0.4885    0.9548    0.6040    0.0457    0.6696    0.2132
    0.5941    0.3019    0.5208    0.5237    0.4337    0.2236    0.7818    0.2278    0.4154    0.0931    0.9749    0.3032    0.3566
    0.7178    0.2134    0.2635    0.3125    0.9260    0.6741    0.6713    0.6337    0.0771    0.7141    0.8272    0.6493    0.0761
    0.1354    0.6295    0.0368    0.3487    0.8600    0.3651    0.1600    0.2800    0.0557    0.3963    0.5267    0.1156    0.4217
    0.3538    0.0969    0.8557    0.9881    0.8558    0.2145    0.8945    0.4767    0.1730    0.0303    0.7228    0.4689    0.1463
    0.7149    0.0021    0.3806    0.4231    0.3380    0.0049    0.3643    0.5080    0.7519    0.9627    0.9779    0.6296    0.3157
    0.7360    0.0637    0.9878    0.0726    0.5481    0.3348    0.3130    0.6760    0.6201    0.7845    0.4610    0.1913    0.3020
    0.7800    0.6866    0.3556    0.6867    0.8253    0.4901    0.8228    0.3320    0.5700    0.0769    0.5974    0.5506    0.9041
%% Read binary file
function data = ReadBinFile(filename)
fh = fopen(filename, "rb");
data = fread(fh, Inf, "*single");
fclose(fh); % release the file handle once the data has been read
data = reshape(data, [], 31);
end
%% Make some random data
function makeSomeFiles(dirname)
for ii = 1:10
fh = fopen(fullfile(dirname, sprintf("file_%03d.bin", ii)), "wb");
assert(fh > 0);
fwrite(fh, rand(100,31,"single"),"single");
fclose(fh);
end
end
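Once the tall array is numeric, ordinary indexing and reductions apply to it. The sketch below is an illustration under assumptions, not part of the original answer: the folder and the helper readSnapshots are invented, the demo files follow the same 100-by-31 layout as above, and mapreducer(0) just keeps evaluation in the local session.

```matlab
mapreducer(0);   % evaluate tall expressions in the local session

% Hypothetical demo folder with three files of 100-by-31 singles.
dirname = fullfile(tempdir, "tall_usage_demo");
if ~isfolder(dirname), mkdir(dirname); end
for ii = 1:3
    fh = fopen(fullfile(dirname, sprintf("file_%03d.bin", ii)), "w");
    fwrite(fh, rand(100, 31, "single"), "single");
    fclose(fh);
end

ds = fileDatastore(dirname, "ReadFcn", @readSnapshots, ...
    "FileExtensions", ".bin", "UniformRead", true);
t = tall(ds);

% Column 6 of the tall array plays the role of data(6,:) in the question.
source6 = t(:, 6);

% These are deferred; the files are only read when gather is called,
% and a single gather call evaluates all three results together.
[mn, mx, avg] = gather(min(source6), max(source6), mean(source6));

function data = readSnapshots(filename)
    % Assumed layout of the demo files: 100-by-31, written column-major.
    fh = fopen(filename, "r");
    data = reshape(fread(fh, Inf, "*single"), [], 31);
    fclose(fh);
end
```

Requesting several results in one gather call lets MATLAB share the pass over the files instead of rereading them for each quantity.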

9 Comments

@Edric Ellis, that is most useful, indeed. I think the doc ought to be upgraded to mention it; I never saw it although I will admit I didn't read every line of every page, I did do quite a lot of looking...
So the final file of my dataset will be some arbitrarily smaller size depending on exactly when the software is stopped, which means that I get an error trying to run with UniformRead=true. However, if I just add a bit to my ReadBinFile that detects when we've reached the smaller file and pads it with zeros to make it the same size, everything works as I'd hope it would!
@Aaron Coville-- I can't find what I saw initially in the datastore doc page that I linked in my initial response, but it said the output would be either a table/timetable or a cell array, and it didn't mention the option of setting UniformRead to return the underlying data class/array. My comment was intended to have that other option mentioned there...but now, as is so often the case given the huge amount of doc, I can't find what I was looking at at the time...but it was what I came across first then.
If you/Mathworks can locate that piece of doc (unless it has since been modified so the particular verbiage no longer exists, although that would have had to have happened since yesterday), I would recommend this be added there as well. Sorry I can't point directly at it now, but I was looking at it when I wrote to @Aaron Coville about his initial Q?.
@dpb I'm guessing you saw the language on the tall - Create tall array - MATLAB page. Specifically the first description for inputs of type datastore.
@Edric Ellis To amend my previous comment after checking the link you posted: it turns out that I needed to add all those extra zeros because I was concatenating horizontally. Once I changed my data to be columns instead of rows, and thus concatenated vertically, I didn't have to do that. Thanks!
@Aaron Coville, @Edric Ellis -- Indeed, that is the doc page I saw, and from which I drew the wrong conclusion that it would not be possible to retrieve anything other than the cell array. That's where I believe there needs to be an additional comment on the cell array about there being the facility within the datastore creation to return the underlying class as an array, to keep other novices in the area(*) from drawing the same wrong conclusion -- because that is certainly what that doc page says, unequivocally.
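A sketch of the change described in the comment above, under assumptions: the folder, file sizes, and the helper name readSnapshotsAsRows are invented for the demo. Each file is returned as an N-by-31 array, so UniformRead=true stacks files vertically and a shorter final file needs no padding.

```matlab
mapreducer(0);   % evaluate tall expressions in the local session

% Hypothetical demo folder; the last file is deliberately shorter.
dirname = fullfile(tempdir, "tall_rows_demo");
if ~isfolder(dirname), mkdir(dirname); end
counts = [100 100 37];
for ii = 1:numel(counts)
    fh = fopen(fullfile(dirname, sprintf("file_%03d.bin", ii)), "w", "ieee-le");
    % One snapshot per column, so the bytes on disk are snapshot after snapshot.
    fwrite(fh, rand(31, counts(ii), "single"), "single");
    fclose(fh);
end

ds = fileDatastore(dirname, "ReadFcn", @readSnapshotsAsRows, ...
    "FileExtensions", ".bin", "UniformRead", true);
t = tall(ds);

% Gathering everything is fine for this small demo only.
allRows = gather(t);

function data = readSnapshotsAsRows(filename)
    fh = fopen(filename, "r", "ieee-le");
    raw = fread(fh, Inf, "*single");
    fclose(fh);
    % 31-by-N as stored, then transposed so each snapshot becomes one row.
    data = reshape(raw, 31, []).';
end
```

With this orientation, column 1 is the timestamp and columns 2 through 31 are the sources, so t(:,6) replaces the data(6,:) indexing used in the original in-memory code.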
(*) I've used MATLAB since R3, but retired from the consulting gig during which the tall array could have been quite useful for similar types of data quite a number of years before it was introduced so haven't ever had the call since...
@dpb Ah, I see. Yes, I agree that the first section of the "Description" does definitely imply that tall(ds) is either a tall table, tall timetable, or tall cell. I'll report this to our doc team. Thanks for pointing out exactly where the problem is!
Thanks. I knew I was looking directly at the statement when I wrote the original response! <g>.
I stopped looking after seeing that, figuring it was useless; hence I didn't see (or go looking for) the 'UniformRead' named parameter...although it did seem like a strange restriction.


Release: R2022a

Asked: 29 Jul 2025

Edited: dpb on 31 Jul 2025
