Statistics of datastore of tabular data

Hey all,
I have thousands of Parquet files. Each file has more than 50,000 rows of numerical data and more than 100 columns. The data can't fit in memory, so I use datastores to import and handle it for a downstream machine learning workflow. I would like to know whether it is possible to calculate some statistics (max, min, mean, std for each channel) for each file during the datastore creation process, so that I can use them afterwards to filter and select the relevant segments of data for my downstream analysis.
Thanks in advance

Accepted Answer

Abhas on 26 Mar 2024
Hi Omar,
To calculate statistics (max, min, mean, std for each channel) during the datastore creation process in MATLAB and use them for filtering and selecting relevant data segments for downstream analysis, you can follow these steps:
  1. Create a Datastore: Initialize a 'parquetDatastore' for your Parquet files.
  2. Define Custom Function: Create a function to compute the desired statistics for each chunk of data.
  3. Apply Transformation: Use the 'transform' function to apply your custom statistics calculation to the datastore.
  4. Read and Aggregate Statistics: Iterate over the datastore to read the statistics of each chunk and aggregate them globally.
  5. Use Statistics for Filtering: Leverage the aggregated statistics to filter and select relevant data segments.
Here's the MATLAB code to reflect the above steps:
% Step 1: Create your datastore
% 'ReadSize','file' makes each read return one whole file, so the statistics are per file
ds = parquetDatastore('path/to/your/parquet/files/*.parquet', 'ReadSize', 'file');

% Step 3: Apply the transformation (calculateStats is defined at the end of this script)
statsDs = transform(ds, @calculateStats);

% Step 4: Read and aggregate the statistics
globalMin = inf; % Initialize for min; use -inf for max
while hasdata(statsDs)
    statsChunk = read(statsDs); % one-row table of statistics for the current chunk
    isMinVar = endsWith(statsChunk.Properties.VariableNames, '_min');
    chunkMin = min(table2array(statsChunk(:, isMinVar)), [], 'all');
    globalMin = min(globalMin, chunkMin);
    % Update the global max the same way. A global mean or std cannot be
    % updated like this: combining them also needs the row count of each chunk.
end

% Step 2: Define your custom function
% In R2023b, local functions in a script must appear at the end of the file.
% Alternatively, save this function in its own file, calculateStats.m.
function statsTable = calculateStats(tbl)
    % varfun prefixes each variable name with the function name (e.g. min_Var1);
    % the suffix added below lets the statistics be selected with endsWith.
    statsTable = varfun(@min, tbl, 'OutputFormat', 'table');
    statsTable.Properties.VariableNames = strcat(statsTable.Properties.VariableNames, '_min');
    maxTable = varfun(@max, tbl, 'OutputFormat', 'table');
    maxTable.Properties.VariableNames = strcat(maxTable.Properties.VariableNames, '_max');
    meanTable = varfun(@mean, tbl, 'OutputFormat', 'table');
    meanTable.Properties.VariableNames = strcat(meanTable.Properties.VariableNames, '_mean');
    stdTable = varfun(@std, tbl, 'OutputFormat', 'table');
    stdTable.Properties.VariableNames = strcat(stdTable.Properties.VariableNames, '_std');
    % One row holding every statistic for every channel of this chunk
    statsTable = [statsTable, maxTable, meanTable, stdTable];
end
At this point, you have the aggregated statistics (e.g., globalMin) which you can use to filter and select relevant segments of your data for further analysis.
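If the goal is to select whole files rather than to compute one global value, a per-file table of statistics may be more convenient than a running aggregate. The following is only a minimal sketch under stated assumptions: the demo files, the channel names ch1 and ch2, and the threshold of 7 are hypothetical, and it relies on 'ReadSize','file' so that each row returned by readall corresponds to one file, in the same order as ds.Files.

```matlab
% Hypothetical demo data: three small Parquet files in a temporary folder
demoDir = tempname;
mkdir(demoDir);
for k = 1:3
    T = table(k*(1:5)', -k*(1:5)', 'VariableNames', {'ch1', 'ch2'});
    parquetwrite(fullfile(demoDir, sprintf('file%d.parquet', k)), T);
end

% One read per file, so each statistics row describes exactly one file
ds = parquetDatastore(fullfile(demoDir, '*.parquet'), 'ReadSize', 'file');

% varfun names its outputs max_ch1 and max_ch2
perFileMax = @(tbl) varfun(@max, tbl);

% readall stacks the one-row tables: one row per file, one column per channel
fileStats = readall(transform(ds, perFileMax));

% Keep only the files whose ch1 maximum exceeds an illustrative threshold
keep = fileStats.max_ch1 > 7;

% Build a new datastore over the selected files for downstream processing
selectedDs = parquetDatastore(ds.Files(keep), 'ReadSize', 'file');
```

The same pattern should work with the full calculateStats function in place of perFileMax; the statistics table stays small (one row per file), so it fits in memory even when the data itself does not.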
You may refer to the following documentation for a better understanding of how datastores and transform work in MATLAB:
  1. parquetDatastore: https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.parquetdatastore.html?s_tid=doc_ta
  2. transform: https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.transform.html?s_tid=doc_ta
  1 Comment
Omar Kamel on 28 Mar 2024
Hi Abhas, thanks a lot for the elaborate answer. This is exactly what I was looking for.

Release

R2023b
