Compile data from excel sheets into a dataset with multiple tags
Show older comments
***** The title is edited to better describe the problem*****
Hello everyone,
The purpose of my data analysis is to compile all the data from excel sheets and plot them.
My approach is to read data from all the sheets in the format of table, and dump them in a cell array as a "container", and then retrive them to another table array while keeping their variable names. I want to preallocate a space to my matrix for a higher efficiency. Since each data matrix has different number of rows, I need to read the maximum row number in the dataset and create a table of that height to hold all the data.
Another special thing is that in each container I have 2 columns of data, and I need to transfer them to the new table column by column, while doing a little modification of the the Variable Names.
The attached code is the best I can do. I dumped them in a "cell" container as tables, and then transferred the data. However the tables are dumped in a matrix instead of a table, hence all the headers are lost. Is there a way I can keep them during the transfer, or can I add a header to a matrix somehow and convert them into a table?
Below is the code and attached is the original file.
% Use pop up box to compile data.
[filename, filepath ]= uigetfile(".xlsx","MultiSelect","on"); % note filename is a cell struture, you cannot use dot reference for this structure.
cd (filepath);
% index the sheets
sheetnumber = numel(sheetnames(filename)) % count the number of sheets in workbook,
% assign enough columns to hold the data. (2 columns per sheet)
% Loop all the sheets and read data in a cell array
for i = 1:sheetnumber
% set import options: Fill level start point
opts_startpoint = spreadsheetImportOptions;
opts_startpoint.Sheet = i;
opts_startpoint.VariableNames = 'Fill level (start';
opts_startpoint.DataRange = 'B2';
opts_startpoint.VariableTypes = 'single';
% preview(filename,opts_startpoint) % preview to QC the output
% read data into variable: startpoint
startpoint = readtable(filename,opts_startpoint,"ReadVariableNames",false); % how to surpress the "header warning"?
% set import options: PPR
opts_PPR = spreadsheetImportOptions;
opts_PPR.Sheet = i;
opts_PPR.VariableNames = 'PPR';
opts_PPR.DataRange = 'F2';
opts_PPR.VariableTypes = 'single';
% preview(filename,opts_PPR) % preview to QC the output
% read data into variable: PPR
PPR = readtable(filename,opts_PPR,"ReadRowNames",false);
% concatenate the tables into 1 block
table = cat(2,startpoint,PPR);
datapool{i} = table;
% to retrieve the data in the cell struture, you need to use {} brackets.
% The result is the content of the indexed cell.
end
% determine maximum row numbers
n1 = max(cellfun(@numel, datapool)) % n1 is total elements in the table, which is twices as the needed number.
n1 = n1/2 % actual number of rows.
% reintialize tData and refill it with valid data
%% here is where I have the issues.
tData = nan(n1,sheetnumber*2) % create the matrix with the right size
for i = 1:sheetnumber
j = i*2-1;
t1 = table2array(datapool{i}(:,1));
T1 = [t1;nan(n1-height(t1),1)]; % fill emptycells with NaNs in the end,
% so all columns has the same number of rows.
tData (:,j) =T1; % put data in the target location.
t2 = table2array(datapool{i}(:,2));
T2 = [t2;nan(n1-height(t2),1)]; % fill emptycells with NaNs in the end,
% so all columns has the same number of rows.
tData (:,j+1) =T2; % put data in the target location.
end
%set color order so lines become light to dark
% Red newcolor = {'#dfcccf','#d5bbbf','#caabb0','#bf9ba1','#b48b92','#a97b83','#9f6b75','#935c67','#884d5a','#7d3e4c','#722e3f','#661e33'}
% Green newcolor = {'#cdddcd','#bcd1bd','#acc6ad','#9cbb9d','#8cb08e','#7ba57e',% '#6b9a6f','#5b8f61','#4b8452',% '#3a7944','#276f36','#0e6428'}
% Blue newcolor = {'#ced2dd','#bdc3d1','#adb5c6','#9ea7bb','#8e99b0','#7e8ba5','#6f7e9a','#607190','#506485','#41577b','#304b70','#304b70'}
%newcolor = {'#ff1111','#6db130','#005dae','#ff4444','#8ed053','#0389ff','#ff7777','#b0de87','#57b1ff','#ffaaaa','#d2ecba','#abd8ff'};
%colororder(newcolor)
%plot the dataset
hold on
for i = 1:sheetnumber
j = 2*i-1
plot(tData(:,j),tData(:,j+1),'LineWidth',2) % x = Start fill level, y = PPR
end
PS: During the coding, I realize that I am struggling with the data types in Matlab. Often I wanted to use a function or to refer to the "data", I ran into some syntax issues. Is there some resource that builds a solid understanding to the data structures?
Thanks a lot,
Charlie
2 Comments
dpb
on 17 Aug 2022
Going through the cell arrays and all is a lot of extra work that is almost certainly totally unneeded --
Can you just back up and explain what you want to do with the data on the various sheets?
Do you want to just group it all together as one dataset or do you need to keep the ID as the sheet name or some other ID? Is there metadata buried in the sheet names that should be/needs be kept and parsed and turned into additional variables?
As a barebones starting point, since all the variables appear to be the same and the dataset sizes are pretty small, I'd just loop through the set and append each sheet to the table as I went; it's not most efficient as opposed to preallocating, but since only doing once and not huge, the time won't be bad.
Something like (caution aircode, not tested...)
fn=... % fully-qualified filename
sheets=sheetnames(fn); % get sheets in workbook
opt=detectImportOptions(fn,'Sheet',1,'Range','A:G'); % build base options object
opt.SelectedVariableNames=opt.SelectedVariableNames([2,3,5,6]); % pick variables of interest
opt.VariableNames([2,3,5,6])={'StartLevel','FinishLevel','Dose','PPR'}; % name as desired/convenient
tPPR=[]; test=[]; % empty placeholders to accumulate into
for i=1:numel(sheets) % iterate over sheets
tmp=readtable(fn,opt,'Sheet',sheets(1)); % read into temporary
test=[test;repmat(sheets(i),height(tmp),1)]; % build sheetnames array to match sheet height
tPPR=[tPPR;tmp]; % add to output table
end
tPPR=addvars(tPPR,categorical(test), ... % add the sheet as set ID
'NewVariableNames',{'DataSet'},'Before',{'StartLevel'});
Now you've got a table that contains all the original data including the ID by sheet that can be used as a grouping variable to easily segregate back out again when/if needed to treat each set independently. For example, it might be interesting to know
tStats=groupsummary(tPPR,{'DataSet'},{'mean','min','max','var'},{'StartLevel','FinishLevel'});
"Salt to suit!"; more specific code examples to match wants/needs could be added if had more explicit description of what analyses are actually needed.
See table and the links there to the page that describes all the possible indexing methods into tables -- they're myriad, but give ability to get anything desired by name, variable type, index, ...
Charlie Chen
on 17 Aug 2022
Accepted Answer
More Answers (1)
Peter Perkins
on 23 Aug 2022
Charlie, I'm confused.
I haven't followed all of what you code does, but it seems that you are lining up tables of different heights horizontally, and then NaN padding to make up the different heights. Unless you have a good reason, that seems funny. I would recommend that you stack them up vertically with an indicator variable showing which sheet they came from, like this:
fname = "FS SM+WP Dataset.xlsx";
nsheets = numel(sheetnames(fname));
for i = 1:nsheets
opts = detectImportOptions(fname,"Sheet",i,"Range","B:F","VariableNamingRule","preserve","TextType","string");
t_i = readtable(fname,opts);
t_i.Sheet = repmat(i,height(t_i),1);
c{i} = t_i;
end
t = vertcat(c{:})
which gives me
t =
949×6 table
Fill level (start Fill level (finish) Note Powder dosed(g) PPR Sheet
_________________ ___________________ _________________________________________________________ _______________ ______ _____
3042.1 3011.1 <missing> 31 3.875 1
3011.1 2982.7 <missing> 28.4 3.55 1
2982.7 2954.9 <missing> 27.8 3.475 1
2954.9 2922 <missing> 32.9 4.1125 1
2922 2887.3 <missing> 34.7 4.3375 1
2887.3 2848 <missing> 39.3 4.9125 1
2848 2803.2 <missing> 44.8 5.6 1
2803.2 2757.6 <missing> 45.6 5.7 1
: : : : : :
3944.1 3813.4 <missing> 130.7 3.2675 7
3813.4 3681.8 "Motor stopped working after that, due to a cable issue." 131.6 3.29 7
NaN NaN <missing> NaN NaN 7
NaN NaN <missing> NaN NaN 7
NaN NaN <missing> NaN NaN 7
NaN NaN <missing> NaN NaN 7
NaN NaN <missing> NaN NaN 7
NaN 40 <missing> NaN NaN 7
Display all 949 rows.
From there, you can do grouped operations on each sheet, including plotting:
hold on
plotEm = @(startFillLevel,PPR) plot(startFillLevel,PPR,'LineWidth',2);
rowfun(plotEm,t,"InputVariables",["Fill level (start" "PPR"],"GroupingVariable","Sheet");
hold off

I may not have captured all of what you are doing, but this comes close with much less code.
3 Comments
Peter Perkins
on 23 Aug 2022
Edited: Peter Perkins
on 23 Aug 2022
I see now that dpb had already done something similar. (Better, actually, he stored the sheet names as categorical!) The main points to understand here is that:
1) lining up horizontally probably isn't the best way to organize the combined data, and
2) there's no reason to use table2array. I think maybe you ended up doing that because to horzcat the tables ytou would have needed to make their var names unique across tables. But the cure was worse than the disease, and in any case, horzcat isn't the way to go.
Charlie Chen
on 23 Aug 2022
Peter Perkins
on 24 Aug 2022
Glad this was helpful. This example talks more about grouped calculations in tables and timetables. It's not exactly the same as what you have, but it might be helpful.
Categories
Find more on Data Preprocessing in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!