Compile data from excel sheets into a dataset with multiple tags

Question

0 votes

FS SM+WP Dataset.xlsx

***** The title is edited to better describe the problem*****

Hello everyone,

The purpose of my data analysis is to compile all the data from excel sheets and plot them.

My approach is to read data from all the sheets in the format of table, and dump them in a cell array as a "container", and then retrive them to another table array while keeping their variable names. I want to preallocate a space to my matrix for a higher efficiency. Since each data matrix has different number of rows, I need to read the maximum row number in the dataset and create a table of that height to hold all the data.

Another special thing is that in each container I have 2 columns of data, and I need to transfer them to the new table column by column, while doing a little modification of the the Variable Names.

The attached code is the best I can do. I dumped them in a "cell" container as tables, and then transferred the data. However the tables are dumped in a matrix instead of a table, hence all the headers are lost. Is there a way I can keep them during the transfer, or can I add a header to a matrix somehow and convert them into a table?

Below is the code and attached is the original file.

% Use pop up box to compile data.
[filename, filepath ]= uigetfile(".xlsx","MultiSelect","on"); % note filename is a cell struture, you cannot use dot reference for this structure.
cd (filepath);
% index the sheets
sheetnumber = numel(sheetnames(filename)) % count the number of sheets in workbook, 
% assign enough columns to hold the data. (2 columns per sheet)
% Loop all the sheets and read data in a cell array  
for i = 1:sheetnumber
% set import options: Fill level start point
    opts_startpoint = spreadsheetImportOptions;
    opts_startpoint.Sheet = i;
    opts_startpoint.VariableNames = 'Fill level (start';
    opts_startpoint.DataRange = 'B2';
    opts_startpoint.VariableTypes = 'single';
    % preview(filename,opts_startpoint) % preview to QC the output
    % read data into variable: startpoint
    startpoint = readtable(filename,opts_startpoint,"ReadVariableNames",false); % how to surpress the "header warning"?
    % set import options: PPR
    opts_PPR = spreadsheetImportOptions;
    opts_PPR.Sheet = i;
    opts_PPR.VariableNames = 'PPR';
    opts_PPR.DataRange = 'F2';
    opts_PPR.VariableTypes = 'single';
    % preview(filename,opts_PPR) % preview to QC the output
    % read data into variable: PPR
    PPR = readtable(filename,opts_PPR,"ReadRowNames",false);
    % concatenate the tables into 1 block
    table = cat(2,startpoint,PPR);
    datapool{i} = table;
    % to retrieve the data in the cell struture, you need to use {} brackets.
    % The result is the content of the indexed cell.
  end
% determine maximum row numbers
n1 = max(cellfun(@numel, datapool)) % n1 is total elements in the table, which is twices as the needed number. 
n1 = n1/2 % actual number of rows.
% reintialize tData and refill it with valid data
%% here is where I have the issues.
tData = nan(n1,sheetnumber*2) % create the matrix with the right size
for i = 1:sheetnumber
    j = i*2-1;
        t1 = table2array(datapool{i}(:,1));
        T1 = [t1;nan(n1-height(t1),1)]; % fill emptycells with NaNs in the end, 
        % so all columns has the same number of rows.     
        tData (:,j) =T1; % put data in the target location.
        t2 = table2array(datapool{i}(:,2));
        T2 = [t2;nan(n1-height(t2),1)]; % fill emptycells with NaNs in the end, 
        % so all columns has the same number of rows.     
        tData (:,j+1) =T2; % put data in the target location.
end
%set color order so lines become light to dark
% Red newcolor = {'#dfcccf','#d5bbbf','#caabb0','#bf9ba1','#b48b92','#a97b83','#9f6b75','#935c67','#884d5a','#7d3e4c','#722e3f','#661e33'}
% Green newcolor = {'#cdddcd','#bcd1bd','#acc6ad','#9cbb9d','#8cb08e','#7ba57e',% '#6b9a6f','#5b8f61','#4b8452',% '#3a7944','#276f36','#0e6428'}
% Blue newcolor = {'#ced2dd','#bdc3d1','#adb5c6','#9ea7bb','#8e99b0','#7e8ba5','#6f7e9a','#607190','#506485','#41577b','#304b70','#304b70'}
%newcolor = {'#ff1111','#6db130','#005dae','#ff4444','#8ed053','#0389ff','#ff7777','#b0de87','#57b1ff','#ffaaaa','#d2ecba','#abd8ff'};
%colororder(newcolor)
%plot the dataset
hold on
for i = 1:sheetnumber
    j = 2*i-1
plot(tData(:,j),tData(:,j+1),'LineWidth',2) % x = Start fill level, y = PPR
end

PS: During the coding, I realize that I am struggling with the data types in Matlab. Often I wanted to use a function or to refer to the "data", I ran into some syntax issues. Is there some resource that builds a solid understanding to the data structures?

Thanks a lot,

Charlie

2 Comments
Show None Hide None

dpb on 17 Aug 2022

Open in MATLAB Online

Going through the cell arrays and all is a lot of extra work that is almost certainly totally unneeded --

Can you just back up and explain what you want to do with the data on the various sheets?

Do you want to just group it all together as one dataset or do you need to keep the ID as the sheet name or some other ID? Is there metadata buried in the sheet names that should be/needs be kept and parsed and turned into additional variables?

As a barebones starting point, since all the variables appear to be the same and the dataset sizes are pretty small, I'd just loop through the set and append each sheet to the table as I went; it's not most efficient as opposed to preallocating, but since only doing once and not huge, the time won't be bad.

Something like (caution aircode, not tested...)

fn=...                                                          % fully-qualified filename
sheets=sheetnames(fn);                                          % get sheets in workbook
opt=detectImportOptions(fn,'Sheet',1,'Range','A:G');            % build base options object
opt.SelectedVariableNames=opt.SelectedVariableNames([2,3,5,6]); % pick variables of interest
opt.VariableNames([2,3,5,6])={'StartLevel','FinishLevel','Dose','PPR'}; % name as desired/convenient
tPPR=[]; test=[];                                               % empty placeholders to accumulate into    
for i=1:numel(sheets)                                           % iterate over sheets
  tmp=readtable(fn,opt,'Sheet',sheets(1));                      % read into temporary
    test=[test;repmat(sheets(i),height(tmp),1)];                % build sheetnames array to match sheet height
    tPPR=[tPPR;tmp];                                            % add to output table
end
tPPR=addvars(tPPR,categorical(test), ...                        % add the sheet as set ID
    'NewVariableNames',{'DataSet'},'Before',{'StartLevel'});

Now you've got a table that contains all the original data including the ID by sheet that can be used as a grouping variable to easily segregate back out again when/if needed to treat each set independently. For example, it might be interesting to know

tStats=groupsummary(tPPR,{'DataSet'},{'mean','min','max','var'},{'StartLevel','FinishLevel'});

"Salt to suit!"; more specific code examples to match wants/needs could be added if had more explicit description of what analyses are actually needed.

See table and the links there to the page that describes all the possible indexing methods into tables -- they're myriad, but give ability to get anything desired by name, variable type, index, ...

Charlie Chen on 17 Aug 2022

Thank you dpb for the response. Your insight is much appreciated.

The idea of your example is very clean and efficient. I definetly will incorprate it into my code.

I think all the hussle I had with the code is that all the manipulation can be easily done with matrices, but I HAD to do it with tables because it can retain some context information in variable names.

As background information, the test is to see the consistence of a particular instrument in different conditions. On the sheet name, there are 4 pieces of information(meta data?) : Test Date "0809", Test order "-1", Powder type "SM", Valve type "FL". On each sheet, 2 data are needed, Fill level (input) and PPR (output).

After thinking about your questions, I think a better way to do this is to design a data structure to present all the data together. Something looking like:

When I plot the data, I would group all the data with the same valve type, and sort it by the date and order.

Do you think it's feasible? If yes, what data structure would you suggest to use?

Best regards,

Charlie

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

dpb on 17 Aug 2022

Edited: dpb on 17 Aug 2022

0 votes

Sure it's possible. Just parse the sheet names when reading them and do similarly as I did excepting with Date, Valve, Powder variables. I figured that was likely to be the case with the format of the sheet names. It's ok for spreadsheets, but definitely not the way to build a dataset for analysis.

Definitely put the metadata data in the table as variables even though the same values are repeated; then they're available for grouping variables for the rowfun kind of analysis by whatever (combination of) variables are needed from one-at-a-time to in aggregate total depending upon what you need. Or they're available as factors or levels for ANOVA or such if it comes to testing some hypothesis of equality of means or the like.

The results from addressing a table variable with the "dot" notation are arrays so matrix operations still come naturally; there's no need to do any other machinations to build corollary array variables to use the table -- there are, almost certainly, already all the builtin tools to handle table data that you'll need for anything you have in mind or realize is a new idea when get into it.

ADDENDUM

" it can retain some context information in variable names..."

I didn't see it here, but same thing goes for variable names as for the sheet names -- if there's a variable encoded in a variable name by "X1, X2" where the "1,2" stands for a different test level or controlled variable or whatever, then turn it into just "X" and and the other data as a variable (maybe categorical).

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Answer 2

Peter Perkins on 23 Aug 2022

Open in MATLAB Online

0 votes

Charlie, I'm confused.

I haven't followed all of what you code does, but it seems that you are lining up tables of different heights horizontally, and then NaN padding to make up the different heights. Unless you have a good reason, that seems funny. I would recommend that you stack them up vertically with an indicator variable showing which sheet they came from, like this:

fname = "FS SM+WP Dataset.xlsx";
nsheets = numel(sheetnames(fname));
for i = 1:nsheets
    opts = detectImportOptions(fname,"Sheet",i,"Range","B:F","VariableNamingRule","preserve","TextType","string");
    t_i = readtable(fname,opts);
    t_i.Sheet = repmat(i,height(t_i),1);
    c{i} = t_i;
end
t = vertcat(c{:})

which gives me

t =
949×6 table
Fill level (start    Fill level (finish)                              Note                               Powder dosed(g)     PPR      Sheet
_________________    ___________________    _________________________________________________________    _______________    ______    _____
     3042.1                3011.1           <missing>                                                            31          3.875      1  
     3011.1                2982.7           <missing>                                                          28.4           3.55      1  
     2982.7                2954.9           <missing>                                                          27.8          3.475      1  
     2954.9                  2922           <missing>                                                          32.9         4.1125      1  
       2922                2887.3           <missing>                                                          34.7         4.3375      1  
     2887.3                  2848           <missing>                                                          39.3         4.9125      1  
       2848                2803.2           <missing>                                                          44.8            5.6      1  
     2803.2                2757.6           <missing>                                                          45.6            5.7      1  
:                     :                                         :                                       :             :         :  
     3944.1                3813.4           <missing>                                                         130.7         3.2675      7  
     3813.4                3681.8           "Motor stopped working after that, due to a cable issue."         131.6           3.29      7  
        NaN                   NaN           <missing>                                                           NaN            NaN      7  
        NaN                   NaN           <missing>                                                           NaN            NaN      7  
        NaN                   NaN           <missing>                                                           NaN            NaN      7  
        NaN                   NaN           <missing>                                                           NaN            NaN      7  
        NaN                   NaN           <missing>                                                           NaN            NaN      7  
        NaN                    40           <missing>                                                           NaN            NaN      7  
     Display all 949 rows.

From there, you can do grouped operations on each sheet, including plotting:

     hold on
     plotEm = @(startFillLevel,PPR) plot(startFillLevel,PPR,'LineWidth',2);
     rowfun(plotEm,t,"InputVariables",["Fill level (start" "PPR"],"GroupingVariable","Sheet");
     hold off

I may not have captured all of what you are doing, but this comes close with much less code.

3 Comments
Show 1 older comment Hide 1 older comment

Charlie Chen on 23 Aug 2022

Hey Peter,

Thanks for pitching in your 2 cents. Great code, and more importantly the insight on the apporach orgainzing the data.

This less-than-50 line code has taken me roughly a week, with some painful research on different data types. I guess the better way is to improve the angle of attack.

I have managed to create a "database" style compiled data but now encountering chanllenges to filter them and plot them in groups. I see you have already did what I want with a much simpler way. Seems like a missing puzzle for making it possible is the "rowfun".

I accept before doing things elegantly I will make some funny mistakes. I will look into your logic and incorprate it in my project.

Best regards,

Charlie

Peter Perkins on 24 Aug 2022

Glad this was helpful. This example talks more about grouped calculations in tables and timetables. It's not exactly the same as what you have, but it might be helpful.

Sign in to comment.

Compile data from excel sheets into a dataset with multiple tags

2 Comments
Show None Hide None

Accepted Answer

0 Comments
Show -2 older comments Hide -2 older comments

More Answers (1)

3 Comments
Show 1 older comment Hide 1 older comment

Categories

Products

Release

Tags

Community Treasure Hunt

Compile data from excel sheets into a dataset with multiple tags

2 Comments Show None Hide None

Accepted Answer

0 Comments Show -2 older comments Hide -2 older comments

More Answers (1)

3 Comments Show 1 older comment Hide 1 older comment

Categories

Products

Release

Tags

See Also

Community Treasure Hunt

2 Comments
Show None Hide None

0 Comments
Show -2 older comments Hide -2 older comments

3 Comments
Show 1 older comment Hide 1 older comment