Merging specific Excel files according to a string in the filename

After searching through the forums I've found answers relating to how to merge Excel files together, but none that relate specifically to what I wish to do. I have thousands of Excel files, from a sample set of 120 samples, all in one folder. There are thus 120 sets of files which are separated by time and within each set they all have the same particular string in their filename which allows me to identify them, based on the time the measurements in each file start and end. I wish to merge each set of files together into one file but as the number of files in each set is variable, I wonder if there is a way to count the number of occurrences and only merge those occurrences together. hist and unique doesn't work because I have both alphabetical and numerical characters for the identifiers.
What I've done so far is first load all of the files, tabulate the filenames and split up the filenames so that the unique identifiers are isolated. I adapted a
Files=dir('\\fscbeng\Project\Lab_Measurements\Results\20190424-1\*.csv'); % Looks out for Excel spreadsheets specifically
num_files=1:length(Files);
Cell_logs = cell(1,num_files(end));
for r = num_files
filename=([Files(r).folder '\' Files(r).name]);
fileID = fopen(filename,'r');
Cell_logs(r) = textscan(fileID, '%s', 'delimiter', '\r'); % Stores each Excel file in a separate cell
fclose(fileID);
end
Filenames = table({Files.name}.');
for i = 1:size(Filenames,1)
x = cell2mat(Filenames{i,1});
Filenames{i,1} = {x(1:7)};
Filenames{i,2} = {x(8:23)};
Filenames{i,3} = {x(24:25)};
end
Filenames.Properties.VariableNames = {'Sensor_type' 'UID' 'Sensor_position'};
Sensor_position contains the identifier as follows:
Filenames.Sensor_position
ans =
1320×1 cell array
'80'
'80'
'80'
'80'
'80'
'80'
'80'
'80'
'80'
'80'
'80'
'81'
'81'
'81'
'81'
'81'
...
So there are 11 files for each sensor position, so one way I've thought of (but don't know how to fully execute as I haven't done this before) is a for loop with a logical expression where:
v=Filenames.Sensor_position;
for i = 1:length(v)-1
if isequal(v(i), v(i+1))
... % command to continue until the values differ
else
... % the consecutive values have been evaluated to be different so the column of values would be cut here and the files merged up until the particular row number
end
end
Idea being that when the for loop encounters the next value in the column which differs from the previous value it would split it off into a new variable (if I include an else condition) or else earmark each section of the Filenames.Sensor_position so that the file merger only happens for those earmarked files, if that makes any sense. The other option I've thought of is a find command which would process only those files with a particular string (e.g. '80', '81' etc.) but it would require me to specify each sensor position in turn (there are 120 of them) which seems quite clunky to me.

6 Comments

I'd recommend you use fullfile to build paths rather than string concatenation with a hardcoded separator.
Your excel files are not excel files if you use textscan to read them. They must be some text files (possibly .csv that Windows by default associate with excel)
The way you construct your table is very awkward, you could use:
FileNames = char({Files.Names}); %convert file names into a 2D char array
FileNames = FileNames(:, 1:25); %only keep the first 25 chars of each name
FileNames = cell2table(mat2cell(FileNames, size(FileNames, 1), [7, 16, 2]), 'VariableNames', {'Sensor_type', 'UID', 'Sensor_position'});
After that I'm a bit unclear on how you want to group your files. Should the content of all files with the same Sensor_type and UID be merged vertically by Sensor_Position order? Or something else?
You probably don't need a loop for that.
Yes apologies, they are .csv files which are associated with Excel.
I used that method as I wanted to isolate the sensor position variable without knowing about the other commands cell2table and mat2cell, but yes your code is cleaner. Have now replaced that, thanks.
EDIT: With regards to the third cell of Filenames as per your code, all the sensor positions have been merged together into one string, is that intentional?
Filenames =
1×3 table
Sensor_type UID Sensor_position
_____________ ______________ _______________
[1320×7 char] [1320×16 char] [1320×2 char]
So the content of all files with the same Sensor_Position should be merged together (but preserving the header). What do you think about using a while loop nested in the for loop, e.g. while isequal(v(i), v(i+1))? I'm not entirely sure about using v(i+1), however, as that only takes into account the value immediately after i, rather than searching for the index of the first value which differs from i.
Oops! That wasn't the intent to have just one row in the table.
FileNames = char({Files.Names}); %convert file names into a 2D char array
FileNames = FileNames(:, 1:25); %only keep the first 25 chars of each name
FileNames = cell2table(mat2cell(FileNames, ones(1, size(FileNames, 1)), [7, 16, 2]), 'VariableNames', {'Sensor_type', 'UID', 'Sensor_position'});
will fix it.
At present, you're reading each file with textscan(...,'%s') so it's unclear which format the files have. Can you attach a sample file (in particular so we can see what form the header takes)?
So you want to group by Sensor_Position. Do the Sensor_type and UID have any impact on the grouping or are they just ignored then?
As said, a loop is most likely not needed. You can just ask matlab to merge all elements of Cell_logs which have the same matching Sensor_Position.
Sample file attached. Sensor_type and UID have no impact, only Sensor_Position.
How would I go about that? I know how to merge them normally but how would I include the condition for matching sensor position irrespective of file number?
So what does preserving the header mean with such files. What would be the result of merging two files similar to your test.csv?
I would probably use rowfun with the 'GroupBy' option. I'll show you how it's done once I understand what merging actually means.
What I do in that case is use something similiar to:
for p=num_files(1,2:end)
for t=Cell_logs{1,p}
StartRow=(find(strcmp('Date/Time Millis Seqence_index MFC-A Set MFC-A Flow MFC-A Gas MFC-A State MFC-A Pressure MFC-A Temp MFC-B Set MFC-B Flow MFC-B Gas MFC-B State MFC-B Pressure MFC-B Temp MFC-C Set MFC-C Flow MFC-C Gas MFC-C State MFC-C Pressure MFC-C Temp MFC-D Set MFC-D Flow MFC-D Gas MFC-D State MFC-D Pressure MFC-D Temp MFC-E Set MFC-E Flow MFC-E Gas MFC-E State MFC-E Pressure MFC-E Temp MFC-F Set MFC-F Flow MFC-F Gas MFC-F State MFC-F Pressure MFC-F Temp MFM-M Flow MFM-M Pressure MFM-M Temp',t)))+1;
end
end
which I used for another script where it preserves the header for the first file and then for subsequent files, where it is no longer needed, the merging instead ignores the header for the subsequent files. So I only need the header for the first file for every sensor. Header in this case meaning everything up to and including the column headers of the data, which is the purple text in the parentheses.
I.e. (Header + data of first file) + data of 2nd file + data of 3rd file etc.

Sign in to comment.

 Accepted Answer

folder = '\\fscbeng\Project\Lab_Measurements\Results\20190424-1';
filelist = dir(fullfile(folder, '*.csv'));
filedata = cell(numel(filelist), 2);
for fileidx = 1:numel(filelist)
filecontent = fileread(fullfile(folder, filelist(fileidx).name)); %read whole file at once
[header, data] = regexp(filecontent, '^.*=+ DATA[^\n\r]+[\n\r]+[^\n\r+]+[\n\r]+', 'match', 'split', 'once'); %identify line with '==== DATA'. Keep everything up to that and the next line as header. After that it's data
filedata{fileidx, 1} = header;
filedata{fileidx, 2} = data{2}; %data{1} is text before the regexp match which will always be empty since we match from the start
end
filenames = char({filelist.name});
filenames = filenames(:, 1:25);
filedata = cell2table([mat2cell(filenames, ones(1, numel(filelist)), [7, 16, 2]), filedata], ...
'VariableNames', {'Sensor_type', 'UID', 'Sensor_position', 'Header', 'Data'});
merged = rowfun(@(header, data) {[header{1}, data{:}]}, filedata, 'InputVariables', {'Header', 'Data'}, 'GroupingVariables', 'Sensor_position', 'OutputVariableNames', 'Content');
The above simply concatenates horizontally the header of the first file with the data of each file with the same Sensor_position. It doesn't add any newline between each, relying on the fact that your data already ends with a newline (at least in the sample file you shared). At the end you get a new table with the unique Sensor_position and the merged data. To actually write it to files you can use a loop (or another rowfun):
destfolder = '\\somewhere\somefolder'
for row = 1:height(merged)
fid = fopen(fullfile(destfolder, sprintf('%s.csv', merged.Sensor_position{row})), 'w'); %no idea what the output file name should be. Just using sensor_position here
fwrite(fid, merged.Content{row});
fclose(fid);
end
Code is completely untested. There may be typos, bugs.
edit: fixed many typos

9 Comments

Hi Guillaume,
Thanks for the quick response, any ideas why the rowfun code is throwing up so many errors? All of the Name-Value pairs apparently aren't recognised, the script appears to think that each one should be terminated with a semicolon and it also doesn't like the function at the start. There is some formatting issue perhaps going on, but I cannot see where it lies (none of the Name-Value pairs are misspelt, all of the brackets are balanced).
EDIT: Removed a comma between (header,data) and [header{1}, data{:}]. Whilst I can now run the script from the start, there is an error still with the inconsistency of the matrix sizes that are being concatenated
As fileread(fullfile(folder, file(fileidx.name))) doesn't work properly replaced it with fileread(fullfile(folder, filelist(fileidx).name)) so it can find the files properly.
Yes, there was a comma that sneaked in there. There may still be more typos to find.
As I'm having trouble concatenating the Header with the Data, despite the func [header{1}, data{:}] working fine, could it be due to the fact that it is only one header but multiple Data rows which the script is stumbling over? I find it strange that it would be able to concatenate data which is a 1x2 cell with header which is 1x1.
data should be a Nx1 cell array of char vectors in the anonymous function, where N is the number of rows with the same sensor_position. If it is not, I've done something wrong.
What is
class(filedata.Data{1})
?
Note that [header{1}, data{:}] is the same as [header{1}, data{1}, data{2}, ..., data{end}], so there should be no issue with the concatenation.
It is char.
Data is 1x2, so it should be 11x1 where 11 is the number of rows with the same sensor position? However the code means that data only holds one file at a time as it iterates through the list of files from start to finish.
data =
2×1 cell array
''
'2019-04-24_16:52:47.697, 1556124767697,
I thought not but best to check just in case. It is line 27 which is causing the problem but I don't exactly know which variable is not the right size.
The 2x1 data is the one in the for loop. That's not the same data in the rowfun. The problem was actually with header, as I'd forgotten the 'once' option to the regexp. Plus, there were a few extra brackets here and there.
I've tried to make up some data and tested the code now, so it should work properly. At least, there shouldn't be any typo left.
header has changed size and does incorporate other headers now into it, however I still have the same issues. Perhaps if I attached a set of files it would be more instructive as to how to troubleshoot. All the tests in the attached zip file are for 2 sensor positions. If you could try it out with those files I would be very grateful.
You didn't say there were still some typos in the filecontent = ... line. There was also a typo in the file name separation which resulted in an error in the mat2cell call.
And finally, I forgot to make the result of the merging a cell array in rowfun so it could be concatened into a column table.
Having data to test against really help. All bugs are now squashed.
Yes I fixed the filecontent and mat2cell error but forgot to mention it. Thanks, it works now.

Sign in to comment.

More Answers (0)

Asked:

on 25 Apr 2019

Commented:

on 27 Apr 2019

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!