Searching for closely related filenames in a directory

I'm working on some efficiency-focused code that checks whether a file exists and loads it if it does. I use structure arrays to store my parameters, so a simplified version of the filename string looks something like this (the real one has more variables, but this is enough to demonstrate the point while keeping complexity down, I hope), with the file search logic afterwards:
%String/Filename To Check For:
opts.str.f1 = ['D:\Filepath\M_3_202020_signals_vol_X_']; %Assume this bit is fixed for simplicity.
opts.str.f2 = ['32_Y_32_Z_32_Omega_260.567_no_fields_',num2str(opts.no_fields),'and_25_sensors.mat'];
%Search for the file:
if isfield(For_opts,'sensMstr') && isfile([opts.str.f1,opts.str.f2]); load([opts.str.f1,opts.str.f2]) %Load File
else; %Do Something Else
end
The issue is that sometimes this function is called with a value of "opts.no_fields" <= the value in my save file. I have logic elsewhere which manages the data loaded in from these files; I'm just seeking a way of loading in an existing file. For example, I can manually load in the file with opts.no_fields = 10, set opts.no_fields = 5 and run the rest of my program to get the desired results.
What I would like to do in the loading section is search for the filename and find a match if opts.no_fields is <= the value stored after the 'no_fields_' string, and all the other variables in the filename match. My best hacky approach is:
opts.str.f1 = ['D:\Filepath\M_3_202020_signals_vol_X_']; %Assume this bit is fixed for simplicity.
max_var = 20; %Pass Additional Variable
%Filenames currently all contain the same number of fields, but I want to work with less so this hack works for now.
%This will change in the very near future, hence the need to write some search code..
if opts.no_fields <= max_var; opts.str.f2 = ['32_Y_32_Z_32_Omega_260.567_no_fields_',num2str(max_var),'and_25_sensors.mat'];
else; opts.str.f2 = ['32_Y_32_Z_32_Omega_260.567_no_fields_',num2str(opts.no_fields),'and_25_sensors.mat'];
end
%Search for the file:
if isfield(For_opts,'sensMstr') && isfile([opts.str.f1,opts.str.f2]); load([opts.str.f1,opts.str.f2]) %Load File
else; %Do Something Else
end
I think a combination of isfile and contains may be able to achieve this, but I am unable to construct logic which actually runs, never mind achieves the intention. Any suggestions would be greatly appreciated.

4 Comments

Is there a reason you are not just using regex (or even, strfind) on your filenames and getting the no_fields? I'm slightly confused, perhaps show a simpler example removed from your application.
I had read the documentation for both functions but couldn't figure out how to implement them for this problem.
If I'm honest I'm new to saving/loading in data and I'm just trying my best to construct something that works as I go along.
str = '32_Y_32_Z_32_Omega_260.567_no_fields_360_and_25_sensors.mat';
out = regexpi(str,'fields_(fields|\d*)_and_(sensors|\d*)','tokens');
out{1}{1}
out{1}{2}
Results in
ans =
'360'
ans =
'25'
This would be my accepted answer, but unfortunately it was posted as a comment.
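The token extraction in the comment above can be combined with dir to cover the original question. What follows is only a sketch under stated assumptions: the folder and the empty placeholder files are created purely for the demo, `requested` stands in for opts.no_fields, and the pattern treats the underscore before "and" as optional because the question's code and the example string differ on it. Picking the smallest qualifying value is a design choice, not something the question specifies.

```matlab
% Sketch: pick a saved file whose no_fields value is >= the requested one.
% The folder and files below are dummies created only for this demo.
folder = tempname;                      % hypothetical stand-in for D:\Filepath
mkdir(folder);
prefix = 'M_3_202020_signals_vol_X_32_Y_32_Z_32_Omega_260.567_no_fields_';
suffix = '_and_25_sensors.mat';
for n = [5 10 20]                       % create empty placeholder files
    fclose(fopen(fullfile(folder, [prefix, num2str(n), suffix]), 'w'));
end

requested = 7;                          % plays the role of opts.no_fields

% List every file whose name matches the fixed parts.
listing = dir(fullfile(folder, [prefix, '*', suffix]));
names   = {listing.name};

% Extract the number after 'no_fields_' from each name.
tok    = regexp(names, 'no_fields_(\d+)_?and_', 'tokens', 'once');
hasTok = ~cellfun(@isempty, tok);       % drop names without a numeric token
names  = names(hasTok);
values = cellfun(@(t) str2double(t{1}), tok(hasTok));

% Keep files with enough fields and choose the smallest such value.
ok = values >= requested;
if any(ok)
    candNames   = names(ok);
    [best, idx] = min(values(ok));
    chosen      = fullfile(folder, candNames{idx});
    % load(chosen) would go here for a real MAT-file
else
    best   = NaN;
    chosen = '';                        % nothing suitable: do something else
end
```

With the three placeholder files, a request for 7 fields selects the file saved with 10.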

Answers (2)

Jan on 17 Dec 2020
Edited: Jan on 17 Dec 2020
['32_Y_32_Z_32_Omega_260.567_no_fields_',num2str(opts.no_fields),'and_25_sensors.mat']
Searching such files is complicated, because important information is hidden in the name of the file. This is an inefficient design, because it impedes addressing the data. Compare it with including a person's phone number and current weight in their name. That makes it much easier to call somebody if you know their name, but everyone's name changes every morning and it would be horrible to update all the databases.
Store data in a database, because this is its purpose. Either use a professional database and find the data with SQL queries, or create your own: store files with dull names like "file1.mat", "file2.mat", ... and write the important information (which data belong to which parameters) either in an extra field inside the file and/or in an extra file. Having this information only in an extra file is fragile, because the correlation between the parameters and the data might be lost when somebody renames a file. Storing the parameters only inside each file means opening a lot of files to find a specific dataset. So the stable solution is to store the parameters inside the files and copy them to an extra file for faster searching. A lost correlation can then be restored by collecting the parameters once again from the actual data files.
Instead of inventing a smart and flexible method to search file names, use a clear and clean approach to store parameters in a format, which can be searched efficiently.
Using file names to store parameters causes another problem as well: Windows Explorer does not handle file names longer than about 260 characters including the path (the classic MAX_PATH limit). Moving such files or folders in Windows Explorer fails, and you get an error message if you are lucky.

9 Comments

I appreciate the insight.
I'm working on a physics PhD with big data and limited computing resources, so this approach is essentially a way to minimise repeated experiments. Each file contains a PDE results solution from the PDE toolbox, which is ~18GB in size. These are also relatively fragile and frequently updated as components of the physical model (not relevant to the str above) change in the FEM model, e.g. geometry, boundary conditions etc.
All the variables are stored in that structure, but simply loading the 18GB results component into memory so I can check the parameters takes way too long, hence looking for a useful approach which works purely for the above purpose. It might be possible, but I haven't seen a way I can take the results structure and build a database with thousands of GB of data in there, whilst being stuck with a 1TB HDD and relying on smart cloud storage to actually access all this data in the first place.
Honestly, if you can guide me to some useful resources which would help me construct something simpler/more effective I would greatly appreciate it. I know I'm not following normal practice, but I don't know any of the terminology associated with the programming side of what I'm doing, so I can't find any resources to get me moving in the right direction.
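One way to avoid reading the whole file just to check the parameters is to save them as their own variable and ask load for that variable only. This is a minimal sketch, not the poster's actual layout: the names `params` and `results` and the demo file are hypothetical, and the random matrix is a small stand-in for the PDE results. For variables as large as the ~18GB mentioned here, the file would have to be saved with the '-v7.3' option, which is not shown.

```matlab
% Sketch: keep the parameters in their own variable so they can be read
% without loading the large results variable.
fname = [tempname, '.mat'];             % hypothetical demo file
params.no_fields = 10;                  % hypothetical parameter struct
params.sensors   = 25;
results = rand(200);                    % small stand-in for the PDE results
save(fname, 'params', 'results');

info = whos('-file', fname);            % lists stored variables, loads nothing
S    = load(fname, 'params');           % reads only 'params' from the file

requested = 5;                          % plays the role of opts.no_fields
if S.params.no_fields >= requested
    % Only now is it worth paying for the big variable:
    % R = load(fname, 'results');
end
```

The matfile function is another documented option when only part of a large variable in a v7.3 file is needed.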
I'm not opposed to placing metadata in a filename. It keeps data with data, describes what's inside, and should your master database crash or become corrupted, you might be able to salvage your project.
Best practice? If you're staying in MATLAB, you could just create a table with a few columns for no_fields, sensors, and also filepath and filename. Save it to a MAT-file, then open and append rows when you add files to your project, then re-save. You'll want to back this up regularly.
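The table idea can be sketched roughly as follows. The column names follow the comment above; the index location, the file names "file1.mat"/"file2.mat", and the parameter values are made up for illustration.

```matlab
% Sketch: a small table acting as an index of saved result files.
indexFile = [tempname, '.mat'];         % hypothetical location of the index

% Create the index once, with one row per data file.
index = table(20, 25, "D:\Filepath", "file1.mat", ...
    'VariableNames', {'no_fields', 'sensors', 'filepath', 'filename'});
save(indexFile, 'index');

% Later: open the index, append a row for a new file, and re-save.
S      = load(indexFile, 'index');
index  = S.index;
newRow = table(10, 25, "D:\Filepath", "file2.mat", ...
    'VariableNames', index.Properties.VariableNames);
index  = [index; newRow];
save(indexFile, 'index');

% Query: files with at least the requested number of fields, smallest first.
requested = 7;                          % plays the role of opts.no_fields
hit = index(index.no_fields >= requested & index.sensors == 25, :);
hit = sortrows(hit, 'no_fields');
% fullfile(hit.filepath(1), hit.filename(1)) is then the file to load
```

Because the index is tiny compared with the data files, loading and filtering it costs almost nothing, which is the point of keeping it separate.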
As suggested below, looking for string similarity is easy; I'll also suggest editDistance.
Oh, very nice! 😎
I thought they had a built in function for the Levenshtein distance that I mentioned in my answer below but I couldn't remember the function name. And, unfortunately Levenshtein doesn't come up in my help search. Neither does editDistance() - I guess it's not a built-in function. ☹️ Looks like it's in the Text Analytics Toolbox.
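For anyone without that toolbox, the Levenshtein distance is short enough to write in base MATLAB. This is a plain dynamic-programming sketch, not a reimplementation of editDistance, and the two example strings are made up.

```matlab
% Sketch: Levenshtein (edit) distance between two char vectors.
a = 'no_fields_10';                     % hypothetical example strings
b = 'no_fields_20';

% D(i+1,j+1) holds the distance between a(1:i) and b(1:j).
D = zeros(numel(a) + 1, numel(b) + 1);
D(:, 1) = (0:numel(a))';                % delete every character of a
D(1, :) = 0:numel(b);                   % insert every character of b
for i = 1:numel(a)
    for j = 1:numel(b)
        cost = double(a(i) ~= b(j));    % 0 if the characters match, else 1
        D(i+1, j+1) = min([D(i, j+1) + 1, ...   % deletion
                           D(i+1, j) + 1, ...   % insertion
                           D(i, j) + cost]);    % substitution
    end
end
dist = D(end, end);
```

Note that edit distance measures textual closeness only; it knows nothing about numeric order, so it cannot by itself decide whether one no_fields value is larger than another.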
@Image Analyst: There must be a built-in function, because Matlab suggests alternatives for typos in the command window.
Jan, it's built-in but built in to some specific toolbox (the Text Analytics Toolbox) I believe:
>> editDistance
Unrecognized function or variable 'editDistance'.
>> doc editDistance
No results for editDistance in MathWorks and Supplemental Software documentation.
Did you mean: getdistance (3 results in MathWorks documentation).
Search tips:
  1. Check the spelling.
  2. The search may be too vague—use more specific search terms.
  3. The search may be too specific—use more general search terms.
  4. If you used quotes for an exact match, try removing them for a broad match.
@Image Analyst: If I type "help ploz", Matlab suggests the help text of plot(). I stepped through the code of help() in the debugger and found:
candidates = {'bsd', 'Ass', 'csd', 'sad'};
match = matlab.internal.language.errorrecovery.namesuggestion('asd', candidates)
Yes, but nevertheless, I don't have the Text Analytics Toolbox so not even the suggestions for a misspelled editdistance() show up because of that.
I do not have the Text Analytics Toolbox either, but what I actually meant is that the built-in function matlab.internal.language.errorrecovery.namesuggestion can be used, not to calculate the edit or Levenshtein distance, but to find the nearest matching string.

Release

R2020b

Asked:

on 16 Dec 2020

Commented:

on 11 Jan 2021
