How to ignore special characters and retrieve the data prior to the character

I have 40 years of data. Unfortunately, each text file contains the special characters # or *, which mark the highest or lowest temperatures for that specific day and month. My code works (apart from regexp(minT_tbl,'#*','match') and its counterpart). However, the special characters confuse the program and make the data wrong. Any help would be great!
close all;
clear all;
clc;
Datafiles = fileDatastore("temp_summary*.txt","ReadFcn",@readMonth,"UniformRead",true);
dataAll = readall(Datafiles)
dataAll.Year = year(dataAll.Day);
dataAll.Month = month(dataAll.Day);
dataAll.DD = day(dataAll.Day)
%delete leap year
LY = (dataAll.Month(:)==2 & dataAll.DD(:)==29);
dataAll(LY,:) = [];
% Unstack variables
minT_tbl = unstack(dataAll,"MinT","Year","GroupingVariables", ["Month","DD"],"VariableNamingRule","preserve")
maxT_tbl = unstack(dataAll,"MaxT","Year","GroupingVariables", ["Month","DD"],"VariableNamingRule","preserve")
yrs =str2double(minT_tbl.Properties.VariableNames(3:end))';
%ignore special characters
regexp(minT_tbl,'#*','match')
regexp(maxT_tbl,'#*','match')
% find min
[Tmin,idxMn] = min(minT_tbl{:,3:end},[],2,'omitnan');
Tmin_yr = yrs(idxMn);
% find max
[Tmax,idxMx] = max(maxT_tbl{:,3:end},[],2,'omitnan');
Tmax_yr = yrs(idxMx);
% find low high
[lowTMax,idxMx] = min(maxT_tbl{:,3:end},[],2,'omitnan');
LowTMax_yr = yrs(idxMx);
% find high low
[highlowTMn,idxMn] = max(minT_tbl{:,3:end},[],2,'omitnan');
HighLowT_yr = yrs(idxMn);
% find avg high
AvgTMx = round(mean(table2array(maxT_tbl(:,3:end)),2,'omitnan'));
% find avg low
AvgTMn = round(mean(table2array(minT_tbl(:,3:end)),2,'omitnan'));
% Results
tempTbl = [maxT_tbl(:,["Month","DD"]), table(Tmax,Tmax_yr,AvgTMx,lowTMax,LowTMax_yr,Tmin,Tmin_yr,AvgTMn,highlowTMn,HighLowT_yr)]
tempTbl2 = splitvars(tempTbl)
FID = fopen('Meda 05 Temperature Climatology.txt','w');
report_date = datetime('now','format','yyyy-MM-dd HH:mm'); % mm = minutes (MM would print the month)
fprintf(FID,'Meda 05 Temperature Climatology at %s \n', report_date);
fprintf(FID,"Month DD Temp Max (°F) Tmax_yr AvgTMax (°F) lowTMax (°F) LowTMax_yr TempMin (°F) TMin_yr AvgTMin (°F) HighlowTMin (°F) HighlowT_yr \n");
fprintf(FID,'%3d %6d %7d %14d %11d %11d %15d %11d %13d %10d %13d %17d \n', tempTbl2{:,1:end}');
fclose(FID);
winopen('Meda 05 Temperature Climatology.txt')
function Tbl = readMonth(filename)
opts = detectImportOptions(filename)
opts.ConsecutiveDelimitersRule = 'join';
opts.MissingRule = 'omitvar';
opts = setvartype(opts,'double');
opts.VariableNames = ["Day","MaxT","MinT","AvgT"];
Tbl = readtable(filename,opts);
Tbl = standardizeMissing(Tbl,{999,'N/A'},"DataVariables",{'MaxT','MinT','AvgT'})
Tbl = standardizeMissing(Tbl,{-99,'N/A'},"DataVariables",{'MaxT','MinT','AvgT'})
[~,basename] = fileparts(filename);
nameparts = regexp(basename, '\.', 'split');
dateparts = regexp(nameparts{end}, '_','split');
year_str = dateparts{end}
d = str2double(extract(filename,digitsPattern));
Tbl.Day = datetime(d(3),d(2),Tbl.Day)
end
  6 Comments
Cris LaPierre on 7 Feb 2024
Test it out. It doesn't eliminate them, because the month no longer equals 2 and the day no longer equals 29. They are now 3 and 1.
dataAll = table();
dataAll.Day = datetime(1981,2,29) % Feb 29, 1981, which is a non-leap year
dataAll = table
        Day
    ___________
    01-Mar-1981
dataAll.Month = month(dataAll.Day);
dataAll.DD = day(dataAll.Day)
dataAll = 1×3 table
        Day        Month    DD
    ___________    _____    __
    01-Mar-1981      3      1
% Remove all Feb 29 dates from the table
LY = (dataAll.Month(:)== 2 & dataAll.DD(:) == 29);
dataAll(LY,:) = [ ]
dataAll = 1×3 table
        Day        Month    DD
    ___________    _____    __
    01-Mar-1981      3      1
As you can see, the current LY code did not remove the data.
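One way around the rollover is to drop the nonexistent days before the datetime is built. The following is only a sketch: it assumes the files list 29 rows for February even in non-leap years, and the names yr, mo, dayNum and isRealDay are hypothetical stand-ins for the values that readMonth would have at that point.

```matlab
% Hypothetical inputs: year and month parsed from a file name, plus the
% day-of-month column as read from a file that lists 29 rows (assumption).
yr = 1981;
mo = 2;
dayNum = (1:29)';

% Keep only days that exist in this month; eomday(1981,2) returns 28.
isRealDay = dayNum <= eomday(yr, mo);
Day = datetime(yr, mo, dayNum(isRealDay));

% With no rollover left, this test can only match genuine leap days.
isFeb29 = month(Day) == 2 & day(Day) == 29;
Day(isFeb29) = [];
```

In a leap year, eomday returns 29, so the first filter keeps the row and the second one removes it, which matches the intent of the original LY code.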
Jonathon Klepatzki on 7 Feb 2024
Well, I am stuck then, because I have tried it several different ways and I keep getting the same result.

Accepted Answer

Voss on 6 Feb 2024
Edited: Voss on 6 Feb 2024
The following code replaces any * or # characters in a text file with spaces (note that this replaces the existing file with a new file of the same name):
% read the file
fid = fopen(filename,'r');
str = fread(fid,[1 Inf],'*char');
fclose(fid);
% replace any * or # with a space (empty char vector should also work)
str = regexprep(str,'[*#]',' ');
% write the new file
fid = fopen(filename,'w');
fwrite(fid,str);
fclose(fid);
If you don't mind losing the original files that have the * and/or # characters in them, you can run this code for each of your text files before running your code or you can incorporate this code into your readMonth function.
If you want to preserve the original files, make a separate copy of them first, or modify the above code to write to a different file, e.g.:
% write the new file
[fp,fn,ext] = fileparts(filename);
fid = fopen(fullfile(fp,[fn '_modified' ext]),'w');
fwrite(fid,str);
fclose(fid);
and tell fileDatastore to use the modified files only, e.g.:
Datafiles = fileDatastore("temp_summary*_modified.txt","ReadFcn",@readMonth,"UniformRead",true);
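If rewriting files is not desirable at all, the same replacement can be done in memory inside the read function. This is a minimal sketch, not the code from the answer above: the sample text is made up, the four-column layout is assumed from the thread, and in readMonth the variable str would come from fileread(filename) instead.

```matlab
% Sample text standing in for one file's contents (hypothetical values).
% In readMonth this would be: str = fileread(filename);
str = strjoin({' Day  Maximum Temp  Minimum Temp  Average Temp', ...
               ' 01  66*  28   47.0', ...
               ' 02  65   29#  47.0'}, newline);

% Replace the flag characters with spaces, as in the answer above.
str = regexprep(str, '[*#]', ' ');

% Keep only lines that start with a digit (skips header and blank lines).
lines  = regexp(str, '\r?\n', 'split');
isData = ~cellfun(@isempty, regexp(lines, '^\s*\d', 'once'));

% Parse the remaining text as four numeric columns (assumed layout).
C   = textscan(strjoin(lines(isData), newline), '%f %f %f %f');
Tbl = table(C{:}, 'VariableNames', {'Day','MaxT','MinT','AvgT'});
```

The original files stay untouched, and the datastore pattern does not need to change.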
  20 Comments
Star Strider on 13 Feb 2024
I am lost. My code seems to work correctly when I run it, without any other modifications to it or to the tables or files it creates.
Mentioning me using ‘@’ flags me and I look to see what I need to attend to, if anything, since sometimes it’s just a reference.
Cris LaPierre on 13 Feb 2024
@Jonathon Klepatzki, you can specify the NumHeaderLines, VariableNamesLine, VariableUnitsLine, VariableDescriptionsLine, and the DataLines import arguments to correctly import a file that has non-data lines between the variable names and data.
However, because you are using a datastore to import your files, the same import options are used to read every file. Therefore, all files must be formatted the same way or you will get errors like the one you saw.
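As a rough illustration of the DataLines setting, the sketch below writes a small temporary file with a non-data line between the names and the data, then tells the import options where the data starts. The file contents are hypothetical, and delimitedTextImportOptions is used here instead of detectImportOptions only so that the example does not depend on automatic detection.

```matlab
% Write a small sample file: names, a separator line, then data (made up).
tmp = [tempname '.txt'];
fid = fopen(tmp, 'w');
fprintf(fid, '%s\n', 'Day MaxT MinT AvgT', ...
                     '--- ---- ---- ----', ...
                     '1 66 28 47.0', ...
                     '2 65 30 47.5');
fclose(fid);

% Build import options by hand and point DataLines at the third line.
opts = delimitedTextImportOptions('NumVariables', 4);
opts.Delimiter = ' ';
opts.ConsecutiveDelimitersRule = 'join';
opts.LeadingDelimitersRule = 'ignore';
opts.VariableNames = {'Day','MaxT','MinT','AvgT'};
opts.VariableTypes = {'double','double','double','double'};
opts.DataLines = [3 Inf];   % data runs from line 3 to the end of the file

Tbl = readtable(tmp, opts);
delete(tmp);
```

Because the datastore applies one set of options to every file, a DataLines value like this only works if the data starts on the same line in all of them.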

More Answers (3)

Sulaymon Eshkabilov on 6 Feb 2024
Here is one possible solution, to get the data correctly from the data file:
% Open the data file for reading
FID = fopen('temp_summary.05.03_1998.txt', 'r');
% Initialize a cell array to store the cleaned data
C_Lines = {};
% Read the file line by line
N_line = fgetl(FID);
while ischar(N_line)
    % Remove '*' and '#' characters from the line
    C_Line = strrep(N_line, '*', '');
    C_Line = strrep(C_Line, '#', '');
    % Store the cleaned line if it is not empty
    if ~isempty(C_Line)
        C_Lines{end+1} = C_Line;
    end
    % Read the next line
    N_line = fgetl(FID);
end
% Close the file:
fclose(FID);
% Convert the cell array of cleaned lines to a character array:
C_Data = char(C_Lines)
C_Data = 32×49 char array
    ' Day Maximum Temp Minimum Temp Average Temp'
    ' 01 66 28 47.0 '
    ' 02 65 29 47.0 '
    ' 03 62 36 49.0 '
    ' 04 63 31 47.0 '
    ' 05 52 36 44.0 '
    ' 06 53 28 40.5 '
    ' 07 62 26 44.0 '
    ' 08 65 27 46.0 '
    ' 09 69 27 48.0 '
    ' 10 76 28 52.0 '
    ' 11 74 29 51.5 '
    ' 12 62 44 53.0 '
    ' 13 65 43 54.0 '
    ' 14 75 32 53.5 '
    ' 15 73 35 54.0 '
    ' 16 73 34 53.5 '
    ' 17 64 37 50.5 '
    ' 18 69 27 48.0 '
    ' 19 74 34 54.0 '
    ' 20 77 31 54.0 '
    ' 21 76 36 56.0 '
    ' 22 83 37 60.0 '
    ' 23 82 50 66.0 '
    ' 24 64 49 56.5 '
    ' 25 60 43 51.5 '
    ' 26 54 47 50.5 '
    ' 27 52 34 43.0 '
    ' 28 51 34 42.5 '
    ' 29 60 29 44.5 '
    ' 30 57 31 44.0 '
    ' 31 50 32 41.0 '
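The result above is still text. A possible follow-up, sketched here under the assumption that every data line holds exactly four numbers, converts the cleaned lines into a numeric table. The three sample lines are copied from the output above; the names vals, M and Tbl are hypothetical.

```matlab
% A few cleaned lines, as produced by the loop above (header line first).
C_Lines = {' Day Maximum Temp Minimum Temp Average Temp', ...
           ' 01 66 28 47.0 ', ...
           ' 06 53 28 40.5 '};

% Parse each data line into a 1-by-4 row of numbers, skipping the header.
vals = cellfun(@(s) sscanf(s, '%f')', C_Lines(2:end), 'UniformOutput', false);

% Stack the rows; this fails if a line has more or fewer than four numbers.
M   = vertcat(vals{:});
Tbl = array2table(M, 'VariableNames', {'Day','MaxT','MinT','AvgT'});
```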

Cris LaPierre on 7 Feb 2024
Edited: Cris LaPierre on 8 Feb 2024
I think another rather straightforward approach is to treat * and # as delimiters.
I've simplified the read function for readability.
Datafiles = fileDatastore("temp_summary*.txt","ReadFcn",@readMonth,"UniformRead",true);
dataAll = readall(Datafiles)
dataAll = 93×4 table
    Day    MaxT    MinT    AvgT
    ___    ____    ____    ____
     1      66      28      47
     2      65      29      47
     3      62      36      49
     4      63      31      47
     5      52      36      44
     6      53      28      40.5
     7      62      26      44
     8      65      27      46
     9      69      27      48
    10      76      28      52
    11      74      29      51.5
    12      62      44      53
    13      65      43      54
    14      75      32      53.5
    15      73      35      54
    16      73      34      53.5
function Tbl = readMonth(filename)
Tbl = readtable(filename,"ConsecutiveDelimitersRule","join","ReadVariableNames",false,...
"Delimiter",{' ','\t','*','#'},"LeadingDelimitersRule",'ignore',...
'EmptyLineRule','skip');
Tbl.Properties.VariableNames = {'Day' 'MaxT' 'MinT' 'AvgT'};
end
  5 Comments
Cris LaPierre on 7 Feb 2024
Moved: Cris LaPierre on 8 Feb 2024
Hmm. Works here. Have you shared the full error message (all the red text)?
Datafiles = fileDatastore("temp_summary*.txt","ReadFcn",@readMonth,"UniformRead",true);
dataAll = readall(Datafiles)
dataAll = 124×4 table
    Day    MaxT    MinT    AvgT
    ___    ____    ____    ____
     1      65      12      38.5
     2      68      28      48
     3      65      17      41
     4      57      22      39.5
     5      46      24      35
     6      61      18      39.5
     7      62      25      43.5
     8      58      12      35
     9      64      11      37.5
    10      65      14      39.5
    11      54      22      38
    12      58      40      49
    13      64      27      45.5
    14      65      19      42
    15      59      19      39
    16      62      23      42.5
function Tbl = readMonth(filename)
Tbl = readtable(filename,"ConsecutiveDelimitersRule","join","ReadVariableNames",false,...
"Delimiter",{' ','\t','*','#'},"LeadingDelimitersRule",'ignore',...
'EmptyLineRule','skip');
Tbl.Properties.VariableNames = {'Day' 'MaxT' 'MinT' 'AvgT'};
end


Walter Roberson on 8 Feb 2024
To answer the original question:
An alternative way to read the files is to use FixedWidthImportOptions together with readtable() https://www.mathworks.com/help/matlab/ref/matlab.io.text.fixedwidthimportoptions.html
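A rough sketch of the fixed-width idea follows. The column widths and the two sample lines are assumptions made for illustration and are not taken from the actual files; the point is that each flag character gets its own one-character column, so it never touches the numeric fields. MaxFlag and MinFlag are hypothetical names.

```matlab
% Write a small fixed-width sample file (hypothetical layout and values).
% Field widths: Day 3, MaxT 4, flag 1, MinT 4, flag 1, AvgT 6.
tmp = [tempname '.txt'];
fid = fopen(tmp, 'w');
fprintf(fid, '%s\n', ' 01  66*  28   47.0', ...
                     ' 02  65   29#  47.0');
fclose(fid);

% Describe the layout: numbers as double, flags as text.
opts = fixedWidthImportOptions('NumVariables', 6);
opts.VariableWidths = [3 4 1 4 1 6];
opts.VariableNames  = {'Day','MaxT','MaxFlag','MinT','MinFlag','AvgT'};
opts.VariableTypes  = {'double','double','char','double','char','double'};

Tbl = readtable(tmp, opts);
delete(tmp);
```

Unlike stripping the characters, this keeps the flags available as data, which may be useful if the record markers are needed later. It only works if every file uses the same column positions.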

Release

R2020b