Extract data from text file

matlab noob on 29 Apr 2019
Edited: Stephen23 on 1 May 2019
I have this 'sample data.txt' text file in which the data is not in the right form. I need to read this text file, extract the data, and tabulate it in the order shown in the figure below. I am not sure how to do it.
I would really appreciate it if someone could guide me. Thank you.
  2 Comments
Guillaume on 29 Apr 2019
The format of your text file is dreadful! Has it been altered in any way from its original format? It would be much easier to parse if the column data were separated by a tab or comma character instead of spaces, and if the table header weren't split onto two lines within one of the column headers.
The screenshot that you show doesn't match the text file you've attached and therefore leaves some questions unanswered:
  • It would appear that the Delayed Gadolinium Enhancement column can have multiword entries (e.g. Full thickness). Can any other column also have multi word entries? If so, how can we identify which column a word belongs to?
  • The formatting of the text is not even consistent across the table. Sometimes you have < 50 (with a space), sometimes <50 (without a space) for that last column. Do you want the text as is, or normalised in the output? Even better in my opinion would be to convert to numbers, in that case should Full thickness be converted to 100?
Unfortunately, because of that awful formatting, you're going to have to write a parser for the file and make plenty of assumptions that may be invalid and cause the parsing to fail on future files. If you can get the same data in a more sensible format that would be better.
matlab noob on 29 Apr 2019
I really appreciate your reply to this question. To answer the questions you asked:
  • It would appear that the Delayed Gadolinium Enhancement column can have multiword entries (e.g. Full thickness). Can any other column also have multi word entries? If so, how can we identify which column a word belongs to? The other columns can also have multi-word entries.
  • The formatting of the text is not even consistent across the table. Sometimes you have < 50 (with a space), sometimes <50 (without a space) for that last column. Do you want the text as is, or normalised in the output? I need the original text as it is; no conversion is wanted in my case.
Meanwhile, I'm looking for something that can read the specific strings located between the pieces of data that I'd like to extract. Is it possible to apply this idea in this case?
Thank you.


Accepted Answer

Stephen23 on 29 Apr 2019
Edited: Stephen23 on 29 Apr 2019
That is a very badly formatted file. For example, the field delimiters are space characters, and space characters also occur within the fields (without any text delimiters to group the fields together). There is no robust general solution for parsing such a poorly formatted file. In some limited cases (such as with prior knowledge of the field contents) you might be able to parse it, but parsing such files will always be fragile. On that basis I assumed that the fields contain only text of the number and types that you have shown, i.e. each line contains exactly:
  1. 1 or 2 words (starts with 'Basal' or 'Mid' or 'Apical', or constitutes 'Apex')
  2. 1 number
  3. 1 word
  4. ('Nil' or 'Present')
  5. ('Nil' or 'Present')
  6. ('Nil' or 'Full thickness' or a percentage)
This matches all of the seventeen rows in your example data file:
str = fileread('sample data.txt'); % read the whole file as one char vector.
rgx = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)\s+(\d+)\s+([A-Z][a-z]+)',...
'\s+(Nil|Present)\s+(Nil|Present)\s+(Nil|Full thickness|([<>]\s?)?\d+\%)'];
tkn = regexpi(str,rgx,'tokens'); % one cell of tokens per matched row.
tkn = vertcat(tkn{:}) % stack the rows into one cell array.
Giving:
tkn =
'Basal Anterior' '1' 'Hypokinetic' 'Nil' 'Nil' '50%'
'Basal Anteroseptal' '2' 'Dyskinetic' 'Present' 'Present' 'Full thickness'
'Basal Inferoseptal' '3' 'Hypokinetic' 'Present' 'Present' '50%'
'Basal Inferior' '4' 'Hypokinetic' 'Nil' 'Present' '50%'
'Basal Inferolateral' '5' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Anterolateral' '6' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterior' '7' 'Hypokinetic' 'Nil' 'Nil' '<50%'
'Mid Anteroseptal' '8' 'Dyskinetic' 'Present' 'Present' 'Full thickness'
'Mid Inferoseptal' '9' 'Akinetic' 'Present' 'Present' 'Full thickness'
'Mid Inferior' '10' 'Hypokinetic' 'Nil' 'Present' '<50%'
'Mid Inferolateral' '11' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterolateral' '12' 'Normal' 'Nil' 'Nil' '<50%'
'Apical Anterior' '13' 'Akinetic' 'Nil' 'Nil' '50%'
'Apical Septal' '14' 'Akinetic' 'Nil' 'Nil' '< 50%'
'Apical Inferior' '15' 'Akinetic' 'Nil' 'Nil' '> 50%'
'Apical Lateral' '16' 'Hypokinetic' 'Nil' 'Nil' 'Full thickness'
'Apex' '17' 'Akinetic' 'Nil' 'Nil' 'Full thickness'
>> size(tkn)
ans =
17 6
>>
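To see how the pattern splits a single row into its six fields, here is a minimal sketch that applies the same regular expression to one line. The line is a made-up string written to resemble a row of the sample file, not text read from the file itself. MATLAB returns tokens only for the outermost parentheses, so the nested groups do not add extra columns, which is consistent with the 17x6 size above.

```matlab
% Hypothetical single line, resembling one row of the sample file.
line1 = 'Apical Septal 14 Akinetic Nil Nil < 50%';

% Same pattern as in the answer above.
rgx = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)\s+(\d+)\s+([A-Z][a-z]+)',...
       '\s+(Nil|Present)\s+(Nil|Present)\s+(Nil|Full thickness|([<>]\s?)?\d+\%)'];

% 'once' returns the tokens of the first match as a single cell array.
tok = regexpi(line1, rgx, 'tokens', 'once');

nFields = numel(tok);   % one token per outermost pair of parentheses
segment = tok{1};       % the one- or two-word segment name
lastCol = tok{end};     % the last field, kept exactly as written
```

Note that the last field keeps its original spacing ('< 50%'), in line with the request to leave the text unconverted.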
You can also put that into a table if you want to:
>> hdr = {'LeftVentricularSegments','No','WallMotion','PerfusionAtRest','PerfusionAtStress','DelayedGadoliniumEnhancement'};
>> T = cell2table(tkn,'VariableNames',hdr)
T =
LeftVentricularSegments No WallMotion PerfusionAtRest PerfusionAtStress DelayedGadoliniumEnhancement
_______________________ ____ _____________ _______________ _________________ ____________________________
'Basal Anterior' '1' 'Hypokinetic' 'Nil' 'Nil' '50%'
'Basal Anteroseptal' '2' 'Dyskinetic' 'Present' 'Present' 'Full thickness'
'Basal Inferoseptal' '3' 'Hypokinetic' 'Present' 'Present' '50%'
'Basal Inferior' '4' 'Hypokinetic' 'Nil' 'Present' '50%'
'Basal Inferolateral' '5' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Anterolateral' '6' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterior' '7' 'Hypokinetic' 'Nil' 'Nil' '<50%'
'Mid Anteroseptal' '8' 'Dyskinetic' 'Present' 'Present' 'Full thickness'
'Mid Inferoseptal' '9' 'Akinetic' 'Present' 'Present' 'Full thickness'
'Mid Inferior' '10' 'Hypokinetic' 'Nil' 'Present' '<50%'
'Mid Inferolateral' '11' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterolateral' '12' 'Normal' 'Nil' 'Nil' '<50%'
'Apical Anterior' '13' 'Akinetic' 'Nil' 'Nil' '50%'
'Apical Septal' '14' 'Akinetic' 'Nil' 'Nil' '< 50%'
'Apical Inferior' '15' 'Akinetic' 'Nil' 'Nil' '> 50%'
'Apical Lateral' '16' 'Hypokinetic' 'Nil' 'Nil' 'Full thickness'
'Apex' '17' 'Akinetic' 'Nil' 'Nil' 'Full thickness'
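Every column in the table above is text, including No. If the segment number is more convenient as a number, it could be converted afterwards while leaving the other columns untouched. The following is only a sketch under that assumption: it uses two made-up rows and a shortened header list rather than the full table.

```matlab
% Two made-up rows with a shortened header, standing in for the full table.
hdr = {'LeftVentricularSegments','No','WallMotion'};
tkn = {'Basal Anterior','1','Hypokinetic'; ...
       'Apex','17','Akinetic'};
T = cell2table(tkn,'VariableNames',hdr);

% str2double accepts a cell array of char vectors and returns a double
% array of the same size, so the whole column can be replaced at once.
T.No = str2double(T.No);
```

The text columns (including entries such as '< 50%' in the full table) are not touched by this step.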
  12 Comments
Stephen23 on 1 May 2019
Edited: Stephen23 on 1 May 2019
This code reads your three later files, where each field is on its own line.
The code relies on one main assumption: that the header name "No" appears by itself on one line, which is used to anchor and identify the block of data that you are looking for. The other lines are simply contiguous with that header name. It also uses the "No" field values to identify the number of fields: this requires that only the "No" fields contain purely numeric values.
R = '([^\n]+\n)*No(\n[^\n]+)+'; % regular expression: lines contiguous around "No".
S = dir('sample*.txt');
N = numel(S);
C = cell(1,N);
for k = 1:N
    str = fileread(S(k).name);
    str = regexprep(str,'\r\n','\n'); % replace Windows newlines.
    M = regexp(str,R,'match','once'); % match the block of lines in the text file.
    P = regexp(M,'\n','split'); % split lines into cell array.
    V = str2double(P); % convert lines into numbers (NaN if not numeric).
    D = mean(diff(find(~isnan(V)))); % spacing of non-NaN (i.e. "No") lines = number of fields.
    H = regexprep(P(1:D),'\s+','_'); % get header lines.
    X = strcmpi(P{D+1},'Enhancement'); % identify superfluous header.
    A = reshape(P(1+X+D:end),D,[]).'; % get data lines.
    T = cell2table(A,'VariableNames',H); % convert data + header into table.
    C{k} = T;
end
Giving:
>> C{:}
ans =
Left_Ventricular_Segments No Perfusion_defect_at_rest Perfusion_defect_at_stress Wall_Motion Delayed_Gadolinium
_________________________ ____ ________________________ __________________________ _____________ __________________
'Basal Anterior' '1' 'Nil' 'Nil' 'Normal' 'Mid wall'
'Basal Anteroseptal' '2' 'Nil' 'Nil' 'Normal' 'Mid wall'
'Basal Inferoseptal' '3' 'Nil' 'Nil' 'Normal' 'Mid wall'
'Basal Inferior' '4' 'Nil' 'Nil' 'Normal' 'Nil'
'Basal Inferolateral' '5' 'Nil' 'Nil' 'Normal' 'Nil'
'Basal Anterolateral' '6' 'Nil' 'Nil' 'Normal' 'Nil'
'Mid Anterior' '7' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Mid Anteroseptal' '8' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Mid Inferoseptal' '9' 'Present' 'Present' 'Hypokinetic' 'Full thickness'
'Mid Inferior' '10' 'Nil' 'Nil' 'Normal' 'Nil'
'Mid Inferolateral' '11' 'Nil' 'Nil' 'Normal' 'Nil'
'Mid Anterolateral' '12' 'Nil' 'Nil' 'Normal' 'Nil'
'Apical Anterior' '13' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Apical Septal' '14' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Apical Inferior' '15' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Apical Lateral' '16' 'Nil' 'Nil' 'Normal' '<50%'
'Apex' '17' 'Nil' 'Nil' 'Dystkinetic' '<50%'
ans =
Left_Ventricular_Segments No Wall_Motion Perfusion_At_Rest Perfusion_At_Stress Delayed_Gadolinium
_________________________ ____ ___________ _________________ ___________________ __________________
'Basal Anterior' '1' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Anteroseptal' '2' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Inferoseptal' '3' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Inferior' '4' 'Normal' 'Nil' 'Nil' '50% (mid wall)'
'Basal Inferolateral' '5' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Anterolateral' '6' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterior' '7' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anteroseptal' '8' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Inferoseptal' '9' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Inferior' '10' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Inferolateral' '11' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterolateral' '12' 'Normal' 'Nil' 'Nil' 'Nil'
'Apical Anterior' '13' 'Normal' 'Nil' 'Nil' 'Nil'
'Apical Septal' '14' 'Normal' 'Nil' 'Nil' 'Nil'
'Apical Inferior' '15' 'Normal' 'Nil' 'Nil' 'Nil'
'Apical Lateral' '16' 'Normal' 'Nil' 'Nil' 'Nil'
'Apex' '17' 'Normal' 'Nil' 'Nil' 'Nil'
ans =
Left_Ventricular_Segments No Wall_Motion Perfusion_Defect_At_Stress Delayed_Gadolinium
_________________________ ____ _____________ __________________________ __________________
'Basal Anterior' '1' 'Hypokinetic' 'Nil' 'Nil'
'Basal Anteroseptal' '2' 'Hypokinetic' 'Nil' 'Mid wall'
'Basal Inferoseptal' '3' 'Hypokinetic' 'Nil' 'Mid wall'
'Basal Inferior' '4' 'Hypokinetic' 'Nil' 'Nil'
'Basal Inferolateral' '5' 'Hypokinetic' 'Nil' 'Nil'
'Basal Anterolateral' '6' 'Hypokinetic' 'Nil' 'Nil'
'Mid Anterior' '7' 'Hypokinetic' 'Nil' '50%'
'Mid Anteroseptal' '8' 'Hypokinetic' 'Nil' 'Mid wall'
'Mid Inferoseptal' '9' 'Hypokinetic' 'Nil' 'Mid wall'
'Mid Inferior' '10' 'Hypokinetic' 'Possibly' 'Nil'
'Mid Inferolateral' '11' 'Hypokinetic' 'Nil' 'Nil'
'Mid Anterolateral' '12' 'Hypokinetic' 'Nil' 'Nil'
'Apical Anterior' '13' 'Akinetic' 'Nil' '50%'
'Apical Septal' '14' 'Hypokinetic' 'Nil' '50%'
'Apical Inferior' '15' 'Hypokinetic' 'Nil' '50%'
'Apical Lateral' '16' 'Hypokinetic' 'Nil' '< 50%'
'Apex' '17' 'Dyskinetic' 'Nil' '50%'
>>
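The field-counting step in the loop above may be easier to follow in isolation. The sketch below runs the same idea on a small made-up list of lines (two header names followed by two records of two fields each). These strings are illustrative only and are not taken from the sample files.

```matlab
% Made-up lines: two header names, then two records of two fields each.
P = {'Segment','No','Basal Anterior','1','Apex','2'};

V   = str2double(P);        % NaN everywhere except the purely numeric lines
idx = find(~isnan(V));      % positions of the "No" values
D   = mean(diff(idx));      % spacing between them = fields per record

H = P(1:D);                       % the first D lines are the header names
A = reshape(P(D+1:end),D,[]).';   % remaining lines: one row per record
```

Here the numeric lines sit at positions 4 and 6, so D is 2 and A is a 2x2 cell array with one record per row. The approach breaks down if any other field is purely numeric, which is the assumption stated above.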
matlab noob
matlab noob on 1 May 2019
Thank you so much for your help and explanation. I think I understand your concept. However, I'll need some time to understand the code. Once again, thank you!


More Answers (1)

KSSV on 29 Apr 2019
T = readtable(myfile)
  2 Comments
Guillaume on 29 Apr 2019
There is no way that readtable can cope with the sample file supplied by the OP.
matlab noob on 29 Apr 2019
I appreciate the reply. Thanks!