Extract data from text file
I have this 'sample data.txt' text file with the data not in the right form. I need to read this text file, extract the data, and tabulate it in the order shown in the figure below. I am not sure how to do it.
I would really appreciate it if someone could guide me. Thank you.
2 Comments
Guillaume
on 29 Apr 2019
The format of your text file is dreadful! Has it been altered in any way from its original format? It would be much easier to parse if the column data were separated by a tab or comma character instead of spaces, and if the table header weren't split onto two lines within one of the column headers.
The screenshot that you show doesn't match the text file you've attached and therefore leaves some questions unanswered:
- It would appear that the Delayed Gadolinium Enhancement column can have multi-word entries (e.g. Full thickness). Can any other column also have multi-word entries? If so, how can we identify which column a word belongs to?
- The formatting of the text is not consistent across the table. Sometimes you have < 50 (with a space), sometimes <50 (without a space) in that last column. Do you want the text as is, or normalised in the output? Even better, in my opinion, would be to convert to numbers. In that case, should Full thickness be converted to 100?
Unfortunately, because of that awful formatting, you're going to have to write a parser for the file and make plenty of assumptions that may be invalid and cause the parsing to fail on future files. If you can get the same data in a more sensible format, that would be better.
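For what it's worth, here is a minimal sketch of what normalising (and optionally converting) the last column could look like. The sample entries are made up for illustration, and the mapping of 'Nil' to 0 and 'Full thickness' to 100 is purely an assumption that the original poster would have to confirm:

```matlab
% Hypothetical sample of last-column entries (not read from the real file).
raw = {'50%'; '< 50%'; '<50%'; '> 50%'; 'Nil'; 'Full thickness'};

% Normalise the text: drop the optional whitespace after '<' or '>'.
txt = regexprep(raw, '([<>])\s+', '$1');

% Keep the comparison sign separately so that '<50%' and '50%' stay distinct.
sgn = regexp(txt, '^[<>]', 'match', 'once');   % '<', '>' or empty

% Convert to numbers; entries without digits become NaN at this point.
pct = str2double(regexp(txt, '\d+', 'match', 'once'));

% ASSUMPTION: 'Nil' means 0 and 'Full thickness' means 100.
pct(strcmp(txt, 'Nil')) = 0;
pct(strcmp(txt, 'Full thickness')) = 100;
```

Keeping sgn next to pct avoids silently turning "<50%" into the same value as "50%".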
Accepted Answer
Stephen23
on 29 Apr 2019
Edited: Stephen23
on 29 Apr 2019
That is a very badly formatted file. For example, the field delimiters are space characters, and space characters also occur within the fields (without any text delimiters to group the fields together). There is no robust general solution for parsing such a poorly formatted file, although in some limited cases (such as with prior knowledge of the field contents) you might be able to parse it. Parsing such files will always be fragile. On that basis I assumed that the fields contain only the text, in the number and types, that you have shown, i.e. each line contains exactly:
- 1 or 2 words (starts with 'Basal' or 'Mid' or 'Apical', or constitutes 'Apex')
- 1 number
- 1 word
- ('Nil' or 'Present')
- ('Nil' or 'Present')
- ('Nil' or 'Full thickness' or a percentage)
This matches all of the seventeen rows in your example data file:
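str = fileread('sample data.txt'); % read the whole file as one character vector.
% One token per field (nested parentheses only group, the outermost ones are returned as tokens):
rgx = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)\s+(\d+)\s+([A-Z][a-z]+)',...
'\s+(Nil|Present)\s+(Nil|Present)\s+(Nil|Full thickness|([<>]\s?)?\d+\%)'];
tkn = regexpi(str,rgx,'tokens'); % one 1x6 cell array per matched row.
tkn = vertcat(tkn{:}) % stack the rows into an Nx6 cell array.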
Giving:
tkn =
'Basal Anterior' '1' 'Hypokinetic' 'Nil' 'Nil' '50%'
'Basal Anteroseptal' '2' 'Dyskinetic' 'Present' 'Present' 'Full thickness'
'Basal Inferoseptal' '3' 'Hypokinetic' 'Present' 'Present' '50%'
'Basal Inferior' '4' 'Hypokinetic' 'Nil' 'Present' '50%'
'Basal Inferolateral' '5' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Anterolateral' '6' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterior' '7' 'Hypokinetic' 'Nil' 'Nil' '<50%'
'Mid Anteroseptal' '8' 'Dyskinetic' 'Present' 'Present' 'Full thickness'
'Mid Inferoseptal' '9' 'Akinetic' 'Present' 'Present' 'Full thickness'
'Mid Inferior' '10' 'Hypokinetic' 'Nil' 'Present' '<50%'
'Mid Inferolateral' '11' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterolateral' '12' 'Normal' 'Nil' 'Nil' '<50%'
'Apical Anterior' '13' 'Akinetic' 'Nil' 'Nil' '50%'
'Apical Septal' '14' 'Akinetic' 'Nil' 'Nil' '< 50%'
'Apical Inferior' '15' 'Akinetic' 'Nil' 'Nil' '> 50%'
'Apical Lateral' '16' 'Hypokinetic' 'Nil' 'Nil' 'Full thickness'
'Apex' '17' 'Akinetic' 'Nil' 'Nil' 'Full thickness'
>> size(tkn)
ans =
17 6
>>
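Because the parsing is fragile, it may be worth checking which text the regular expression did not consume. The following is only a sketch: the three-line string is a made-up stand-in for the file content, and the 'Transmural' entry is a hypothetical value chosen so that one line breaks the assumptions listed above.

```matlab
% Hypothetical excerpt standing in for fileread('sample data.txt').
% The second line deliberately violates the assumed last-column values.
str = sprintf('%s\n', ...
    'Basal Anterior 1 Hypokinetic Nil Nil 50%', ...
    'Apical Lateral 16 Hypokinetic Nil Nil Transmural', ...
    'Apex 17 Akinetic Nil Nil Full thickness');

% Same regular expression as in the answer.
rgx = ['(Apex|(Basal|Mid|Apical)\s+[A-Z][a-z]+)\s+(\d+)\s+([A-Z][a-z]+)',...
       '\s+(Nil|Present)\s+(Nil|Present)\s+(Nil|Full thickness|([<>]\s?)?\d+\%)'];

tkn = regexpi(str, rgx, 'tokens');
tkn = vertcat(tkn{:});          % rows that were parsed

% Delete every matched row; whatever remains was NOT parsed.
rest = strtrim(regexprep(str, rgx, '', 'ignorecase'));
if ~isempty(rest)
    warning('Unparsed text remains:\n%s', rest);
end
```

On a real file rest would also contain the header lines, so it is meant to be inspected by eye. If nothing but the header is left, every data line ended up in tkn.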
Clearly you can put that into a table if you really want to:
>> hdr = {'LeftVentricularSegments','No','WallMotion','PerfusionAtRest','PerfusionAtStress','DelayedGadoliniumEnhancement'};
>> T = cell2table(tkn,'VariableNames',hdr)
T =
LeftVentricularSegments No WallMotion PerfusionAtRest PerfusionAtStress DelayedGadoliniumEnhancement
_______________________ ____ _____________ _______________ _________________ ____________________________
'Basal Anterior' '1' 'Hypokinetic' 'Nil' 'Nil' '50%'
'Basal Anteroseptal' '2' 'Dyskinetic' 'Present' 'Present' 'Full thickness'
'Basal Inferoseptal' '3' 'Hypokinetic' 'Present' 'Present' '50%'
'Basal Inferior' '4' 'Hypokinetic' 'Nil' 'Present' '50%'
'Basal Inferolateral' '5' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Anterolateral' '6' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterior' '7' 'Hypokinetic' 'Nil' 'Nil' '<50%'
'Mid Anteroseptal' '8' 'Dyskinetic' 'Present' 'Present' 'Full thickness'
'Mid Inferoseptal' '9' 'Akinetic' 'Present' 'Present' 'Full thickness'
'Mid Inferior' '10' 'Hypokinetic' 'Nil' 'Present' '<50%'
'Mid Inferolateral' '11' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterolateral' '12' 'Normal' 'Nil' 'Nil' '<50%'
'Apical Anterior' '13' 'Akinetic' 'Nil' 'Nil' '50%'
'Apical Septal' '14' 'Akinetic' 'Nil' 'Nil' '< 50%'
'Apical Inferior' '15' 'Akinetic' 'Nil' 'Nil' '> 50%'
'Apical Lateral' '16' 'Hypokinetic' 'Nil' 'Nil' 'Full thickness'
'Apex' '17' 'Akinetic' 'Nil' 'Nil' 'Full thickness'
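All of the table variables above are still text. If typed columns are more convenient, something along these lines might work. This is a sketch using a made-up two-row stand-in for tkn, and the choice of types (numeric, categorical, logical) is my assumption rather than anything the question asks for:

```matlab
% Hypothetical two-row stand-in for the 17-by-6 cell array tkn.
tkn = {'Basal Anterior', '1',  'Hypokinetic', 'Nil', 'Nil',     '50%'; ...
       'Apex',           '17', 'Akinetic',    'Nil', 'Present', 'Full thickness'};
hdr = {'LeftVentricularSegments', 'No', 'WallMotion', ...
       'PerfusionAtRest', 'PerfusionAtStress', 'DelayedGadoliniumEnhancement'};
T = cell2table(tkn, 'VariableNames', hdr);

% Convert the text columns to more specific types.
T.No = str2double(T.No);                                     % numeric segment number
T.WallMotion = categorical(T.WallMotion);                    % small set of labels
T.PerfusionAtRest   = strcmp(T.PerfusionAtRest,   'Present'); % logical: true if 'Present'
T.PerfusionAtStress = strcmp(T.PerfusionAtStress, 'Present');
```

Rows can then be selected with ordinary indexing, for example T(T.PerfusionAtStress, :) or T(T.WallMotion == 'Akinetic', :).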
12 Comments
Stephen23
on 1 May 2019
Edited: Stephen23
on 1 May 2019
This code reads your three later files, where each field is on its own line.
The code relies on one main assumption: that the header name "No" appears by itself on one line, which is used to anchor and identify the block of data that you are looking for. The other lines are simply contiguous with that header name. It also uses the "No" field values to identify the number of fields: the spacing between consecutive numeric lines gives the number of fields per row, which requires that the "No" values are the only lines that convert to numbers.
R = '([^\n]+\n)*No(\n[^\n]+)+'; % regular expression, lines contiguous around "No".
S = dir('sample*.txt');
N = numel(S);
C = cell(1,N);
for k = 1:N
    str = fileread(S(k).name);
    str = regexprep(str,'\r\n','\n'); % replace Windows newlines.
    M = regexp(str,R,'match','once'); % match lines of text file.
    P = regexp(M,'\n','split'); % split lines into cell array.
    V = str2double(P); % convert lines into numbers.
    D = mean(diff(find(~isnan(V)))); % spacing of the non-NaN (i.e. "No") lines = number of fields.
    H = regexprep(P(1:D),'\s+','_'); % get header lines.
    X = strcmpi(P{D+1},'Enhancement'); % identify superfluous header.
    A = reshape(P(1+X+D:end),D,[]).'; % get data lines.
    T = cell2table(A,'VariableNames',H); % convert data + header into table.
    C{k} = T;
end
Giving:
>> C{:}
ans =
Left_Ventricular_Segments No Perfusion_defect_at_rest Perfusion_defect_at_stress Wall_Motion Delayed_Gadolinium
_________________________ ____ ________________________ __________________________ _____________ __________________
'Basal Anterior' '1' 'Nil' 'Nil' 'Normal' 'Mid wall'
'Basal Anteroseptal' '2' 'Nil' 'Nil' 'Normal' 'Mid wall'
'Basal Inferoseptal' '3' 'Nil' 'Nil' 'Normal' 'Mid wall'
'Basal Inferior' '4' 'Nil' 'Nil' 'Normal' 'Nil'
'Basal Inferolateral' '5' 'Nil' 'Nil' 'Normal' 'Nil'
'Basal Anterolateral' '6' 'Nil' 'Nil' 'Normal' 'Nil'
'Mid Anterior' '7' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Mid Anteroseptal' '8' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Mid Inferoseptal' '9' 'Present' 'Present' 'Hypokinetic' 'Full thickness'
'Mid Inferior' '10' 'Nil' 'Nil' 'Normal' 'Nil'
'Mid Inferolateral' '11' 'Nil' 'Nil' 'Normal' 'Nil'
'Mid Anterolateral' '12' 'Nil' 'Nil' 'Normal' 'Nil'
'Apical Anterior' '13' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Apical Septal' '14' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Apical Inferior' '15' 'Present' 'Present' 'Akinetic' 'Full thickness'
'Apical Lateral' '16' 'Nil' 'Nil' 'Normal' '<50%'
'Apex' '17' 'Nil' 'Nil' 'Dystkinetic' '<50%'
ans =
Left_Ventricular_Segments No Wall_Motion Perfusion_At_Rest Perfusion_At_Stress Delayed_Gadolinium
_________________________ ____ ___________ _________________ ___________________ __________________
'Basal Anterior' '1' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Anteroseptal' '2' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Inferoseptal' '3' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Inferior' '4' 'Normal' 'Nil' 'Nil' '50% (mid wall)'
'Basal Inferolateral' '5' 'Normal' 'Nil' 'Nil' 'Nil'
'Basal Anterolateral' '6' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterior' '7' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anteroseptal' '8' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Inferoseptal' '9' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Inferior' '10' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Inferolateral' '11' 'Normal' 'Nil' 'Nil' 'Nil'
'Mid Anterolateral' '12' 'Normal' 'Nil' 'Nil' 'Nil'
'Apical Anterior' '13' 'Normal' 'Nil' 'Nil' 'Nil'
'Apical Septal' '14' 'Normal' 'Nil' 'Nil' 'Nil'
'Apical Inferior' '15' 'Normal' 'Nil' 'Nil' 'Nil'
'Apical Lateral' '16' 'Normal' 'Nil' 'Nil' 'Nil'
'Apex' '17' 'Normal' 'Nil' 'Nil' 'Nil'
ans =
Left_Ventricular_Segments No Wall_Motion Perfusion_Defect_At_Stress Delayed_Gadolinium
_________________________ ____ _____________ __________________________ __________________
'Basal Anterior' '1' 'Hypokinetic' 'Nil' 'Nil'
'Basal Anteroseptal' '2' 'Hypokinetic' 'Nil' 'Mid wall'
'Basal Inferoseptal' '3' 'Hypokinetic' 'Nil' 'Mid wall'
'Basal Inferior' '4' 'Hypokinetic' 'Nil' 'Nil'
'Basal Inferolateral' '5' 'Hypokinetic' 'Nil' 'Nil'
'Basal Anterolateral' '6' 'Hypokinetic' 'Nil' 'Nil'
'Mid Anterior' '7' 'Hypokinetic' 'Nil' '50%'
'Mid Anteroseptal' '8' 'Hypokinetic' 'Nil' 'Mid wall'
'Mid Inferoseptal' '9' 'Hypokinetic' 'Nil' 'Mid wall'
'Mid Inferior' '10' 'Hypokinetic' 'Possibly' 'Nil'
'Mid Inferolateral' '11' 'Hypokinetic' 'Nil' 'Nil'
'Mid Anterolateral' '12' 'Hypokinetic' 'Nil' 'Nil'
'Apical Anterior' '13' 'Akinetic' 'Nil' '50%'
'Apical Septal' '14' 'Hypokinetic' 'Nil' '50%'
'Apical Inferior' '15' 'Hypokinetic' 'Nil' '50%'
'Apical Lateral' '16' 'Hypokinetic' 'Nil' '< 50%'
'Apex' '17' 'Dyskinetic' 'Nil' '50%'
>>
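To illustrate how the loop above derives the number of fields D from the numeric "No" lines, here is a small self-contained sketch. The header names and values are made up, and the extra assert is my addition (the original code uses mean only):

```matlab
% Hypothetical one-field-per-line block: 3 header lines, then 3 records.
P = {'Segment', 'No', 'Motion', ...
     'Basal Anterior', '1', 'Normal', ...
     'Mid Inferior',   '2', 'Normal', ...
     'Apex',           '3', 'Akinetic'};

V   = str2double(P);        % NaN everywhere except the "No" values
idx = find(~isnan(V));      % positions of the numeric lines
D   = mean(diff(idx));      % spacing between them = fields per record

% Added guard (not in the original code): the spacing must be constant,
% otherwise some other field also converted to a number.
assert(all(diff(idx) == D), 'Numeric lines are not evenly spaced.');

H = P(1:D);                         % the first D lines are the header names
A = reshape(P(D+1:end), D, []).';   % one record per row
```

Here idx is [5 8 11], so D is 3 and A is a 3-by-3 cell array. The guard makes the code fail loudly if, say, a percentage were ever written without its "%" sign, instead of returning a non-integer D.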