Remove intermittent text when reading in a table from a .dat file

2 views (last 30 days)
Hi,
I am trying to use readtable for read in a .dat file. The file looks like this, where there could be 1 to very many entries in the columns that start with a "1'" here.
# NetMHCIIpan version 4.0
# Input is in PEPTIDE format
# Prediction Mode: EL+BA
# Threshold for Strong binding peptides (%Rank) 2%
# Threshold for Weak binding peptides (%Rank) 10%
# Allele: HLA-DPA10103-DPB10101
--------------------------------------------------------------------------------------------------------------------------------------------
Pos MHC Peptide Of Core Core_Rel Identity Score_EL %Rank_EL Exp_Bind Score_BA Affinity(nM) %Rank_BA BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
1 HLA-DPA10103-DPB10101 AAAAAAAAAAAAAAA 3 AAAAAAAAA 0.380 Sequence 0.020745 81.44 NA 0.366182 951.24 32.45
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 2 Number of weak binders: 0
--------------------------------------------------------------------------------------------------------------------------------------------
# Allele: HLA-DPA10103-DPB10201
--------------------------------------------------------------------------------------------------------------------------------------------
Pos MHC Peptide Of Core Core_Rel Identity Score_EL %Rank_EL Exp_Bind Score_BA Affinity(nM) %Rank_BA BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
1 HLA-DPA10103-DPB10201 BBBBBBBBBBBBBBBB 2 BBBBBBBBB 0.960 Sequence 0.491911 1.02 NA 0.712020 22.55 0.27 <=SB
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 2 Number of weak binders: 0
--------------------------------------------------------------------------------------------------------------------------------------------
# Allele: HLA-DPA10103-DPB10202
--------------------------------------------------------------------------------------------------------------------------------------------
Pos MHC Peptide Of Core Core_Rel Identity Score_EL %Rank_EL Exp_Bind Score_BA Affinity(nM) %Rank_BA BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
1[.......]
These columns would then start 2,3,4,[...]. I successfully use
opts = detectImportOptions('filename.dat');
opts.DataLines = [16 Inf];
opts.VariableNamesLine = 14;
readtable(fullfile('path','filename.dat',opts,'ReadVariableNames', true);
for files with a large number of columns between the ----, i.e. e.g.
# Allele: HLA-DPA10103-DPB10101
--------------------------------------------------------------------------------------------------------------------------------------------
Pos MHC Peptide Of Core Core_Rel Identity Score_EL %Rank_EL Exp_Bind Score_BA Affinity(nM) %Rank_BA BindLevel
--------------------------------------------------------------------------------------------------------------------------------------------
1 HLA-DPA10103-DPB10101 AAAAAAAAAAAAAAA 3 AAAAAAAAA 0.380 Sequence 0.020745 81.44 NA 0.366182 951.24 32.45
2 HLA-....
3 ....
....
....
50 HLA....
--------------------------------------------------------------------------------------------------------------------------------------------
Number of strong binders: 2 Number of weak binders: 0
--------------------------------------------------------------------------------------------------------------------------------------------
However, this does not work for short "fillings" and my code very much depends on being robust in either scenario.
I tried playing with the opts but did not get it to work. I would be very grateful for any advice! Maybe a method other than readtable (readtext?) is needed and then a conversion to a table? In the end I will need a table like this:
Thank you very much for your advice! I have spent a long time deleoping the code around this and this is the final part that keeps breaking...

Accepted Answer

Vimal Rathod
Vimal Rathod on 22 Feb 2021
  1 Comment
L. Borealis
L. Borealis on 25 Feb 2021
Thanks, Vimal!
I had actually seen that question but even the question description was not particularly clear to me. So I had left it. Thanks for pointing me back to it. I did use part of it in the end to come up with a working (yet not elegant) solution. Maybe it is useful to someone in the future:
S = regexp(fileread('S:\scratch\cdr1pool2pep1\out_00.dat'), '\r?\n', 'split');
S = S(~cellfun('isempty',S));
if isempty(S{end}); S(end) = []; end %regexp split leaves empty at bottom if file ended in \n which is common
nonheader = cellfun(@isempty, regexp(S, '^\s*#|^\s*-|^\s*P|^\s*N' )); %permit space before #
starts = strfind([false nonheader], [false true]);
stops = strfind([nonheader false], [true false]);
num_blocks = length(starts);
lenRows = length(starts(1):stops(1));
S_temp = cell(num_blocks,lenRows);
for K = 1 : num_blocks
S_temp(K,:) = S(starts(K):stops(K));
end
S = reshape(S_temp,[num_blocks*lenRows,1]);
writecell(S,'data.dat')
opts = detectImportOptions('data.dat');
tbl=readtable(('data.dat'),opts);

Sign in to comment.

More Answers (0)

Categories

Find more on Cell Arrays in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!