Read text file lines and analyze

1 view (last 30 days)
Lmm3
Lmm3 on 24 Jul 2017
Answered: OCDER on 9 Sep 2017
I would appreciate help with reading and analyzing a text file. The text file (rosalind_gc1.txt) is in this format:
>Rosalind_4949
ACTTCTATGTAGCGCGCTATTTCAAGGGATCGGCCAATAGTACGACGTGTTTCATCTAGT GCGACAAATGTATATACCGTTTTCATTACGTACCACGATAAGTTGAAGCCCGTATTC AGACGCGGGAGCCGTCTGCTGGACAAGTACTAGCTGGTCCATCCTCCCCACCAAAGGGAA
>Rosalind_7490
AACTGGGAATTTCTATATTGGGCGGTAAGCTCGGGGCAATCTATTAGTTGAATGCAACAG TAACAAACTTGCCGTCGGTCGCTGTTCGCGCAGCATTAATAATAACTCTGGCGAGTAGAT
>Rosalind_8337
CCTTGTTGTCTACCCACCAAGTCAGATAGACAGTTGGCTGTCTCCAACGCAGATTTTCTA CGCTTCATGCTCTTGCGACTCATGTCGCCTGGGTTTATTGCTTCTCTACGGGATAACCGC CCGGGCTCACTCTACCCGCGGGAAGGCCGCCCTCTCTCCCGTGTGCCTACATAA
I would like to determine the %GC for the data sets between each “>Rosalind” heading. For example, in the example above there are 3 data sets. The %GC for the text between “>Rosalind_4949” and “>Rosalind_7490” is 48.5876% and between “>Rosalind_7490” and “>Rosalind_8337” is 45.000%.
I’m trying to use the following code but I don’t know how to read the lines as blocks between each “>” and I don’t know how to concatenate the lines as I read them. I would appreciate any help.
fid = fopen('rosalind_gc1.txt');
while ~feof(fid)
templine = fgetl(fid);
a = strcmp(templine, '>');
if a == 0
G = length(strfind(templine,'G'));
C = length(strfind(templine,'C'));
z = length(templine);
%Per = (G+C)*100/z
end
end
Per = (G+C)*100/z

Accepted Answer

Lmm3
Lmm3 on 9 Sep 2017
The following code is what I used to read from the data file and determine %GC:
fid = fopen('rosalind_gc.txt');
n = 1;
G = 0;
C = 0;
z = 1;
while ~feof(fid)
templine = fgetl(fid);
a = strfind(templine, '>');
TF = isempty(a);
if TF == 1;
n= n+1;
G(1) = 0;
C(1) = 0;
z(1) = 0;
G(n) = length(strfind(templine,'G'));
C(n) = length(strfind(templine,'C'));
z(n) = length(templine);
G(n) = G(n) + G(n-1);
C(n) = C(n) + C(n-1);
z(n) = z(n) + z(n-1);
continue
% Per(n) = (G(n)+C(n))*100/z(n)
else TF == 0 ;
Per = (G(end)+C(end))*100/z(end)
disp(templine)
G(:,:) = [];
C(:,:) = [];
z (:,:)=[];
continue
end
end
Per =(G(end)+C(end))*100/z(end)

More Answers (2)

KSSV
KSSV on 24 Jul 2017
Edited: KSSV on 24 Jul 2017
Let data.txt be your text file...You can count the number of G in your file as below:
fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
N = 0 ;
for i = 1:length(S)
N = N+length(strfind(S{i}, 'G'));
end
Without loop :
fid = fopen('data.txt') ;
S = textscan(fid,'%s','delimiter','\n') ;
fclose(fid) ;
S = S{1} ;
Ni = strfind(S,'G') ;
N = sum(cellfun(@numel,Ni)) ;
  1 Comment
Lmm3
Lmm3 on 25 Jul 2017
KSSV thank you for your response. Could you explain to me what the line S = S{1} is doing? The code returns the total number of "G" occurrences for the data file, but do you have a suggestion how to get the "G" occurrences between each of the headers that begin with ">Rosalind"? For example, in the data set above, I would like to get 3 values, the number of G occurrences between (“>Rosalind_4949” and “>Rosalind_7490”) between (“>Rosalind_7490” and “>Rosalind_8337”) and G occurrences below (">Rosalind_8337).

Sign in to comment.


OCDER
OCDER on 9 Sep 2017
If you deal with a lot of fasta files, look into fastaread (Matlab Bioinformatics Toolbox) or readFasta (a code I made for another project).
Also, cellfun and regexp become pretty handy tools.
To get GC %:
[Header, Seq] = readFasta('Seq.txt');
PercGC = cellfun(@(S)length(regexpi(S, 'G|C'))/length(S)*100, Seq);
PercGC =
48.5876
45.0000
55.1724

Categories

Find more on Cell Arrays in Help Center and File Exchange

Tags

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!