Getting strings to combine multiple times

OK, I have a school project in which I have to group a DNA sequence of 550437 letters into codons. At the moment I have it stored as a cell array, with 1 letter per cell across 550437 cells. I have to show how many times AAA, ATC, and CGG show up in that sequence without overlap. I also have to show the location of the first 10 of each.
I've tried reshaping from 550437x1 to 183479x3, but the result isn't ordered every third letter from left to right. Column 1 holds the first 183479 letters, column 2 holds the second 183479, and column 3 holds the final set.
I would like either to group every 3 cells into one cell, or just to get the numeric positions at which my selected sequence shows up. Here's what I have so far to count how many times each sequence shows up. What I can't figure out is how to find where the first 10 instances of each show up.
x=1;
i=1; %%% Variable for AAA
h=1; %%% Variable for ATC
t=1; %%% Variable for CGG
AAAmatch=0; %%% Sets up for exact match
ATCmatch=0; %%% Sets up for exact match
CGGmatch=0; %%% Sets up for exact match
AAAcount=0; %%% Counter for AAA match
ATCcount=0; %%% Counter for ATC match
CGGcount=0; %%% Counter for CGG match
%%% Locates AAA match in entire sequence without overlap
for i=1:length(DNA)-2
    if strcmp(DNA(i),'A')
        AAAmatch=AAAmatch+1;
    end
    if strcmp(DNA(i+1),'A')
        AAAmatch=AAAmatch+1;
    end
    if strcmp(DNA(i+2),'A')
        AAAmatch=AAAmatch+1;
    end
    if AAAmatch==3
        AAAcount=1+AAAcount;
    end
    AAAmatch=0;
end
%%% Locates ATC match in entire sequence without overlap
for h=1:length(DNA)-2
    if strcmp(DNA(h),'A')
        ATCmatch=ATCmatch+1;
    end
    if strcmp(DNA(h+1),'T')
        ATCmatch=ATCmatch+1;
    end
    if strcmp(DNA(h+2),'C')
        ATCmatch=ATCmatch+1;
    end
    if ATCmatch==3
        ATCcount=1+ATCcount;
    end
    ATCmatch=0;
end
%%% Locates CGG match in entire sequence without overlap
for t=1:length(DNA)-2
    if strcmp(DNA(t),'C')
        CGGmatch=CGGmatch+1;
    end
    if strcmp(DNA(t+1),'G')
        CGGmatch=CGGmatch+1;
    end
    if strcmp(DNA(t+2),'G')
        CGGmatch=CGGmatch+1;
    end
    if CGGmatch==3
        CGGcount=1+CGGcount;
    end
    CGGmatch=0;
end
Thoughts?
  1 Comment
Azzi Abdelmalek on 28 Apr 2016
You can make your question clearer and briefer by posting a small example together with the expected result. You can also add some explanation.


Answers (1)

Walter Roberson on 28 Apr 2016
Consider using strfind(). You will need to add some logic to handle potential overlaps, for example where the final character of one match is also the first character of the next. Also, if you had something like 'AAAA', then strfind() of 'AAA' will return both 1 and 2 (that is, strfind does not care about overlaps). Still, strfind() will give you candidate positions that you can winnow down.
What would you want the result to be if there was 'AAATCGG' in the sequence? Is that one AAA and one CGG, or is it one ATC?
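The winnowing step can be sketched as follows. This is only a minimal illustration, assuming that "without overlap" means a match may not reuse letters from the previously accepted match of the same pattern; if it instead means that matches must sit on codon boundaries, a different filter is needed. The short sample sequence and the variable names (candidates, nextFree, positions, firstTen) are made up for the example.

```matlab
% Hypothetical short sequence for illustration (not the real data).
DNA = 'AAAAAATCAAAACGG';
pattern = 'AAA';

% strfind returns the start index of every match, overlapping ones included.
candidates = strfind(DNA, pattern);

% Keep a candidate only if it starts after the previous kept match has ended.
keep = false(size(candidates));
nextFree = 1;                        % first index not used by a kept match
for k = 1:numel(candidates)
    if candidates(k) >= nextFree
        keep(k) = true;
        nextFree = candidates(k) + numel(pattern);
    end
end
positions = candidates(keep);

count = numel(positions);                  % number of non-overlapping matches
firstTen = positions(1:min(10, count));    % locations of the first 10 (or fewer)
```

The same loop can be rerun with pattern set to 'ATC' or 'CGG'. It filters each pattern independently, so it does not settle the 'AAATCGG' question raised above.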
  2 Comments
Matthew Zehner on 28 Apr 2016
Edited: Matthew Zehner on 28 Apr 2016
I've tried strfind. Since I'm working with a cell array that has a single letter in each cell, it doesn't work. I need to find AAA, ATC, and CGG individually. On my data strfind only returns [1] for a cell that matches, or [] otherwise, and I only get that when I search for a single letter rather than the 3 letters together. I don't get the numeric positions you would get from a normal string like DNA='ATCAAACGGATCAACGTACAGTCATAC'. That would work rather easily. But since I have an array with over half a million cells, strfind just tells me whether each cell contains the letter I'm looking for. It doesn't tell me the position of a match.
Walter Roberson on 28 Apr 2016
Use horzcat(DNA{:}) and the result will be a single string, which strfind() can then search.
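Once the cells are joined, the original reshape idea also works if the letters are laid out the other way round. The sketch below assumes that the sequence is read in codons starting at letter 1 (positions 1, 4, 7, ...), which may or may not be what the assignment intends. The small cell array and the names seq, codons, codonIdx and startPos are made up for the example.

```matlab
% Hypothetical cell array with one letter per cell, standing in for the real data.
DNA = {'A';'T';'C';'A';'A';'A';'C';'G';'G';'A';'T';'C';'A';'A'};

% Concatenate all cells into one char row vector.
seq = horzcat(DNA{:});

% Drop trailing letters that do not fill a complete codon.
nCodons = floor(numel(seq) / 3);
seq = seq(1:3*nCodons);

% reshape fills column by column, so build a 3-by-N array and transpose
% it to get one codon per row.
codons = reshape(seq, 3, []).';

% Compare every codon against the target.
isATC = strcmp(cellstr(codons), 'ATC');

codonIdx = find(isATC);              % which codons match (1st, 2nd, ...)
startPos = 3*(codonIdx - 1) + 1;     % letter position where each match starts
countATC = numel(codonIdx);
firstTen = startPos(1:min(10, countATC));
```

Reshaping to 3 rows and transposing avoids the column ordering problem described in the question, because each column of the 3-by-N array already holds three consecutive letters.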
