How can I replace an element in a matrix with a vector?

I have sequences stored as character arrays. I need to search for particular characters and replace them with vectors (Boolean representations).
So in the end I need a 3-D matrix.
It worked for one sequence, but I have 96000 more. I tried to do it with loops but I get an error.
This is my code for one sequence, but I need to do the same for 96000 sequences.
I need your help with this issue. Thanks in advance.
p1_1 = sequences;
% select the first sequence and convert it to a character array
Chp1_1 = char(p1_1(1,:));
% go through the sequence and replace every character with its Boolean representation
SeqL = length(Chp1_1);
for i = 1:SeqL
    X = Chp1_1(1,i);
    switch X
        case 'A'
            M(i,:) = A1;
        case 'C'
            M(i,:) = C1;
        case 'D'
            M(i,:) = D1;
        case 'E'
            M(i,:) = E1;
        case 'F'
            M(i,:) = F1;
        case 'G'
            M(i,:) = G1;
        case 'H'
            M(i,:) = H1;
        case 'I'
            M(i,:) = I1;
        case 'K'
            M(i,:) = K1;
        case 'L'
            M(i,:) = L1;
        case 'M'
            M(i,:) = M1;
        case 'N'
            M(i,:) = N1;
        case 'P'
            M(i,:) = P1;
        case 'Q'
            M(i,:) = Q1;
        case 'R'
            M(i,:) = R1;
        case 'S'
            M(i,:) = S1;
        case 'T'
            M(i,:) = T1;
        case 'V'
            M(i,:) = V1;
        case 'W'
            M(i,:) = W1;
        case 'Y'
            M(i,:) = Y1;
    end
end
  4 Comments
Duygu Geçkin on 26 Nov 2019
Thank you very much for the information about indexing.
I only used the protein_1 notation to explain. My sequence data are in the {1,96000} string array.
Indexing is very efficient, but I am confused about how to apply one piece of code to every column.
Guillaume on 26 Nov 2019
Edited: Guillaume on 26 Nov 2019
It's important to use notation that actually reflects your data. Otherwise, the code we give you might not work. It's also important to use the proper notation, because now we're left wondering:
  • Do you have numbered variables, as per your Protein_1, Protein_2, etc.?
  • Do you have a cell array of char vectors, as per your "{1,96000}", which is cell array notation?
  • Do you have a string array, as per your "in the [...] string array"?
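
One way to settle these questions is to inspect the variable directly. This is only a minimal sketch: the variable name `sequences` and its toy content are assumptions standing in for the real data.

```matlab
% Hypothetical stand-in for the real data: a 1x3 cell array of char vectors.
sequences = {'ACD', 'CDA', 'DAC'};

% Report how the data are actually stored.
disp(class(sequences));      % storage class, e.g. cell, string or char
disp(size(sequences));       % dimensions of the container
disp(iscellstr(sequences));  % true only for a cell array of char vectors

% cellstr accepts a char matrix, a cell array of char vectors or (in MATLAB)
% a string array, and returns a cell array of char vectors in every case.
protein = cellstr(sequences);
```

After this normalisation step, `protein{p}` is a char vector whichever of the three storage forms the data started in.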


Answers (3)

Guillaume on 25 Nov 2019
First, probably the most important thing: numbered or sequentially named variables are always a very bad idea. They always make the code more complicated, not easier, to write. For example, with your protein_1, protein_2, ... protein_96000 you cannot easily apply the same code to each variable, whereas if you just had one variable, for example a cell array called protein, you could just use a loop to apply the same code to each:
for p = 1:numel(protein)
    dosomethingwith(protein{p});
end
Same with your horrible switch...case and your A1, C1, etc. You end up rewriting the same thing many times with only one variation, with an increased risk of making a mistake on one line. Computers are very good at doing repetitive things, so why do you end up doing the repetition yourself?
Anything that is numbered or sequentially named should be just one variable that you index instead.
So, with regards to your transformation, first create two variables, the first one the list of letters to transform and the second one what they need to be transformed into, e.g.:
letters = 'ACDEFGHIKLMNPQRSTVWY'.'; %column vector of letters (same 20 letters as the switch cases)
acid = [1 0 0 0 0;
        0 1 0 0 0;
        0 0 1 0 0;
        0 0 0 1 0;
        ..etc. (one row per letter, in the same order as letters)
       ];
For pretty display we could even put them into a table:
map = table(letters, acid);
Now that we have that, transforming a sequence of letters into a 2D matrix is trivial:
prot = 'ACDKLMEGAC'; %content and length don't matter
[found, whichrow] = ismember(prot, map.letters); %find which row of letters corresponds to each letter of prot
assert(all(found), 'some letters of the input are invalid');
transformed = map.acid(whichrow, :); %and use the corresponding rows of acid instead
%all done!
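
For a version that runs as is, here is a minimal sketch of the same lookup without the table. The one-hot encoding built with `eye` is only an assumption for illustration; substitute the real Boolean vectors, keeping one row per letter.

```matlab
% Assumption: one-hot encoding, one column per letter. Replace acid with
% the real Boolean vectors; its row count must match numel(letters).
letters = 'ACDEFGHIKLMNPQRSTVWY'.';   % 20x1 column of amino-acid letters
acid = eye(numel(letters));           % 20x20, row k encodes letters(k)

prot = 'ACDKLMEGAC';
[found, whichrow] = ismember(prot, letters);   % row of letters for each character
assert(all(found), 'some letters of the input are invalid');
transformed = acid(whichrow, :);      % numel(prot)-by-20 matrix
disp(size(transformed));
```

Row k of `transformed` is the encoding of `prot(k)`, so the result has one row per character and one column per bit of the encoding.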
And assuming protein is the above-mentioned cell array, where all the sequences are the same length, then:
transformed = zeros(numel(protein{1}), size(map.acid, 2), numel(protein)); %preallocated 3D array
for p = 1:numel(protein)
    [found, whichrow] = ismember(protein{p}, map.letters); %find which row of letters corresponds to each letter of protein{p}
    assert(all(found), 'some letters of protein %d are invalid', p);
    transformed(:, :, p) = map.acid(whichrow, :); %and use the corresponding rows of acid instead
end
See how short the code can be once you don't have numbered variables and use indexing instead?
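
When all sequences really do have the same length, the loop can also be replaced by a single lookup over a char matrix. This is a sketch of an alternative, not part of the answer above; the three-letter alphabet, the one-hot rows and the toy sequences are all made up for illustration.

```matlab
% Hypothetical toy data: three letters, one-hot rows, equal-length sequences.
letters = 'ACD'.';
acid = eye(3);
protein = {'ACD', 'DCA', 'CCA', 'ADC'};

seqs = vertcat(protein{:});                  % N-by-L char matrix (errors if lengths differ)
[N, L] = size(seqs);
[found, whichrow] = ismember(seqs, letters); % N-by-L matrix of row indices
assert(all(found(:)), 'some letters are invalid');

K = size(acid, 2);
rows = acid(whichrow(:), :);                 % (N*L)-by-K, sequence index varies fastest
transformed = permute(reshape(rows, N, L, K), [2 3 1]);   % L-by-K-by-N
```

As in the loop version, `transformed(:, :, p)` is the encoded form of `protein{p}`. The `permute` call is needed because the column-major lookup orders the rows by sequence first and by position second.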

Philippe Lebel on 25 Nov 2019
I am not sure what you are trying to do as a whole, but if you want to quickly find the occurrences of a certain string, use strfind().
a = 'aasdasffwfdasda';
your_sequence_of_bools_for_letter_a = [true false true];
idx = strfind(a,'a')
idx =
1 2 5 12 15
M = cell(1,length(a));
for i = 1:length(idx)
    M{idx(i)} = your_sequence_of_bools_for_letter_a;
end
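
For a single character, the same positions can also be found with an element-wise comparison, which gives a logical mask that can index the cell array directly. A minimal sketch, where the name `bools_for_a` and its value are made up for illustration:

```matlab
a = 'aasdasffwfdasda';
bools_for_a = [true false true];   % hypothetical encoding for the letter 'a'

mask = (a == 'a');                 % logical vector, true where a(k) is 'a'
M = cell(1, numel(a));
M(mask) = {bools_for_a};           % fill every matching cell in one assignment
```

Cells at positions where `mask` is false stay empty, exactly as in the loop version.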
  1 Comment
Duygu Geçkin on 25 Nov 2019
Thank you very much.
Actually, at first I used the strfind command, but:
I have sequences like protein_1 = 'ACD...'
protein_2 = 'CDA...'
:
:
protein_96000 = 'DAC...'
To represent the amino acids in my sequences, I will replace them with particular vectors.
For example: A = [1 0 0 0 0] C = [0 1 0 0 0] D = [0 0 1 0 0]
The Boolean representation of Protein_1 will be [1 0 0 0 0; 0 1 0 0 0; 0 0 1 0 0; ...] (2 dimensions)
So I decided to convert my sequences from strings into character arrays and use switch/case to find the amino acid characters and replace them with Boolean vectors. But I have to apply that process to all the protein sequences. I am confused about how to put the proteins in a for loop and write the Boolean representations into the new 3-D matrix.
I hope I have explained it more clearly this time.



Philippe Lebel on 25 Nov 2019
Now I understand.
Here is a solution that you can easily expand.
clear
% lookup table: one struct element per letter (toy three-letter alphabet)
protein(1).name = 'A';
protein(1).bool_value = [1 0 0];
protein(2).name = 'B';
protein(2).bool_value = [0 1 0];
protein(3).name = 'C';
protein(3).bool_value = [0 0 1];
protein_name_list = [protein.name]; % 'ABC'
sequences = ['ABC';'CCC';'CAB']; % one sequence per row
M = cell(1, size(sequences, 1)); % one cell per sequence
for i = 1:size(sequences, 1)
    resulting_bool = [];
    sequence = sequences(i,:);
    for j = 1:length(sequence)
        idx = strfind(protein_name_list, sequence(j)); % position of the letter in the lookup table
        resulting_bool = [resulting_bool; protein(idx).bool_value];
    end
    M{i} = resulting_bool;
end
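
Since the question asks for a 3-D matrix rather than a cell array, the cells can be stacked along the third dimension once the loop above has finished. A minimal sketch, assuming every cell holds a matrix of the same size; the toy `M` below is a hypothetical stand-in with the same layout as the loop's output.

```matlab
% Hypothetical cell array with the same layout as the loop's output:
% one L-by-K matrix per sequence, all of the same size.
M = {[1 0 0; 0 1 0; 0 0 1], [0 0 1; 0 0 1; 0 0 1], [0 0 1; 1 0 0; 0 1 0]};

transformed = cat(3, M{:});   % L-by-K-by-N array, page p equals M{p}
disp(size(transformed));
```

If the sequences have different lengths, `cat` raises an error, and in that case the cell array is the more natural container.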
