getting the nth term out of a sequence

SANGBIN LEE on 29 Feb 2024
Edited: John D'Errico on 29 Feb 2024
% Define the input and output file names
inputFileName = 'KIF11.txt';
outputFileName = 'CDS.txt';
% Read the sequence from the input file
fid = fopen(inputFileName, 'r');
sequence = fscanf(fid, '%c');
fclose(fid);
% Define the start and end positions of the CDS
cdsStart = 155;
cdsEnd = 3358;
% Extract the CDS from the sequence
cdsSequence = sequence(cdsStart:cdsEnd);
% Write the CDS sequence to a new file
fid = fopen(outputFileName, 'w');
fprintf(fid, '%s', cdsSequence);
fclose(fid);
The code above is supposed to extract the 155th through the 3358th characters of the sequence in my text file. When I run it, however, the output appears to cover the 153rd through the 3356th characters instead. Is something wrong with the code?
  3 Comments
Walter Roberson on 29 Feb 2024
sequence = fscanf(fid, '%c');
Beware: the characters returned in sequence will include any end-of-line characters present in the file (possibly carriage return and line feed). Linear indexing into that result is unreliable, because the positions shift depending on whether carriage returns are present or not.
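One quick way to check for this is to count the end-of-line characters in whatever was read. The snippet below is only a sketch: it uses a short made-up string in place of the real file contents, which are not reproduced in this thread.

```matlab
% Synthetic stand-in for the file contents (an assumption, not the real KIF11.txt):
% three 4-base lines, each terminated by CR+LF.
sequence = sprintf('ACGT\r\nTTGA\r\nCCAG\r\n');

nCR = sum(sequence == char(13));   % number of carriage returns
nLF = sum(sequence == char(10));   % number of line feeds

fprintf('%d CR, %d LF, %d characters in total\n', nCR, nLF, numel(sequence));
```

If either count is nonzero, indices into sequence no longer correspond to base positions.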


Answers (1)

Dyuman Joshi on 29 Feb 2024
Edited: Dyuman Joshi on 29 Feb 2024
As @Walter has warned, a carriage return character (\r) is being read along with the data -
% Define the input and output file names
inputFileName = 'KIF11.txt';
outputFileName = 'CDS.txt';
% Read the sequence from the input file
fid = fopen(inputFileName, 'r');
sequence = fscanf(fid, '%c');
fclose(fid);
size(sequence)
ans = 1×2
1 3736
% Expected: the last character of the 1st line and the first character of the 2nd line
% The actual output is different: the second character is not a base
y = sequence(70:71)
y =
'T '
double(y)
ans = 1×2
84 13
Alternatively, you can use textscan here -
Fid = fopen(inputFileName, 'r');
out = textscan(Fid, '%c')
out = 1×1 cell array
{3682×1 char}
fclose(Fid);
seq = out{1};
y = seq(70:71)
y = 2×1 char array
    'T'
    'G'
% Define the start and end positions of the CDS
cdsStart = 155;
cdsEnd = 3358;
% Extract the CDS from the cleaned sequence (seq from textscan, not the raw sequence)
cdsSequence = seq(cdsStart:cdsEnd);
% Write the CDS sequence to a new file
fid = fopen(outputFileName, 'w');
fprintf(fid, '%s', cdsSequence);
fclose(fid);
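To see how far the raw indices drift, the offset can be worked out directly. This is only a sketch under two assumptions inferred from the output above: each line holds 70 bases, and each line ends with a single extra character (the carriage return with code 13). The variable names are illustrative.

```matlab
lineLen = 70;        % bases per line (assumption, inferred from sequence(70:71) above)
eolLen  = 1;         % end-of-line characters per line (assumption: one CR)
n       = [155 3358]; % base positions wanted (cdsStart and cdsEnd)

% Each complete line before base n adds eolLen extra characters to the raw index.
rawIdx = n + eolLen * floor((n - 1) / lineLen);
disp(rawIdx)
```

Under those assumptions, base 155 sits at raw position 157, which matches the two-character shift seen at the start of the extracted sequence.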
  1 Comment
John D'Errico on 29 Feb 2024
Edited: John D'Errico on 29 Feb 2024
+1. I was going to point this out:
find(~ismember(sequence,'CAGT'))
ans =
Columns 1 through 8
71 142 213 284 355 426 497 568
Columns 9 through 16
639 710 781 852 923 994 1065 1136
Columns 17 through 24
1207 1278 1349 1420 1491 1562 1633 1704
Columns 25 through 32
1775 1846 1917 1988 2059 2130 2201 2272
Columns 33 through 40
2343 2414 2485 2556 2627 2698 2769 2840
Columns 41 through 48
2911 2982 3053 3124 3195 3266 3337 3408
Columns 49 through 54
3479 3550 3621 3692 3735 3736
So there are two invisible characters in there before index 155 (at positions 71 and 142). They fall exactly where the carriage return characters lie, at the end of each 70-character line. That explains why the start of the extracted sequence is off by exactly 2 characters.
If you delete those elements first, indexing into the repaired string will work.
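A minimal sketch of that repair, again using a short made-up string rather than the real file (the string and the index values here are assumptions chosen for illustration):

```matlab
% Synthetic stand-in for the raw file contents (not the real KIF11.txt).
raw = sprintf('ACGT\r\nTTGA\r\nCCAG\r\n');

% Keep only the nucleotide letters; this drops CR, LF and anything else.
% Note that ismember is case-sensitive, so lowercase bases or N would be dropped too.
clean = raw(ismember(raw, 'ACGT'));

% Indices now count bases only.
startIdx = 3;
endIdx   = 6;
sub = clean(startIdx:endIdx);   % 'GTTT', whereas raw(3:6) would include the CR and LF
```

With the real data, the same pattern would be applied to sequence before taking sequence(cdsStart:cdsEnd).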

Release

R2023b
