fastaread

Read data from FASTA file

Description

fastaStruct = fastaread(file) returns the sequence data from the input FASTA file as a structure.

fastaStruct = fastaread(file,Name=Value) uses additional options specified by one or more name-value arguments. For example, seqdata = fastaread(fastafile,IgnoreGaps=true) removes any gap symbol (- or .) from the sequences.

[header,sequence] = fastaread(___) returns the sequence data as separate variables: header and sequence. You can specify any of the input argument combinations in the previous syntaxes. If the file contains multiple sequences, header and sequence are cell arrays that contain the sequence headers and the nucleotide or amino acid sequences, respectively.
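
The two output forms can be compared with a minimal sketch. It assumes Bioinformatics Toolbox is installed and passes the FASTA text directly as a character vector with embedded newlines; the record names and sequences are made up for illustration.

```matlab
% Hypothetical two-record FASTA text; headers and sequences are invented.
fastaText = sprintf('>seq1 first record\nACGTACGT\n>seq2 second record\nTTGACA');

% Single output: a struct array with Header and Sequence fields.
records = fastaread(fastaText);
numRecords = numel(records);   % one element per FASTA record

% Two outputs: headers and sequences as separate variables.
[header, sequence] = fastaread(fastaText);

% With more than one record, both outputs are cell arrays.
isMulti = iscell(header) && iscell(sequence);
```

With a single-record input, header and sequence come back as plain character vectors, so code that handles both cases may need to check iscell before indexing with braces.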

Examples

Read the nucleotide sequence information of the human p53 tumor gene.

p53nt = fastaread("p53nt.txt")
p53nt = struct with fields:
      Header: 'gi|8400737|ref|NM_000546.2| Homo sapiens tumor protein p53 (Li-Fraumeni syndrome) (TP53), mRNA'
    Sequence: 'ACTTGTCATGGCGACTGTCCAGCTTTGTGCCAGGAGCCTCGCAGGGGTTGATGGGATTGGGGTTTTCCCCTCCCATGTGCTCAAGACTGGCGCTAAAAGTTTTGAGCTTCTCAAAAGTCTAGAGCCACCGTCCAGGGAGCAGGTAGCTGCTGGGCTCCGGGGACACTTTGCGTTCGGGCTGGGAGCGTGCTTTCCACGACGGTGACACGCTTCCCTGGATTGGCAGCCAGACTGCCTTCCGGGTCACTGCCATGGAGGAGCCGCAGTCAGATCCTAGCGTCGAGCCCCCTCTGAGTCAGGAAACATTTTCAGACCTATGGAAACTACTTCCTGAAAACAACGTTCTGTCCCCCTTGCCGTCCCAAGCAATGGATGATTTGATGCTGTCCCCGGACGATATTGAACAATGGTTCACTGAAGACCCAGGTCCAGATGAAGCTCCCAGAATGCCAGAGGCTGCTCCCCGCGTGGCCCCTGCACCAGCAGCTCCTACACCGGCGGCCCCTGCACCAGCCCCCTCCTGGCCCCTGTCATCTTCTGTCCCTTCCCAGAAAACCTACCAGGGCAGCTACGGTTTCCGTCTGGGCTTCTTGCATTCTGGGACAGCCAAGTCTGTGACTTGCACGTACTCCCCTGCCCTCAACAAGATGTTTTGCCAACTGGCCAAGACCTGCCCTGTGCAGCTGTGGGTTGATTCCACACCCCCGCCCGGCACCCGCGTCCGCGCCATGGCCATCTACAAGCAGTCACAGCACATGACGGAGGTTGTGAGGCGCTGCCCCCACCATGAGCGCTGCTCAGATAGCGATGGTCTGGCCCCTCCTCAGCATCTTATCCGAGTGGAAGGAAATTTGCGTGTGGAGTATTTGGATGACAGAAACACTTTTCGACATAGTGTGGTGGTGCCCTATGAGCCGCCTGAGGTTGGCTCTGACTGTACCACCATCCACTACAACTACATGTGTAACAGTTCCTGCATGGGCGGCATGAACCGGAGGCCCATCCTCACCATCATCACACTGGAAGACTCCAGTGGTAATCTACTGGGACGGAACAGCTTTGAGGTGCGTGTTTGTGCCTGTCCTGGGAGAGACCGGCGCACAGAGGAAGAGAATCTCCGCAAGAAAGGGGAGCCTCACCACGAGCTGCCCCCAGGGAGCACTAAGCGAGCACTGCCCAACAACACCAGCTCCTCTCCCCAGCCAAAGAAGAAACCACTGGATGGAGAATATTTCACCCTTCAGATCCGTGGGCGTGAGCGCTTCGAGATGTTCCGAGAGCTGAATGAGGCCTTGGAACTCAAGGATGCCCAGGCTGGGAAGGAGCCAGGGGGGAGCAGGGCTCACTCCAGCCACCTGAAGTCCAAAAAGGGTCAGTCTACCTCCCGCCATAAAAAACTCATGTTCAAGACAGAAGGGCCTGACTCAGACTGACATTCTCCACTTCTTGTTCCCCACTGACAGCCTCCCACCCCCATCTCTCCCTCCCCTGCCATTTTGGGTTTTGGGTCTTTGAACCCTTGCTTGCAATAGGTGTGCGTCAGAAGCACCCAGGACTTCCATTTGCTTTGTCCCGGGGCTCCACTGAACAAGTTGGCCTGCACTGGTGTTTTGTTGTGGGGAGGAGGATGGGGAGTAGGACATACCAGCTTAGATTTTAAGGTTTTTACTGTGAGGGATGTTTGGGAGATGTAAGAAATGTTCTTGCAGTTAAGGGTTAGTTTACAATCAGCCACATTCTAGGTAGGTAGGGGCCCACTTCACCGTACTAACCAGGGAAGCTGTCCCTCATGTTGAATTTTCTCTAACTTCAAGGCCCATATCTGTGAAATGCTGGCATTTGCACCTACCTCACAGAGTGCATTGTGAGGGTTAATGAAATAATGTACATCTGGCCTTGAAACCACCTTTTATTACATGGGGTCTAAAACTTGACCCCCTTGAGGGTGCCTGTTCCCTCTCCCTCTCCCTGTTGGCTGGTGGGTT
GGTAGTTTCTACAGTTGGGCAGCTGGTTAGGTAGAGGGAGTTGTCAAGTCTTGCTGGCCCAGCCAAACCCTGTCTGACAACCTCTTGGTCGACCTTAGTACCTAAAAGGAAATCTCACCCCATCCCACACCCTGGAGGATTTCATCTCTTGTATATGATGATCTGGATCCACCAAGACTTGTTTTATGCTCAGGGTCAATTTCTTTTTTCTTTTTTTTTTTTTTTTTTCTTTTTCTTTGAGACTGGGTCTCGCTTTGTTGCCCAGGCTGGAGTGGAGTGGCGTGATCTTGGCTTACTGCAGCCTTTGCCTCCCCGGCTCGAGCAGTCCTGCCTCAGCCTCCGGAGTAGCTGGGACCACAGGTTCATGCCACCATGGCCAGCCAACTTTTGCATGTTTTGTAGAGATGGGGTCTCACAGTGTTGCCCAGGCTGGTCTCAAACTCCTGGGCTCAGGCGATCCACCTGTCTCAGCCTCCCAGAGTGCTGGGATTACAATTGTGAGCCACCACGTGGAGCTGGAAGGGTCAACATCTTTTACATTCTGCAAGCACATCTGCATTTTCACCCCACCCTTCCCCTCCTTCTCCCTTTTTATATCCCATTTTTATATCGATCTCTTATTTTACAATAAAACTTTGCTGCCA'

Read the amino acid sequence information of p53 protein.

p53aa = fastaread("p53aa.txt")
p53aa = struct with fields:
      Header: 'gi|8400738|ref|NP_000537.2| tumor protein p53 [Homo sapiens]'
    Sequence: 'MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPRVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD'

Read a block of entries, the 5th through the 10th sequences, from a FASTA file, and remove the gap symbols from each sequence.

pf2 = fastaread('pf00002.fa',BlockRead=[5 10],IgnoreGaps=true)
pf2 = 6×1 struct array with fields:
    Header
    Sequence

Input Arguments

file

Name of a FASTA-formatted file or sequence information, specified as a character vector, character array, or string scalar.

Specify one of the following:

  • File name, a path and file name, or a URL pointing to a file. The referenced file is a FASTA-formatted file (ASCII text file). If you specify only a file name, that file must be on the MATLAB® search path or in the MATLAB Current Folder.

  • MATLAB character array that contains the text of a FASTA-formatted file.

Each entry in a FASTA-formatted file begins with a right angle bracket (>) followed by a single-line description. The sequence information follows this description as a series of lines. Sequences must use the standard IUB/IUPAC amino acid and nucleotide letter codes.

For a list of codes, see aminolookup and baselookup.
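
As a rough illustration of the two input forms, the sketch below reads the same made-up record once from FASTA text and once from a temporary file given by its full path. The record name, sequence, and temporary file are hypothetical, and the sketch assumes that a character vector with embedded newlines is accepted as FASTA text.

```matlab
% A made-up record: a ">" description line followed by two sequence lines.
fastaText = sprintf('>demo_record hypothetical example\nACGTAC\nGTTGCA');

% Form 1: pass the FASTA text itself.
fromText = fastaread(fastaText);

% Form 2: write the same text to a temporary file and pass its full path.
tmpFile = [tempname '.fa'];
fid = fopen(tmpFile, 'w');
fprintf(fid, '%s\n', fastaText);
fclose(fid);
fromFile = fastaread(tmpFile);
delete(tmpFile);   % clean up the temporary file

% The header omits the leading ">" and the sequence lines are joined.
sameResult = isequal(fromText, fromFile);
```

Passing a full path avoids any dependence on the MATLAB search path or the current folder.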

Data Types: char | string

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: seqdata = fastaread(fastafile,TrimHeaders=true,TimeOut=10)

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: seqdata = fastaread(fastafile,'TrimHeaders',true,'TimeOut',10)

IgnoreGaps

Flag to remove any gap symbols (- or .) from the sequences, specified as a logical 1 (true) or 0 (false).
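
A small sketch of the effect of this flag, using a made-up aligned record passed as FASTA text. The Name=Value syntax requires R2021a or later, and the record content is hypothetical.

```matlab
% Hypothetical aligned record that contains both gap symbols, "-" and ".".
fastaText = sprintf('>aligned_demo hypothetical aligned record\nAC-GT..ACG-T');

% Default behavior keeps the gap symbols in the sequence.
withGaps = fastaread(fastaText);

% IgnoreGaps=true strips the "-" and "." characters.
noGaps = fastaread(fastaText, IgnoreGaps=true);

% Number of gap characters that were removed.
numRemoved = numel(withGaps.Sequence) - numel(noGaps.Sequence);
```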

BlockRead

Sequence entry or block of entries to read from an input file that contains multiple sequences, specified as a positive integer or a 1-by-2 vector of positive integers.

Specify a scalar positive integer n to read in the nth entry in the file.

Specify a two-element vector [m1 m2] to read in a block of entries starting at the m1 entry and ending at the m2 entry. Use Inf for m2 to read all entries in the file starting at m1.

Data Types: double
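
The scalar, two-element, and Inf forms can be tried on a small temporary file, as in the sketch below. The file and its five records (rec1 through rec5) are made up for illustration, and the Name=Value syntax requires R2021a or later.

```matlab
% Write five made-up records, rec1..rec5, to a temporary FASTA file.
tmpFile = [tempname '.fa'];
fid = fopen(tmpFile, 'w');
for k = 1:5
    fprintf(fid, '>rec%d\n%s\n', k, repmat('ACGT', 1, k));
end
fclose(fid);

third   = fastaread(tmpFile, BlockRead=3);        % only the 3rd entry
middle  = fastaread(tmpFile, BlockRead=[2 4]);    % entries 2 through 4
theRest = fastaread(tmpFile, BlockRead=[4 Inf]);  % entry 4 to the end of the file

delete(tmpFile);   % clean up the temporary file
```

Reading a block this way can keep memory use down when only part of a large multi-sequence file is needed.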

TrimHeaders

Flag to trim the header after the first white space, specified as a logical 1 (true) or 0 (false). White space characters include a space (char(32)) and a tab (char(9)).
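
The following sketch shows the trimming on a made-up header whose identifier is followed by a space and a description. The record is hypothetical and is passed as FASTA text; the Name=Value syntax requires R2021a or later.

```matlab
% Hypothetical record: the identifier "rec1" is followed by a description.
fastaText = sprintf('>rec1 hypothetical description text\nACGT');

% Default behavior keeps the whole description line (without the ">").
full = fastaread(fastaText);

% TrimHeaders=true keeps only the text before the first white space.
trimmed = fastaread(fastaText, TrimHeaders=true);
```

Trimmed headers are convenient as short identifiers, for example as keys when matching records across files.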

TimeOut

Connection timeout, in seconds, for reading from a remote EMBL-EBI file, specified as a positive scalar.

Data Types: double

Output Arguments

fastaStruct

Sequence data, returned as a structure. The structure contains the following fields:

Field       Description
Header      Header information.
Sequence    Single-letter code representation of a nucleotide or amino acid sequence.

header

Sequence header information, returned as a character vector or cell array of character vectors.

Data Types: char | cell

sequence

Single-letter code representations of nucleotide or amino acid sequences, returned as a character vector or cell array of character vectors.

Data Types: char | cell

Version History

Introduced before R2006a