Store and Manage Feature Annotations in Objects

Represent Feature Annotations in a GFFAnnotation or GTFAnnotation Object

The GFFAnnotation and GTFAnnotation objects represent a collection of feature annotations for one or more reference sequences. You construct these objects from GFF (General Feature Format) and GTF (Gene Transfer Format) files. Each element in the object represents a single annotation. The properties and methods associated with the objects let you investigate and filter the data based on reference sequence, a feature (such as CDS or exon), or a specific gene or transcript.

Construct an Annotation Object

Use the GFFAnnotation constructor function to construct a GFFAnnotation object from either a GFF- or GTF-formatted file:

GFFAnnotObj = GFFAnnotation('tair8_1.gff')
GFFAnnotObj = 

  GFFAnnotation with properties:

    FieldNames: {1x9 cell}
    NumEntries: 3331

Use the GTFAnnotation constructor function to construct a GTFAnnotation object from a GTF-formatted file:

GTFAnnotObj = GTFAnnotation('hum37_2_1M.gtf')
GTFAnnotObj = 

  GTFAnnotation with properties:

    FieldNames: {1x11 cell}
    NumEntries: 308

Retrieve General Information from an Annotation Object

Determine the field names and the number of entries in an annotation object by accessing the FieldNames and NumEntries properties. For example, to see the field names for each annotation object constructed in the previous section, query the FieldNames property:

GFFAnnotObj.FieldNames
ans = 

  Columns 1 through 6

   'Reference'   'Start'   'Stop'   'Feature'   'Source'   'Score'

  Columns 7 through 9

   'Strand'   'Frame'   'Attributes'

GTFAnnotObj.FieldNames
ans = 

  Columns 1 through 6

   'Reference'   'Start'   'Stop'   'Feature'   'Gene'   'Transcript'

  Columns 7 through 11

   'Source'   'Score'   'Strand'   'Frame'   'Attributes'

Determine the range of the reference sequences that is covered by feature annotations by using the getRange method with the GFFAnnotation object constructed in the previous section:

range = getRange(GFFAnnotObj)
range =

        3631      498516

Access Data in an Annotation Object

Create a Structure of the Annotation Data

Creating a structure of the annotation data lets you access the field values. Use the getData method to create a structure containing a subset of the data in the GFFAnnotation object constructed in the previous section.

% Extract annotations for positions 1 through 10000 of the 
% reference sequence
AnnotStruct = getData(GFFAnnotObj,1,10000)
AnnotStruct = 

60x1 struct array with fields:
    Reference
    Start
    Stop
    Feature
    Source
    Score
    Strand
    Frame
    Attributes

Access Field Values in the Structure

Use dot indexing to access all or specific field values in a structure.

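For example, extract the start positions for all annotations. Because AnnotStruct is a structure array, enclose the expression in square brackets to concatenate the values into a vector:

Starts = [AnnotStruct.Start];

Extract the start positions for annotations 12 through 17. Square brackets are likewise required when indexing a range of elements: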

Starts_12_17 = [AnnotStruct(12:17).Start]
Starts_12_17 =

   4706        5174        5174        5439        5439        5631

Extract the start position and the feature for the 12th annotation:

Start_12 = AnnotStruct(12).Start
Start_12 =

        4706

Feature_12 = AnnotStruct(12).Feature
Feature_12 =

CDS
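
The bracket syntax matters because dot indexing into a structure array yields a comma-separated list rather than a single array. The following minimal sketch illustrates this with a small hand-built structure array; the variable S and its values are hypothetical stand-ins for AnnotStruct and are not taken from tair8_1.gff.

```matlab
% Hypothetical three-element struct array standing in for AnnotStruct
% (values are illustrative only)
S = struct('Start',   {100, 250, 400}, ...
           'Stop',    {180, 320, 475}, ...
           'Feature', {'exon', 'CDS', 'exon'});

% Square brackets concatenate the numeric comma-separated list into a vector
allStarts = [S.Start];

% Curly braces collect the character values into a cell array
allFeatures = {S.Feature};

% Logical indexing on the collected values selects matching annotations
isExon = strcmp(allFeatures, 'exon');
exonStarts = allStarts(isExon);
```

The same pattern should apply to any field of a structure returned by getData: use square brackets for numeric fields and curly braces for text fields.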

Use Feature Annotations with Sequence Read Data

Investigate the results of high-throughput sequencing (HTS) experiments by using GFFAnnotation and GTFAnnotation objects with BioMap objects. For example, you can:

  • Determine counts of sequence reads aligned to regions of a reference sequence associated with specific annotations, such as in RNA-Seq workflows.

  • Find annotations within a specific range of a peak of interest in a reference sequence, such as in ChIP-Seq workflows.

Determine Annotations of Interest

  1. Construct a GTFAnnotation object from a GTF-formatted file:

    GTFAnnotObj = GTFAnnotation('hum37_2_1M.gtf');
    
  2. Use the getReferenceNames method to return the names for the reference sequences for the annotation object:

    refNames = getReferenceNames(GTFAnnotObj)
    refNames = 
    
        'chr2'
    
  3. Use the getFeatureNames method to retrieve the feature names from the annotation object:

    featureNames = getFeatureNames(GTFAnnotObj)
    featureNames = 
    
        'CDS'
        'exon'
        'start_codon'
        'stop_codon'
    
  4. Use the getGeneNames method to retrieve a list of the unique gene names from the annotation object:

    geneNames = getGeneNames(GTFAnnotObj)
    geneNames = 
    
        'uc002qvu.2'
        'uc002qvv.2'
        'uc002qvw.2'
        'uc002qvx.2'
        'uc002qvy.2'
        'uc002qvz.2'
        'uc002qwa.2'
        'uc002qwb.2'
        'uc002qwc.1'
        'uc002qwd.2'
        'uc002qwe.3'
        'uc002qwf.2'
        'uc002qwg.2'
        'uc002qwh.2'
        'uc002qwi.3'
        'uc002qwk.2'
        'uc002qwl.2'
        'uc002qwm.1'
        'uc002qwn.1'
        'uc002qwo.1'
        'uc002qwp.2'
        'uc002qwq.2'
        'uc010ewe.2'
        'uc010ewf.1'
        'uc010ewg.2'
        'uc010ewh.1'
        'uc010ewi.2'
        'uc010yim.1'
    

The previous steps give you a list of the reference sequences, features, and genes associated with the available annotations. Use this information to determine annotations of interest. For instance, you might be interested only in annotations that are exons associated with the uc002qvv.2 gene on chromosome 2.

Filter Annotations

Use the getData method to filter the annotations and create a structure containing only the annotations of interest: exons associated with the uc002qvv.2 gene on chromosome 2.

AnnotStruct = getData(GTFAnnotObj,'Reference','chr2',...
                      'Feature','exon','Gene','uc002qvv.2')
AnnotStruct = 

12x1 struct array with fields:
    Reference
    Start
    Stop
    Feature
    Gene
    Transcript
    Source
    Score
    Strand
    Frame
    Attributes

The returned structure contains 12 elements, indicating that 12 annotations meet your filter criteria.

Extract Position Ranges for Annotations of Interest

After filtering the data to include only annotations that are exons associated with the uc002qvv.2 gene on chromosome 2, use the Start and Stop fields to create vectors of the start and end positions for the ranges associated with the 12 annotations.

StartPos = [AnnotStruct.Start];
EndPos = [AnnotStruct.Stop];
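
Before passing the vectors on, it can be useful to check them and compute the length of each range. The following is a minimal sketch, assuming hypothetical position values in place of the real StartPos and EndPos; the names rangeLengths and totalLength are introduced here for illustration only.

```matlab
% Hypothetical start and end positions standing in for StartPos and EndPos
StartPos = [1000 2500 4000];
EndPos   = [1200 2650 4100];

% Sanity check: every range should end at or after its start
assert(all(EndPos >= StartPos), 'Found a range that ends before it starts.');

% GFF and GTF coordinates are 1-based and inclusive,
% so add 1 to the difference to get the length of each range
rangeLengths = EndPos - StartPos + 1;

% Combined length of all ranges
totalLength = sum(rangeLengths);
```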

Determine Counts of Sequence Reads Aligned to Annotations

Construct a BioMap object from a BAM-formatted file containing sequence read data aligned to chromosome 2.

BMObj3 = BioMap('ex3.bam');

Then use the start and end positions of the annotations of interest as inputs to the getCounts method of the BioMap object. Setting the 'independent' option to true treats each range separately, so the method returns one count of aligned short reads per annotation.

counts = getCounts(BMObj3,StartPos,EndPos,'independent', true)
counts =

        1399
           1
          54
         221
          97
         125
           0
           1
           0
          65
           9
          12
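
A few base MATLAB operations are enough to summarize the result. The sketch below re-enters the counts shown above as a literal vector so that it runs without the BAM file; the names totalReads, uncovered, maxCount, and maxIdx are introduced here for illustration only.

```matlab
% Counts copied from the getCounts output above
counts = [1399; 1; 54; 221; 97; 125; 0; 1; 0; 65; 9; 12];

% Sum of the per-annotation counts
totalReads = sum(counts);

% Indices of annotations with no aligned reads
uncovered = find(counts == 0);

% Largest count and the index of the annotation that has it
[maxCount, maxIdx] = max(counts);
```

Because each range is counted independently, a read that overlaps more than one range would contribute to more than one entry, so totalReads is a sum of per-range counts and not necessarily a count of distinct reads.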