Working with Whole Genome Data

Open Script

This example shows how to create a memory mapped file for sequence data and work with it without loading all the genomic sequence into memory. Whole genomes are available for human, mouse, rat, fugu, and several other model organisms. For many of these organisms one chromosome can be several hundred million base pairs long. Working with such large data sets can be challenging as you may run into limitations of the hardware and software that you are using. This example shows one way to work around these limitations in MATLAB®.

Large Data Set Handling Issues

Solving technical computing problems that require processing and analyzing large amounts of data puts a high demand on your computer system. Large data sets take up significant memory during processing and can require many operations to compute a solution. It can also take a long time to access information from large data files.

Computer systems, however, have limited memory and finite CPU speed. Available resources vary by processor and operating system, the latter of which also consumes resources. For example:

32-bit processors and operating systems can address up to 2^32 = 4,294,967,296 = 4 GB of memory (also known as virtual address space). Windows® XP and Windows® 2000 allocate only 2 GB of this virtual memory to each process (such as MATLAB). On UNIX®, the virtual memory allocated to a process is system-configurable and is typically around 3 GB. The application carrying out the calculation, such as MATLAB, can require storage in addition to the user task. The main problem when handling large amounts of data is that the memory requirements of the program can exceed that available on the platform. For example, MATLAB generates an "out of memory" error when data requirements exceed approximately 1.7 GB on Windows XP.

For more details on memory management and large data sets, see Performance and Memory.

On a typical 32-bit machine, the maximum size of a single data set that you can work with in MATLAB is a few hundred MB, or about the size of a large chromosome. Memory mapping of files allows MATLAB to work around this limitation and enables you to work with very large data sets in an intuitive way.

Whole Genome Data Sets

The latest whole genome data sets can be downloaded from the Ensembl Website. The data are provided in several formats. These are updated regularly as new sequence information becomes available. This example will use human DNA data stored in FASTA format. Chromosome 1 is (in the GRCh37.56 Release of September 2009) a 65.6 MB compressed file. After uncompressing the file it is about 250MB. MATLAB uses 2 bytes per character, so if you read the file into MATLAB, it will require about 500MB of memory.

This example assumes that you have already downloaded and uncompressed the FASTA file into your local directory. Change the name of the variable FASTAfilename if appropriate.

FASTAfilename = 'Homo_sapiens.GRCh37.56.dna.chromosome.1.fa';
fileInfo = dir(which(FASTAfilename));

Memory Mapped Files

Memory mapping allows MATLAB to access data in a file as though it is in memory. You can use standard MATLAB indexing operations to access data. See the documentation for memmapfile for more details.

You could just map the FASTA file and access the data directly from there. However the FASTA format file includes new line characters. The memmapfile function treats these characters in the same way as all other characters. Removing these before memory mapping the file will make indexing operations simpler. Also, memory mapping does not work directly with character data so you will have to treat the data as 8-bit integers (uint8 class). The function nt2int in the Bioinformatics Toolbox™ can be used to convert character information into integer values. int2nt is used to convert back to characters.

First open the FASTA file and extract the header.

fidIn = fopen(FASTAfilename,'r');
header = fgetl(fidIn)

header =

    '>1 dna:chromosome chromosome:GRCh37:1:1:249250621:1'

Open the file to be memory mapped.

[fullPath, filename, extension] = fileparts(FASTAfilename);
mmFilename = [filename '.mm']
fidOut = fopen(mmFilename,'w');

mmFilename =

    'Homo_sapiens.GRCh37.56.dna.chromosome.1.mm'

Read the FASTA file in blocks of 1MB, remove new line characters, convert to uint8, and write to the MM file.

newLine = sprintf('\n');
blockSize = 2^20;
while ~feof(fidIn)
    % Read in the data
    charData = fread(fidIn,blockSize,'*char')';
    % Remove new lines
    charData = strrep(charData,newLine,'');
    % Convert to integers
    intData = nt2int(charData);
    % Write to the new file
    fwrite(fidOut,intData,'uint8');
end

Close the files.

fclose(fidIn);
fclose(fidOut);

The new file is about the same size as the old file but does not contain new lines or the header information.

mmfileInfo = dir(mmFilename);

Accessing the Data in the Memory Mapped File

The command memmapfile constructs a memmapfile object that maps the new file to memory. In order to do this, it needs to know the format of the file. The format of this file is simple, though much more complicated formats can be mapped.

chr1 = memmapfile(mmFilename, 'format', 'uint8');

The MEMMAPFILE Object

The memmapfile object has various properties. Filename stores the full path to the file. Writable indicates whether or not the data can be modified. Note that if you do modify the data, this will also modify the original file. Offset allows you to specify the space used by any header information. Format indicates the data format. Repeat is used to specify how many blocks (as defined by Format) to map. This can be useful for limiting how much memory is used to create the memory map. These properties can be accessed in the same way as other MATLAB data. For more details see type help memmapfile or doc memmapfile.

chr1.Data(1:10)

ans =

  10×1 uint8 column vector

   15
   15
   15
   15
   15
   15
   15
   15
   15
   15

You can access any region of the data using indexing operations.

chr1.Data(10000000:10000010)'

ans =

  1×11 uint8 row vector

   1   1   2   2   2   2   3   4   2   4   2

Remember that the nucleotide information was converted to integers. You can use int2nt to get the sequence information back.

int2nt(chr1.Data(10000000:10000010)')

ans =

    'AACCCCGTCTC'

Or use seqdisp to display the sequence.

seqdisp(chr1.Data(10000000:10001000)')

ans =

  17×71 char array

    '   1  AACCCCGTCT CTACAATAAA TTAAAATATT AGCTGGGCAT GGTGGTGTGT GCTTGTAGTC'
    '  61  CCAGCTACTT GGCGGGCTGA GGTGGGAGAA TCATCCAAGC CTTGGAGGCA GAGGTTGCAG'
    ' 121  TGAGCTGAGA TTGTGACACT GCACTCCAGC CTGGGAGACA GAGTGAGACT CCTACTCAAA'
    ' 181  AAAAAACAAA AAACAAAAAA CAAACCACAA AACTTTCCAG GTAACTTATT AAAACATGTT'
    ' 241  TTTTGTTTGT TTTGAGACAG AGTCTTGCTC TGTCGCCCAG GCTGGAGTGC AGTGGAGCAA'
    ' 301  TCTCAGCTCA CTGCAAGCTC CGCCTCCCGG GTTCACACCA TTCTCCTGCC TCAGCCTCCC'
    ' 361  GAGTAGCTAG GACTATAGGC ACCCGCCACC ACGCCCAGCT TATTTTTTTT GTATTTTTTA'
    ' 421  GTAGAGACGG GGTTTCATCG TGTTAGCCAG GATGGTCTCG ATCTCCTGAC CTCGTGATCC'
    ' 481  GCCCACCTCA GCCTCCCAAA GTGCTGGGAT TACAGGCGTG AGCCACTGCA CCCGGCCTAG'
    ' 541  TTTTTGTATA TTTTTTTTAG TAGAGACAGG GTTTCACCAT GTTAGCCAGG ATGGTCTCAA'
    ' 601  TCTCCTGACC TCGTGATCCG CCCGCCTCGG CCTCCCAAAG TGCTGGGGTT ACAGGCGTGA'
    ' 661  GCCACCGCAC ACAGCATTAA AGCATGTTTT ATTTTCCTAC ACATAATGAA ATCATTACCA'
    ' 721  GATGATTTGA CATGTGTACT TCATTGGAGA GGATTCTTAC AGTATATTCA AAATTAAATA'
    ' 781  TAATGACAAA AAATTACTAC CTAATCTATT AAAATTGGCA TAAGTCATCT ATGATCATTA'
    ' 841  ATGATATGCA AACATAAACA AGTATTATAC CCAGAAGTGT AATTTATTGT AGCTACATCT'
    ' 901  TATGTATAAT AGTTTAGTGG ATTTTTCCTG GAAATTGTCC ATTTTAATTT TTCTCTTAAG'
    ' 961  TCTGTGGAAT TTTCCAGTAA AAGTCAAGGC AAACCCAAGA T                    '

Analysis of the Whole Chromosome

Now that you can easily access the whole chromosome, you can analyze the data. This example shows one way to look at the GC content along the chromosome.

You extract blocks of 500000bp and calculate the GC content.

Calculate how many blocks to use.

numNT = numel(chr1.Data);
blockSize = 500000;
numBlocks = floor(numNT/blockSize);

One way to help MATLAB performance when working with large data sets is to "preallocate" space for data. This allows MATLAB to allocate enough space for all of the data rather than having to grow the array in small chunks. This will speed things up and also protect you from problems of the data getting too large to store. For more details on pre-allocating arrays, see: https://www.mathworks.com/matlabcentral/answers/99124-how-do-i-pre-allocate-memory-when-using-matlab

An easy way to preallocate an array is to use the zeros function.

ratio = zeros(numBlocks+1,1);

Loop over the data looking for C or G and then divide this number by the total number of A, T, C, and G. This will take about a minute to run.

A = nt2int('A'); C = nt2int('C'); G = nt2int('G'); T = nt2int('T');

for count = 1:numBlocks
    % calculate the indices for the block
    start = 1 + blockSize*(count-1);
    stop = blockSize*count;
    % extract the block
    block = chr1.Data(start:stop);
    % find the GC and AT content
    gc = (sum(block == G | block == C));
    at = (sum(block == A | block == T));
    % calculate the ratio of GC to the total known nucleotides
    ratio(count) = gc/(gc+at);
end

The final block is smaller so treat this as a special case.

block = chr1.Data(stop+1:end);
gc = (sum(block == G | block == C));
at = (sum(block == A | block == T));
ratio(end) = gc/(gc+at);

Plot of the GC Content for the Homo Sapiens Chromosome 1

xAxis = [1:blockSize:numBlocks*blockSize, numNT];
plot(xAxis,ratio)
xlabel('Base pairs');
ylabel('Relative GC content');
title('Relative GC content of Homo Sapiens Chromosome 1')

The region in the center of the plot around 140Mbp is a large region of Ns.

seqdisp(chr1.Data(140000000:140001000))

ans =

  17×71 char array

    '   1  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    '  61  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 121  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 181  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 241  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 301  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 361  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 421  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 481  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 541  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 601  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 661  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 721  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 781  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 841  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 901  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN'
    ' 961  NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN NNNNNNNNNN N                    '

Finding Regions of High GC Content

You can use find to identify regions of high GC content.

indices = find(ratio>0.5);
ranges = [(1 + blockSize*(indices-1)), blockSize*indices];
fprintf('Region %d:%d has GC content %f\n',[ranges ,ratio(indices)]')

Region 500001:1000000 has GC content 0.501412
Region 1000001:1500000 has GC content 0.598332
Region 1500001:2000000 has GC content 0.539498
Region 2000001:2500000 has GC content 0.594508
Region 2500001:3000000 has GC content 0.568620
Region 3000001:3500000 has GC content 0.584572
Region 3500001:4000000 has GC content 0.548137
Region 6000001:6500000 has GC content 0.545072
Region 9000001:9500000 has GC content 0.506692
Region 10500001:11000000 has GC content 0.511386
Region 11500001:12000000 has GC content 0.519874
Region 16000001:16500000 has GC content 0.513082
Region 17500001:18000000 has GC content 0.513392
Region 21500001:22000000 has GC content 0.505598
Region 156000001:156500000 has GC content 0.504446
Region 156500001:157000000 has GC content 0.504090
Region 201000001:201500000 has GC content 0.502976
Region 228000001:228500000 has GC content 0.511946

If you want to remove the temporary file, you must first clear the memmapfile object.

clear chr1
delete(mmFilename)