Read and Analyze Large Tabular Text File
This example shows how to create a datastore for a large text file containing tabular data, and then read and process the data one block at a time or one file at a time.
Create a Datastore
Create a datastore from the sample file airlinesmall.csv using the tabularTextDatastore function. When you create the datastore, use the TreatAsMissing name-value argument to specify that the text NA in the data is treated as missing data.
ds = tabularTextDatastore('airlinesmall.csv','TreatAsMissing','NA');
After you create the datastore, you can change its behavior by modifying its properties. Modify the MissingValue property to specify that missing values are treated as 0.
ds.MissingValue = 0;
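Replacing missing values with 0 keeps those rows in any later row count, which can change an average compared with omitting them. The following minimal sketch illustrates this on a small in-memory vector rather than on the datastore; the data and the variable names (delays, asZero, meanWithZero, meanOmitted) are hypothetical and not part of the example.

```matlab
% Hypothetical arrival delays with one missing entry (NaN).
delays = [8; NaN; 4];

% Replacing the missing entry with 0 keeps it in the row count.
asZero = delays;
asZero(isnan(asZero)) = 0;
meanWithZero = sum(asZero)/numel(asZero);   % (8 + 0 + 4)/3 = 4

% Omitting the missing entry drops it from the count as well.
meanOmitted = mean(delays,'omitnan');       % (8 + 4)/2 = 6
```

Which treatment is appropriate depends on what a missing value means in your data.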
In this example, select the variable for the arrival delay, ArrDelay, as the variable of interest.
ds.SelectedVariableNames = 'ArrDelay';
Preview the data using the preview function. This function does not affect the state of the datastore.
data = preview(ds)
data=8×1 table
ArrDelay
________
8
8
21
13
4
59
3
11
Read Subsets of Data
By default, the read function reads 20000 rows at a time from a TabularTextDatastore. To read a different number of rows in each call to read, modify the ReadSize property of ds.
ds.ReadSize = 15000;
Read subsets of the data from ds using the read function in a while loop. The loop executes until hasdata(ds) returns false.
sums = [];
counts = [];
while hasdata(ds)
    T = read(ds);
    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end
Compute the average arrival delay.
avgArrivalDelay = sum(sums)/sum(counts)
avgArrivalDelay = 6.9670
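The average above divides the total of the block sums by the total of the block counts, rather than averaging the per-block means. The difference matters when blocks have unequal sizes, as the last block read from a file often does. The sketch below shows this on hypothetical in-memory data; block1, block2, pooledAvg, and meanOfMeans are illustrative names only.

```matlab
% Hypothetical data split into two blocks of unequal size.
block1 = [10; 20; 30];
block2 = 100;

sums = [sum(block1), sum(block2)];        % [60 100]
counts = [numel(block1), numel(block2)];  % [3 1]

% Pooled average: total sum divided by total count.
pooledAvg = sum(sums)/sum(counts);        % 160/4 = 40

% Averaging the per-block means weights each block equally instead.
meanOfMeans = mean(sums./counts);         % (20 + 100)/2 = 60
```

Here pooledAvg matches the mean of all four values taken together, while meanOfMeans does not.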
Reset the datastore to allow rereading of the data.
reset(ds)
Read One File at a Time
A datastore can contain multiple files, each with a different number of rows. You can read from the datastore one complete file at a time by setting the ReadSize property to 'file'.
ds.ReadSize = 'file';
When you change the value of ReadSize from a number to 'file' or vice versa, MATLAB® resets the datastore.
Read from ds using the read function in a while loop, as before, and compute the average arrival delay.
sums = [];
counts = [];
while hasdata(ds)
    T = read(ds);
    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end
avgArrivalDelay = sum(sums)/sum(counts)
avgArrivalDelay = 6.9670
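Because airlinesmall.csv is a single file, the loop above runs only once. To see file-at-a-time reading across several files, the following sketch writes two small CSV files to a temporary folder and reads them through one datastore. The folder, the file names part1.csv and part2.csv, the data, and the variable names are all hypothetical, and the sketch accumulates running totals instead of growing arrays.

```matlab
% Write two small hypothetical CSV files with different row counts.
folder = tempname;
mkdir(folder);
T1 = table([8; 21; 4], [1; 2; 3], 'VariableNames', {'ArrDelay','DepDelay'});
T2 = table([59; 3], [4; 5], 'VariableNames', {'ArrDelay','DepDelay'});
writetable(T1, fullfile(folder,'part1.csv'));
writetable(T2, fullfile(folder,'part2.csv'));

% Create one datastore over both files and read one file per call.
dsSmall = tabularTextDatastore(folder);
dsSmall.SelectedVariableNames = 'ArrDelay';
dsSmall.ReadSize = 'file';

total = 0;      % running sum of arrival delays
n = 0;          % running row count
numReads = 0;   % number of calls to read
while hasdata(dsSmall)
    blk = read(dsSmall);
    total = total + sum(blk.ArrDelay);
    n = n + height(blk);
    numReads = numReads + 1;
end
avgDelay = total/n;   % 95/5 = 19

% Remove the temporary folder and its files.
rmdir(folder,'s');
```

With two files in the folder, read is called twice, once per file, even though the files contain three and two rows respectively.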
See Also
tabularTextDatastore | tall | mapreduce