Classify Out-of-Memory Text Data Using Deep Learning

This example shows how to classify out-of-memory text data with a deep learning network using a transformed datastore.

A transformed datastore transforms or processes data read from an underlying datastore. You can use a transformed datastore as a source of training, validation, test, and prediction data sets for deep learning applications. Use transformed datastores to read out-of-memory data or to perform specific preprocessing operations when reading batches of data.

When training the network, the software creates mini-batches of sequences of the same length by padding, truncating, or splitting the input data. The trainingOptions function provides options to pad and truncate input sequences; however, these options are not well suited for sequences of word vectors. Furthermore, this function does not support padding data in a custom datastore. Instead, you must pad and truncate the sequences manually. If you left-pad and truncate the sequences of word vectors, then the training might improve.

The Classify Text Data Using Deep Learning example manually truncates and pads all the documents to the same length. This process adds lots of padding to very short documents and discards lots of data from very long documents.

Alternatively, to prevent adding too much padding or discarding too much data, create a transformed datastore that inputs mini-batches into the network. The datastore created in this example converts mini-batches of documents to sequences of word vectors and left-pads each mini-batch to the length of the longest document in the mini-batch.
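The following is a minimal sketch of per-mini-batch left-padding, using only base MATLAB. The sequences are hypothetical placeholder arrays, not real word vectors, and the variable names are assumptions made for illustration. In the example itself, the padding happens inside the transform function rather than in code like this.

```matlab
% Hypothetical mini-batch: three "documents" stored as C-by-S arrays (C = 2).
sequences = {ones(2,3,'single'); ones(2,5,'single'); ones(2,1,'single')};

% Length of the longest sequence in this mini-batch only.
maxLength = max(cellfun(@(x) size(x,2), sequences));

% Left-pad each sequence with zeros up to maxLength.
padded = cellfun(@(x) [zeros(size(x,1), maxLength - size(x,2), 'like', x), x], ...
    sequences, 'UniformOutput', false);

% Every sequence in the mini-batch now has the same length.
sizes = cellfun(@(x) size(x,2), padded)   % [5; 5; 5]
```

Because the target length is computed per mini-batch, a batch of short documents receives little padding, and no document is truncated to fit a global length.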

Load Pretrained Word Embedding

The datastore requires a word embedding to convert documents to sequences of vectors. Load a pretrained word embedding using fastTextWordEmbedding. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.

emb = fastTextWordEmbedding;

Load Data

Create a tabular text datastore from the data in weatherReportsTrain.csv. Specify to read the data from the "event_narrative" and "event_type" columns only.

filenameTrain = "weatherReportsTrain.csv";
textName = "event_narrative";
labelName = "event_type";
ttdsTrain = tabularTextDatastore(filenameTrain,'SelectedVariableNames',[textName labelName]);

View a preview of the datastore.

preview(ttdsTrain)
ans=8×2 table
                                                                                             event_narrative                                                                                                 event_type     
    _________________________________________________________________________________________________________________________________________________________________________________________________    ___________________

    'Large tree down between Plantersville and Nettleton.'                                                                                                                                               'Thunderstorm Wind'
    'One to two feet of deep standing water developed on a street on the Winthrop University campus after more than an inch of rain fell in less than an hour. One vehicle was stalled in the water.'    'Heavy Rain'       
    'NWS Columbia relayed a report of trees blown down along Tom Hall St.'                                                                                                                               'Thunderstorm Wind'
    'Media reported two trees blown down along I-40 in the Old Fort area.'                                                                                                                               'Thunderstorm Wind'
    'A few tree limbs greater than 6 inches down on HWY 18 in Roseland.'                                                                                                                                 'Thunderstorm Wind'
    'Awning blown off a building on Lamar Avenue. Multiple trees down near the intersection of Winchester and Perkins.'                                                                                  'Thunderstorm Wind'
    'Tin roof ripped off house on Old Memphis Road near Billings Drive. Several large trees down in the area.'                                                                                           'Thunderstorm Wind'
    'Powerlines down at Walnut Grove and Cherry Lane roads.'                                                                                                                                             'Thunderstorm Wind'

Transform Datastore

Create a custom transform function that converts data read from the datastore to a table containing the predictors and the responses. The transformTextData function takes the data read from a tabularTextDatastore object and returns a table of predictors and responses. The predictors are C-by-S arrays of word vectors given by the word embedding emb, where C is the embedding dimension and S is the sequence length. The responses are categorical labels over the classes.

To get the class names, read the labels from the training data using the readLabels function, listed at the end of the example, and find the unique class names.

labels = readLabels(ttdsTrain,labelName);
classNames = unique(labels);
numObservations = numel(labels);

Because tabular text datastores can read multiple rows of data in a single read, you can process a full mini-batch of data in the transform function. To ensure that the transform function processes a full mini-batch of data, set the read size of the tabular text datastore to the mini-batch size that will be used for training.

miniBatchSize = 128;
ttdsTrain.ReadSize = miniBatchSize;
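To see how the read size controls the number of rows returned by each read, consider this small sketch. It writes a hypothetical five-row CSV file to a temporary location (the file contents and variable names are assumptions for illustration) and reads it back two rows at a time.

```matlab
% Write a small hypothetical CSV file with 5 rows to a temporary location.
filename = [tempname '.csv'];
tbl = table(["a";"b";"c";"d";"e"], ["x";"y";"x";"y";"x"], ...
    'VariableNames', {'text','label'});
writetable(tbl, filename);

% Create a datastore that returns at most two rows per read.
ttds = tabularTextDatastore(filename);
ttds.ReadSize = 2;

% Read until the datastore is exhausted and record the size of each batch.
batchSizes = [];
while hasdata(ttds)
    data = read(ttds);
    batchSizes(end+1) = height(data); %#ok<AGROW>
end
batchSizes

% Clean up the temporary file.
delete(filename);
```

The final read can return fewer rows than the read size when the number of rows is not a multiple of it.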

To convert the output of the tabular text datastore to sequences for training, transform the datastore using the transform function.

tdsTrain = transform(ttdsTrain, @(data) transformTextData(data,emb,classNames))
tdsTrain = 
  TransformedDatastore with properties:

    UnderlyingDatastore: [1×1 matlab.io.datastore.TabularTextDatastore]
             Transforms: {@(data)transformTextData(data,emb,classNames)}
            IncludeInfo: 0

Preview the transformed datastore. The predictors are C-by-S arrays, where C is the number of features (the embedding dimension) and S is the sequence length. The responses are the categorical labels.

preview(tdsTrain)
ans=8×2 table
       predictors           responses    
    ________________    _________________

    [300×164 single]    Thunderstorm Wind
    [300×164 single]    Heavy Rain       
    [300×164 single]    Thunderstorm Wind
    [300×164 single]    Thunderstorm Wind
    [300×164 single]    Thunderstorm Wind
    [300×164 single]    Thunderstorm Wind
    [300×164 single]    Thunderstorm Wind
    [300×164 single]    Thunderstorm Wind

Create a transformed datastore containing the validation data in weatherReportsValidation.csv using the same steps.

filenameValidation = "weatherReportsValidation.csv";
ttdsValidation = tabularTextDatastore(filenameValidation,'SelectedVariableNames',[textName labelName]);
ttdsValidation.ReadSize = miniBatchSize;
tdsValidation = transform(ttdsValidation, @(data) transformTextData(data,emb,classNames))
tdsValidation = 
  TransformedDatastore with properties:

    UnderlyingDatastore: [1×1 matlab.io.datastore.TabularTextDatastore]
             Transforms: {@(data)transformTextData(data,emb,classNames)}
            IncludeInfo: 0

Create and Train LSTM Network

Define the LSTM network architecture. To input sequence data into the network, include a sequence input layer and set the input size to the embedding dimension. Next, include an LSTM layer with 180 hidden units. To use the LSTM layer for a sequence-to-label classification problem, set the output mode to 'last'. Finally, add a fully connected layer with output size equal to the number of classes, a softmax layer, and a classification layer.

numFeatures = emb.Dimension;
numHiddenUnits = 180;
numClasses = numel(classNames);
layers = [ ...
    sequenceInputLayer(numFeatures)
    lstmLayer(numHiddenUnits,'OutputMode','last')
    fullyConnectedLayer(numClasses)
    softmaxLayer
    classificationLayer];

Specify the training options. Specify the solver to be 'adam' and the gradient threshold to be 2. The datastore does not support shuffling, so set 'Shuffle' to 'never'. Validate the network once per epoch. To monitor the training progress, set the 'Plots' option to 'training-progress'. To suppress verbose output, set 'Verbose' to false.

By default, trainNetwork uses a GPU if one is available (requires Parallel Computing Toolbox™ and a CUDA® enabled GPU with compute capability 3.0 or higher). Otherwise, it uses the CPU. To specify the execution environment manually, use the 'ExecutionEnvironment' name-value pair argument of trainingOptions. Training on a CPU can take significantly longer than training on a GPU.

numIterationsPerEpoch = floor(numObservations / miniBatchSize);

options = trainingOptions('adam', ...
    'MaxEpochs',15, ...
    'MiniBatchSize',miniBatchSize, ...
    'GradientThreshold',2, ...
    'Shuffle','never', ...
    'ValidationData',tdsValidation, ...
    'ValidationFrequency',numIterationsPerEpoch, ...
    'Plots','training-progress', ...
    'Verbose',false);
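As a worked illustration of the validation frequency calculation, the following sketch uses a hypothetical number of observations (the example computes the real value from the training labels).

```matlab
% Hypothetical data set size; the example derives this from the labels.
numObservations = 1000;
miniBatchSize = 128;

% Number of full mini-batches in one pass over the data.
numIterationsPerEpoch = floor(numObservations / miniBatchSize)   % 7

% Multiples of the validation frequency over the first three epochs.
validationIterations = numIterationsPerEpoch * (1:3)             % [7 14 21]
```

Setting 'ValidationFrequency' to this value is what makes validation occur roughly once per epoch.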

Train the LSTM network using the trainNetwork function.

net = trainNetwork(tdsTrain,layers,options);

Test LSTM Network

Create a transformed datastore containing the held-out test data in weatherReportsTest.csv.

filenameTest = "weatherReportsTest.csv";
ttdsTest = tabularTextDatastore(filenameTest,'SelectedVariableNames',[textName labelName]);
ttdsTest.ReadSize = miniBatchSize;
tdsTest = transform(ttdsTest, @(data) transformTextData(data,emb,classNames))
tdsTest = 
  TransformedDatastore with properties:

    UnderlyingDatastore: [1×1 matlab.io.datastore.TabularTextDatastore]
             Transforms: {@(data)transformTextData(data,emb,classNames)}
            IncludeInfo: 0

Read the labels from the tabularTextDatastore.

labelsTest = readLabels(ttdsTest,labelName);
YTest = categorical(labelsTest,classNames);
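Passing classNames to categorical fixes the set of categories to the classes seen during training. As a small sketch with hypothetical labels, any test label that is not in classNames becomes an undefined element:

```matlab
% Hypothetical class names and test labels; "Hail" is not a known class.
classNames = ["Heavy Rain"; "Thunderstorm Wind"];
labelsTest = ["Thunderstorm Wind"; "Hail"; "Heavy Rain"];

% Labels outside classNames are converted to <undefined>.
YTest = categorical(labelsTest, classNames);

% Count the labels that did not match any class.
numUndefined = nnz(isundefined(YTest))   % 1
```

An undefined element never compares equal to a prediction, so such observations count as misclassified in an accuracy calculation.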

Make predictions on the test data using the trained network.

YPred = classify(net,tdsTest,'MiniBatchSize',miniBatchSize);

Calculate the classification accuracy on the test data.

accuracy = mean(YPred == YTest)
accuracy = 0.8293
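Overall accuracy can hide differences between classes. The following sketch, which uses small hypothetical label arrays rather than the example's results, shows one way to compute the accuracy for each class alongside the overall value.

```matlab
% Hypothetical true and predicted labels for five observations.
YTest = categorical(["a";"a";"b";"b";"b"]);
YPred = categorical(["a";"b";"b";"b";"a"], categories(YTest));

% Overall accuracy: fraction of predictions that match the true labels.
accuracy = mean(YPred == YTest)   % 0.6

% Accuracy per class: restrict the comparison to each true class in turn.
classes = categories(YTest);
perClass = zeros(numel(classes),1);
for i = 1:numel(classes)
    idx = YTest == classes{i};
    perClass(i) = mean(YPred(idx) == YTest(idx));
end
perClass
```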

Functions

The readLabels function creates a copy of the tabularTextDatastore object ttds and reads the labels from the labelName column.

function labels = readLabels(ttds,labelName)

ttdsNew = copy(ttds);
ttdsNew.SelectedVariableNames = labelName;
tbl = readall(ttdsNew);
labels = tbl.(labelName);

end
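The copy matters because datastores are handle objects: changing the selected variables on the original would also change what the training datastore reads. This sketch illustrates the pattern on a hypothetical three-row CSV file written to a temporary location.

```matlab
% Write a small hypothetical CSV file to a temporary location.
filename = [tempname '.csv'];
tbl = table(["a";"b";"c"], ["x";"y";"x"], 'VariableNames', {'text','label'});
writetable(tbl, filename);
ttds = tabularTextDatastore(filename);

% Select only the label column on a copy of the datastore.
ttdsNew = copy(ttds);
ttdsNew.SelectedVariableNames = "label";
tblLabels = readall(ttdsNew);
labels = tblLabels.label;

% The original datastore still selects both columns.
numSelectedOriginal = numel(ttds.SelectedVariableNames)   % 2

% Clean up the temporary file.
delete(filename);
```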

The transformTextData function takes the data read from a tabularTextDatastore object and returns a table of predictors and responses. The predictors are C-by-S arrays of word vectors given by the word embedding emb, where C is the embedding dimension and S is the sequence length. The responses are categorical labels over the classes in classNames.

function dataTransformed = transformTextData(data,emb,classNames)

% Preprocess documents.
textData = data{:,1};
textData = lower(textData);
documents = tokenizedDocument(textData);

% Convert to sequences.
predictors = doc2sequence(emb,documents);

% Read labels.
labels = data{:,2};
responses = categorical(labels,classNames);

% Convert data to table.
dataTransformed = table(predictors,responses);

end
