Not sure if I set up this neural network correctly

Below is my code, along with information about the variables, for a basic audio classification problem: reading an audio file and determining whether the signal is a car horn or a dog barking. I followed the same format as this tutorial: https://www.mathworks.com/help/audio/gs/classify-sound-using-deep-learning.html.
I'm not sure where I went wrong, but the loss value was not plotted during training, and when I tried to classify a sample file, the result was "<undefined>". I would appreciate any help with this.
% --------------------------------------------------------------
% Loading Training and Evaluation Sets for Car Horn and Dog Bark
% --------------------------------------------------------------
carDataStore = UrbanSound8K(UrbanSound8K.class == "car_horn",:);
carDataStore = carDataStore(carDataStore.salience == 1,:);
dogDataStore = UrbanSound8K(UrbanSound8K.class == "dog_bark",:);
dogDataStore = dogDataStore(dogDataStore.salience == 1,:);
carData = [];
dogData = [];
% Add first 2 seconds of each audiofile to their respective matrices and
% produce labels
for i = 1:height(carDataStore)
    thisfile = "UrbanSound8K\audio\fold" + string(carDataStore(i,:).fold) + "\" + string(carDataStore(i,:).slice_file_name);
    if audioinfo(thisfile).Duration >= 2 && audioinfo(thisfile).SampleRate == 44100
        [y,fs] = audioread(thisfile);
        samples = [1,2*fs];
        clear y fs;
        [y,fs] = audioread(thisfile, samples);
        carData = [carData,y(:,1)];
    end
end
carLabels = repelem(categorical("car horn"),width(carData),1);
for i = 1:height(dogDataStore)
    thisfile = "UrbanSound8K\audio\fold" + string(dogDataStore(i,:).fold) + "\" + string(dogDataStore(i,:).slice_file_name);
    if audioinfo(thisfile).Duration >= 2 && audioinfo(thisfile).SampleRate == 44100
        [y,fs] = audioread(thisfile);
        samples = [1,2*fs];
        clear y fs;
        [y,fs] = audioread(thisfile, samples);
        dogData = [dogData,y(:,1)];
    end
end
dogLabels = repelem(categorical("dog barking"),width(dogData),1);
dogVals = round(0.8*width(dogData));
carVals = round(0.8*width(carData));
audioTrain = [dogData(:,1:dogVals),carData(:,1:carVals)];
labelsTrain = [dogLabels(1:dogVals);carLabels(1:carVals)];
audioValidation = [dogData(:,(dogVals + 1):end),carData(:,(carVals + 1):end)];
labelsValidation = [dogLabels((dogVals + 1):end);carLabels((carVals + 1):end)];
% ---------------------------------------------------------
% Audio Feature Extractor to reduce dimensionality of audio,
% Extracting slope and centroid of mel spectrum over time
% ---------------------------------------------------------
aFE = audioFeatureExtractor("SampleRate",fs, ...
"SpectralDescriptorInput","melSpectrum", ...
"spectralCentroid",true, ...
"spectralSlope",true);
featuresTrain = extract(aFE,audioTrain);
[numHopsPerSequence,numFeatures,numSignals] = size(featuresTrain);
featuresTrain = permute(featuresTrain,[2,1,3]);
featuresTrain = squeeze(num2cell(featuresTrain,[1,2]));
numSignals = numel(featuresTrain);
[numFeatures,numHopsPerSequence] = size(featuresTrain{1});
featuresValidation = extract(aFE,audioValidation);
featuresValidation = permute(featuresValidation,[2,1,3]);
featuresValidation = squeeze(num2cell(featuresValidation,[1,2]));
% ----------------------------------------
% Defining the Neural Network Architecture
% ----------------------------------------
layers = [ ...
sequenceInputLayer(numFeatures)
lstmLayer(50,"OutputMode","last")
fullyConnectedLayer(numel(unique(labelsTrain)))
softmaxLayer
classificationLayer];
options = trainingOptions("adam", ...
"Shuffle","every-epoch", ...
"ValidationData",{featuresValidation,labelsValidation}, ...
"Plots","training-progress", ...
"Verbose",false);
net = trainNetwork(featuresTrain,labelsTrain,layers,options);

Accepted Answer

Brian Hemmat on 28 Dec 2020
Edited: Brian Hemmat on 28 Dec 2020
Hi Saketh,
I believe the example you're following is more of a 'hello-world' type example--your current code is trying to accomplish something more difficult. You'll probably need to extract features with more information, and depending on your end goal, also apply standardization.
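As a rough sketch of the standardization mentioned above: assuming the features are stored as a cell array with one numFeatures-by-numHops matrix per signal, you could compute the per-feature mean and standard deviation over all training frames and apply those same statistics to the validation set. The random data and the names mu, sigma, and standardize are placeholders for illustration, not taken from your script.

```matlab
% Placeholder data standing in for extracted features (assumption):
% each cell holds a numFeatures-by-numHops matrix for one signal.
rng(0);
numFeatures = 2;
featuresTrain = {1000 + 50*randn(numFeatures,40); 1000 + 50*randn(numFeatures,35)};
featuresValidation = {1000 + 50*randn(numFeatures,30)};

% Pool all training frames and compute per-feature statistics.
allFrames = cat(2,featuresTrain{:});      % numFeatures-by-totalHops
mu = mean(allFrames,2);
sigma = std(allFrames,0,2);
sigma(sigma == 0) = 1;                    % guard against constant features

% Apply the training statistics to both sets.
standardize = @(x)(x - mu)./sigma;
featuresTrain = cellfun(standardize,featuresTrain,'UniformOutput',false);
featuresValidation = cellfun(standardize,featuresValidation,'UniformOutput',false);
```

Using only the training statistics keeps information from the validation set out of the preprocessing step.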
Regarding your particular questions and why the network is not working, it's difficult to say without being able to walk through your code, which would require access to that dataset (I don't have it).
Below, I've written something similar to your code but using the ESC-10 dataset, which can be downloaded from the MathWorks support files. Hopefully reading through it will help with your current problem.
I changed the extracted features to the MFCCs plus the delta and delta-delta MFCCs. The dataset does not have car sounds, so we're classifying "dog" and "helicopter" instead. Instead of trimming the signals beforehand, we pass in cell arrays of features and tell the network how to trim the sequences if they're not the same length. The amount of training and validation data is tiny, so we'll lower the ValidationFrequency value (the number of iterations between validation passes) to make sure validation data is plotted (a similar issue might be why you're not seeing the loss).
% Download dataset
url = 'https://ssd.mathworks.com/supportfiles/audio/ESC-10.zip';
outputLocation = tempdir;
unzip(url,outputLocation)
% Create audioDatastore to point to dataset. Use the folder names as the
% labels.
esc10Datastore = audioDatastore(fullfile(outputLocation,'ESC-10'), ...
'IncludeSubfolders',true,'LabelSource','foldernames');
% Subset to only include 'dog' and 'helicopter' labels.
ads = subset(esc10Datastore,esc10Datastore.Labels==categorical("dog") | ...
esc10Datastore.Labels==categorical("helicopter"));
% Split the datastore into train and validation sets.
[adsTrain,adsValidation] = splitEachLabel(ads,0.8);
% Read a single signal from the train datastore and listen to it.
[audioIn,audioInfo] = read(adsTrain);
fs = audioInfo.SampleRate;
sound(audioIn,fs)
% Create an audioFeatureExtractor
aFE = audioFeatureExtractor("SampleRate",fs, ...
"mfcc",true, ...
"mfccDelta",true, ...
"mfccDeltaDelta",true);
% Get the number of features output per signal
features = extract(aFE,audioIn);
[numHops,numFeatures] = size(features);
% Read all audio data into memory
dataTrain = readall(adsTrain);
labelsTrain = removecats(adsTrain.Labels); %remove empty categories
dataValidation = readall(adsValidation);
labelsValidation = removecats(adsValidation.Labels);
% Extract features from all the data (assume the entire dataset uses the same sample rate, 44.1 kHz).
featuresTrain = cellfun(@(x)(extract(aFE,x))',dataTrain,'UniformOutput',false);
featuresValidation = cellfun(@(x)(extract(aFE,x))',dataValidation,'UniformOutput',false);
% Define the architecture
layers = [ ...
sequenceInputLayer(numFeatures)
lstmLayer(100,"OutputMode","last") %< increased number of hidden units
fullyConnectedLayer(numel(unique(labelsTrain)))
softmaxLayer
classificationLayer];
% Define the training options
options = trainingOptions("adam", ...
"Shuffle","every-epoch", ...
"ValidationData",{featuresValidation,labelsValidation}, ...
"Plots","training-progress", ...
"Verbose",false, ...
"SequenceLength","shortest", ...%<--Specify the sequence length (try experimenting with different options)
"ValidationFrequency",20);
% Train the network
net = trainNetwork(featuresTrain,labelsTrain,layers,options);
% Evaluate performance on the validation set
y = classify(net,featuresValidation);
accuracy = mean(y==labelsValidation);
cm = confusionchart(labelsValidation,y);
cm.Title = sprintf('Confusion Matrix for Validation Data (Accuracy = %0.2f)',accuracy);
cm.ColumnSummary = 'column-normalized';
cm.RowSummary = 'row-normalized';
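To make the note about validation frequency concrete, here is a back-of-the-envelope sketch. It assumes the trainingOptions defaults that the code above leaves unchanged (a MiniBatchSize of 128 and MaxEpochs of 30), 40 clips per class in ESC-10, and that a training set smaller than the mini-batch size still yields one iteration per epoch. Please check these assumptions against your release.

```matlab
% Assumed values (see the note above); adjust them to match your own setup.
numTrainObservations = 2*round(0.8*40);  % two classes, 80% of 40 clips each
miniBatchSize = 128;                     % assumed trainingOptions default
maxEpochs = 30;                          % assumed trainingOptions default

% Assumption: at least one iteration runs per epoch, even when the
% training set is smaller than the mini-batch size.
iterationsPerEpoch = max(1,floor(numTrainObservations/miniBatchSize));
totalIterations = iterationsPerEpoch*maxEpochs;

% Iterations at which a periodic validation pass would be triggered.
defaultFrequency = 50;                   % assumed trainingOptions default
reducedFrequency = 20;                   % value used in the code above
hitsDefault = defaultFrequency:defaultFrequency:totalIterations;
hitsReduced = reducedFrequency:reducedFrequency:totalIterations;

fprintf('Total iterations: %d\n',totalIterations);
fprintf('Periodic validations with frequency %d: %d\n',defaultFrequency,numel(hitsDefault));
fprintf('Periodic validations with frequency %d: %d\n',reducedFrequency,numel(hitsReduced));
```

Under these assumptions training lasts 30 iterations, so a frequency of 50 never triggers a periodic validation pass, while a frequency of 20 triggers one at iteration 20.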
  2 Comments
Brian Hemmat on 5 Jan 2021
Hi Saketh,
You'll generally receive the best results if you train using a balanced class distribution. But that's just one of many contributing factors to accuracy.
One approach to dealing with unbalanced class distributions is to use a weighted classification layer. Speech Command Recognition Using Deep Learning uses a weighted classification layer. It's a custom layer and a bit of an advanced maneuver. Also, the example uses a CNN, and I'm not positive a weighted classification layer will improve performance on an LSTM network as well.
Another approach would be to augment your dataset using audioDataAugmenter.
Another approach is to use a pretrained network. You could use something like classifySound off-the-shelf, or you could use the underlying YAMNet network and perform transfer learning for your specific task, as in this example: Transfer Learning Using YAMNet.
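For the weighted classification idea mentioned above, the weights themselves are simple to compute. The sketch below uses made-up labels and inverse class frequency, normalized to a mean of 1. This weighting scheme is one common choice and an assumption here, not necessarily what the referenced example does. How the weights are then consumed by the network (for example, by a custom layer) is left out.

```matlab
% Made-up, unbalanced training labels (assumption for illustration).
labelsTrain = categorical([repmat("dog",30,1); repmat("car_horn",10,1)]);

% Count observations per class, in the order given by categories().
classNames = categories(labelsTrain);
classCounts = countcats(labelsTrain);

% Inverse-frequency weights, scaled so that their mean is 1.
classWeights = 1./classCounts;
classWeights = classWeights/mean(classWeights);

for k = 1:numel(classNames)
    fprintf('%s: count = %d, weight = %.2f\n', ...
        classNames{k},classCounts(k),classWeights(k));
end
```

With 10 car horn clips and 30 dog clips, the rarer class ends up with three times the weight of the more common one.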
One other thing to keep in mind: In the code example I provided previously, I created the validation set as a percentage (20%) of the entire data set. This assumed that the classes are roughly balanced. Usually, if you have unbalanced classes for training, you'll still want balanced classes for validation/testing to get a fair assessment (although this depends on your final application and desired performance). You can use splitEachLabel and specify the number of files to create balanced validation or test sets: Split by Number of Files.
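The balanced hold-out described above can be illustrated without a datastore. This sketch uses made-up labels and plain indexing to set aside the same number of items per class. With an audioDatastore, passing a file count to splitEachLabel serves the same purpose. The variable names and the count of 5 are hypothetical.

```matlab
% Made-up, unbalanced labels standing in for a datastore's Labels property.
labels = categorical([repmat("dog",30,1); repmat("car_horn",10,1)]);
numValidationPerClass = 5;   % hypothetical hold-out size per class

rng(0);                      % make the random selection repeatable
isValidation = false(size(labels));
classNames = categories(labels);
for k = 1:numel(classNames)
    idx = find(labels == classNames{k});
    pick = idx(randperm(numel(idx),numValidationPerClass));
    isValidation(pick) = true;
end

labelsValidation = labels(isValidation);
labelsTrain = labels(~isValidation);
disp(countcats(labelsValidation)')   % same count for every class
disp(countcats(labelsTrain)')        % remaining items stay unbalanced
```

The validation set ends up with 5 items per class, while the training set keeps whatever imbalance remains.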
Good luck!


More Answers (1)

Anshika Chaurasia on 29 Dec 2020
Hi Saketh,
You can also refer to the File Exchange submission Classify Urban Sound using Machine Learning & Deep Learning, which contains a script to classify the Urban Sound 8K dataset using wavelet analysis and deep learning.
Note: Classify Urban Sound using Machine Learning & Deep Learning is one of several submissions in MATLAB File Exchange on MATLAB Central, which is a forum for our product users to interact and exchange information and knowledge without MathWorks' involvement. Feel free to contact the author of this submission directly with specific questions about the implementation.
