
Classification task for speech competence in Deep Learning context

Hi!
BACKGROUND: I'm working on an interesting project where our goal is to create an algorithm that can help speech therapists in their work with children with cleft lip or palate. These children often suffer from speech difficulties which are reflected in the spectral representation of the audio signal. We have a labeled dataset of 150+ recordings, containing around 3 to 5 minutes of speech per child. The speech content is non-homogeneous since the children don't speak the exact same sentences, so we were not able to separate it by phonemes or sounds.
CLASSIFICATION: The recordings are labeled with a speech competence 'score' on a scale of 1, 2 or 3 (lower being better), which corresponds to the speech therapist's holistic assessment of the entire recording. It's often quite easy for the untrained ear to differentiate a 'one' or 'two' from a 'three', but it's a lot trickier to differentiate between a 'one' and a 'two'.
OUR NETWORK: We extract audio features (e.g. MFCC, GTCC, pitch and other spectral features which reflect e.g. hypernasality) on a 0.025 s frame basis and concatenate frames into sequences of 1-2 s (i.e. 25-50 frames), each labeled with the corresponding speech competence score, and have then built a biLSTM network for the sequences. The prediction for a recording in the validation set is then the most frequently occurring predicted score among the sequences of that file.
OUR PROBLEM: As mentioned above, it's hard for the untrained ear to tell the difference between a 'one' and a 'two'. This is largely because what separates a 'one' from a 'two' is just some minor speech 'errors' which may not be present throughout the recording as a whole. For a 'three', on the other hand, there is a structural difference present in most of the file which makes it quite easy to spot. This is reflected in our confusion matrix on a segment-level basis, where a lot of 'two' segments are classified as 'ones' and vice versa.
OUR SOLUTION: Our crafted solution to the problem described above is to train a network on only 'ones' and 'threes', and for this problem we get a validation accuracy above 92% on a segment basis. In other words, we have a network which accurately tells the difference in speech competence between the worst case and the best case.
OUR QUESTION: How do we best go about classifying the 'two' category? When we use the network that is trained to spot 'three' or 'one' segments on the 'two' files (which this network has not seen), we can clearly see that the % of segments in these files being classified as 'three' lies somewhere between the very low % of segments in 'one' files being classified as 'threes' and the very high % of segments in 'three' files being accurately classified as 'threes'. In what way can we use this information? Is a random forest something which could be applicable here?
Many thanks for taking the time to read this,
Best,
Joel

Answers (1)

Shubham on 21 May 2024
Hi Joel,
Your project is tackling a complex and socially impactful problem, and it sounds like you've made significant progress. The challenge of accurately classifying the 'two' category, given its subtlety, is indeed tricky. Based on the information you've provided, here are a few suggestions on how to approach the classification of the 'two' category, including the potential use of a Random Forest classifier:
Step 1: Analyze biLSTM Output
First, thoroughly analyze the output of your biLSTM network for the 'two' files. You're looking for patterns or characteristics in the segments classified as 'one' or 'three'. This might involve:
  • Calculating the percentage of segments classified as 'three' for each file.
  • Assessing the distribution of scores across segments within each file.
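As a minimal sketch of this analysis (variable names are illustrative; segScores is assumed to hold the per-segment predicted scores, 1/2/3, for one file):
% segScores: predicted score (1, 2 or 3) for each 1-2 s segment of one file
pctThree = mean(segScores == 3);                            % fraction of segments classified as 'three'
scoreDist = histcounts(segScores, 1:4) / numel(segScores);  % distribution over the scores 1/2/3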
Step 2: Feature Extraction for Meta-Classifier
Based on your analysis, extract features that could help differentiate 'two' from 'one' and 'three'. Possible features might include:
  • The percentage of segments classified as 'three'.
  • The variability or standard deviation in segment classifications within a file.
  • Any temporal patterns in the classifications (e.g., sequences of 'three' classifications).
You can use MATLAB's built-in functions for statistical calculations to extract these features from your biLSTM's output.
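For instance, a per-file feature matrix covering the three feature ideas above could be assembled like this (filePreds is an assumed cell array holding the per-segment predicted scores for each file):
% filePreds: cell array, one vector of per-segment scores (1/2/3) per file
nFiles = numel(filePreds);
X = zeros(nFiles, 3);
for k = 1:nFiles
    s = filePreds{k}(:);
    X(k,1) = mean(s == 3);                  % % of segments classified as 'three'
    X(k,2) = std(s);                        % variability of segment classifications
    d = diff([0; s == 3; 0]);
    runLens = find(d == -1) - find(d == 1); % lengths of consecutive 'three' runs
    X(k,3) = max([runLens; 0]);             % longest 'three' run (temporal pattern)
end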
Step 3: Prepare the Dataset
Prepare a dataset for training your Random Forest where each instance represents a file, using the features extracted in Step 2. Label the dataset based on the known classifications ('one', 'two', 'three').
Step 4: Train a Random Forest Classifier
Utilize MATLAB's TreeBagger function to train a Random Forest classifier. The TreeBagger function is part of MATLAB's Statistics and Machine Learning Toolbox and is well-suited for classification tasks. Here's a simplified example of how to use TreeBagger:
% Assuming X is your feature matrix and Y is your label vector
RFModel = TreeBagger(50, X, Y, 'Method', 'classification');
In this example, 50 denotes the number of trees in the forest, X is your feature matrix where each row is an observation (file) and each column is a feature, and Y is the vector of labels ('one', 'two', 'three') for each observation. You can read more about it in the documentation: https://in.mathworks.com/help/stats/treebagger.html
Step 5: Validate and Adjust
After training your Random Forest model, validate its performance using a separate validation set or through cross-validation. Evaluate the model's ability to correctly classify the 'two' category, and adjust your feature set or model parameters as necessary to optimize performance.
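One way to sketch this validation (cvpartition gives a stratified hold-out split; Y is assumed to be a cell array of label strings matching the rows of X):
% Hold out 20% of the files for validation, stratified by label
cv = cvpartition(Y, 'HoldOut', 0.2);
RFModel = TreeBagger(50, X(training(cv), :), Y(training(cv)), ...
    'Method', 'classification');
predLabels = predict(RFModel, X(test(cv), :)); % cell array of predicted labels
valAcc = mean(strcmp(predLabels, Y(test(cv))));
confusionmat(Y(test(cv)), predLabels)          % inspect the 'two' row in particular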
Step 6: Integration and Testing
Integrate the Random Forest classifier with your existing workflow. This might involve:
  • Running your audio files through the biLSTM network to get the initial segment classifications.
  • Extracting the features from these classifications for each file.
  • Using the trained Random Forest model to classify each file as 'one', 'two', or 'three'.
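Putting the pieces together for one new recording might look like the following sketch (biLSTMNet, fileSequences and RFModel are illustrative names; classify is assumed to return categorical scores, which are converted to numeric 1/2/3 here):
% End-to-end classification of one new file
segScores = double(classify(biLSTMNet, fileSequences)); % per-segment scores 1/2/3
s = segScores(:);
d = diff([0; s == 3; 0]);
runLens = find(d == -1) - find(d == 1);
feats = [mean(s == 3), std(s), max([runLens; 0])];      % same features as in training
fileLabel = predict(RFModel, feats)                     % 'one', 'two' or 'three'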
Additional Tips
  • Feature Engineering: Spend time on feature engineering based on the biLSTM output. The quality and creativity of your features can significantly impact the performance of your Random Forest model.
  • Model Tuning: Experiment with different parameters for the Random Forest (TreeBagger options) and the number of trees to find the best model.
  • Cross-Validation: Use MATLAB's cross-validation functions to assess the generalizability of your Random Forest model.
This approach leverages the strengths of deep learning for initial audio processing and feature extraction, while utilizing classical machine learning to handle the nuanced classification task, all within MATLAB's robust computational environment.
