Automated Labeling and Iterative Learning for Signals
Overview
Labeling signal data is a very important step in creating AI-based signal processing solutions. However, this step can be very time-consuming and manual.
In this session, we introduce signal labeling for use in AI applications and discuss how MATLAB can be used to speed up and simplify the process. We describe the use of preprocessing to extract information from signals. The session covers different approaches to signal labeling, including using algorithms and automating with deep learning models. We also discuss an iterative method of building deep learning models that reduces human effort in labeling.
Highlights include:
- Using and extending the Signal Labeler app
- Preprocessing to facilitate signal labeling
- Iteratively building and incorporating deep learning models
- Automating signal labeling
About the Presenter
Esha Shah is a Product Manager at MathWorks focusing on Signal Processing Toolbox and Wavelet Toolbox. She supports MATLAB users working on advanced signal processing and AI workflows. Before joining MathWorks, she received her Master's in Engineering Management from Dartmouth College and her Bachelor's in Electronics and Telecommunication Engineering from Pune University, India.
Recorded: 29 Jul 2021
Hi, everyone. Welcome to the session on automated labeling and iterative learning for signals. In this session, we will take a look at the importance of signal labeling in machine learning and deep learning workflows. And then we will take a look at a few different tools and techniques that you can use to perform signal labeling.
There are four steps in a typical AI workflow. And this workflow can be a deep learning or machine learning workflow. The first step is data preparation with data labeling, cleaning, and getting it ready for the next step, which is building and training the AI model. Then the third step is testing the model and integrating it with the broader systems. And finally you have deployment to the field.
Here data labeling and creating ground truth is a critical part of the first step. And it drives the rest of the workflow. So what exactly is ground truth data? Consider raw signals that we have here, which are music signals from different music genres. Now this data needs to be labeled either manually or algorithmically and verified by an expert. Verification is a key step that is required to ensure that the AI model performs well. This then completes the ground truth labeling step.
Once you have the labeled data, it can be used to train a predictive model that classifies the genre of the input signal. It can also be used to measure the accuracy of the trained model by comparing predicted and ground truth labels. So it may seem like AI is mainly about creating neural networks and models. However, here is a great proof point that says otherwise.
Andrej Karpathy, who's the director of AI at Tesla, said that while he was doing his PhD work, he did not spend much time wondering about the data. He was focused almost exclusively on creating new state-of-the-art AI algorithms. But at Tesla, he spent three-quarters of his time on data sets, a very large part of which is data labeling, which is required to develop models for automated driving systems. And you may find that you end up in a similar situation, where you spend a large amount of time labeling the data and getting it ready for the tasks you want to perform.
And signal labeling is particularly challenging because a lot of the labeling is still done completely by hand, which requires a lot of manual effort. Also, oftentimes the correct tools are not used for preprocessing and labeling, which makes the task even more difficult.
So we will take a look at some techniques in MATLAB that could help make this process easier. But before we look at these techniques, let me introduce you to the example that we will use in this webinar. Here, we want to label regions of interest in ECG waveforms. Now ECG waveforms are electrocardiogram waveforms, and they capture the electrical activity of the heart.
This is a typical ECG signal. Each heartbeat cycle has three regions: the P wave, the QRS complex, and the T wave. The output we want is a fully labeled signal. Identifying these regions is then useful to calculate heart rate variability, detect arrhythmia and other cardiac conditions, et cetera. The data set we are using is publicly available, and the data has been labeled by a cardiologist.
Now of course, bear in mind that we'll be using a biomedical signal just as an example. The techniques that we will see are applicable to any other type of signal that you may be working with. So let's start by looking at how we can use the Signal Labeler app for labeling. This app is part of Signal Processing Toolbox. It allows you to view the data you want to label and create different types of labels. You can also label your data through drag-and-drop interactions without having to write any code, which makes it really useful.
Now I will move into MATLAB, and we can see what this app looks like in action. It can be found here in the Apps gallery under Signal Processing and Communications. I use it pretty often, so I have it up here with my favorites. Let's open up the script. I'm going to first load some signals from the ECG data set and open up the Signal Labeler from the command line over here.
Once the Signal Labeler is open, you can import your data from the workspace or from files and folders, wherever you have your data saved. We loaded the data into the workspace, so let's bring that in, add in the time information over here, and import. Then you can display your signal over here. Now before we start actually drawing the labels and marking out the regions, we need to create a label definition.
So a label definition basically is one place where you have your label name, type, data type, and all of this information stored. So I want to name my label ECG regions. My label type is ROI. We are labeling QRS, P, and T regions of ECG signals. And the data type is categorical. So we have QRS, P, and T categories. OK.
So the label definition is particularly useful when multiple members in a team or across teams are working on the labeling task. Having the label definition ensures that everyone is labeling the data in the same way. And this means that when you have to put all of your data together for training the AI model, it's much easier to do.
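For reference, a label definition like this can also be created programmatically with the signalLabelDefinition function from Signal Processing Toolbox. Here is a minimal sketch; the description text is my assumption, not something from the webinar:

```matlab
% Minimal sketch: the same ROI label definition, created in code
lblDef = signalLabelDefinition("ECGRegions", ...
    "LabelType", "roi", ...
    "LabelDataType", "categorical", ...
    "Categories", ["P" "QRS" "T"], ...
    "Description", "ECG wave regions");  % description text is assumed
```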
Now that we have a label definition, we can start drawing labels. Because we are marking out specific regions, it's going to be easier if we zoom into the signal, and that's what the panner is going to let me do. Now, with the QRS label picked out, I can start marking out all of the QRS regions in the signal over here.
And once we've marked out the QRS regions, we can select the P label and quickly mark out the P regions. And we can do the same thing with the T region. Then, once you finish labeling all of the regions in this part of the signal, you can move to the next window and go through the entire signal that way. I'm not going to label the entire signal because that would take some time. But you can see how it goes.
Now the Labeler also has a dashboard view available, which basically provides more insight into the entire labeling task and into your data. So you have the progress bar over here. Over here we had a single signal, which is why it's showing 100% progress. But if you had multiple signals, it's a really great way to get a sense of how far along you are in the labeling task. And if there is a certain number of ROI labels you expect per signal, then you can set that value here as a threshold, and the signal is marked as labeled only once it has that minimum number of regions.
Then you have the label distribution chart for the categorical labels, so you can get a sense of the different classes and what the distribution is. You can also get a sense of the length of each region. All of these are really great ways to get more insight into your data set: to make sure the data is unbiased, that you have enough of all of the classes for training, and, if you don't, to compensate for that in training as well.
And all of the different labels-- so if you had other labels, you could see those as well, and you'd see corresponding charts for those. So again, this is more useful when you have multiple signals. But this is the dashboard view. And once your signal is labeled, you can export the labeled signal set to the workspace. I'm going to export it and show you what the labeled signal set looks like.
So this is the labeled signal set. You now have the source, which is the ECG signal, and the labels in the same place. And you can see the ECG regions as a table. This is because we have the ROI limits and the corresponding values. So if you had multiple labels, again, you would see all of them in this table over here. This is a really great way to manage a large amount of signal data and the corresponding labels, and a really easy way to organize and manage data.
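As a rough sketch of how the same structure looks in code, assuming ecgSignal is a signal vector in the workspace, lblDef is the definition from the sketch above, and the sample rate and ROI limits are made-up values:

```matlab
% Hypothetical example: one ECG signal plus ROI labels in a labeledSignalSet
fs  = 250;                                  % assumed sample rate
lss = labeledSignalSet({ecgSignal}, lblDef, "SampleRate", fs);
roiLimits = [0.88 0.96; 1.12 1.30];         % region limits in seconds (made up)
roiVals   = categorical(["QRS"; "T"], ["P" "QRS" "T"]);
setLabelValue(lss, 1, "ECGRegions", roiLimits, roiVals);
labels = getLabelValues(lss, 1, "ECGRegions")  % table of ROILimits and Value
```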
So now that we've seen this, let's go back to the Signal Labeler app. Now of course over here you've taken a look at a single signal. Right? But if you were actually performing any other task it's unlikely that you would have a single signal. It's far more likely that you would have a data set with multiple signals. So now I'm going to clear this single signal and bring in a data set with more of the unlabeled signals. And now you can see that when you have multiple signals, it would be a tedious job to go in and label all of these one by one even though we have the visual aspect.
So next, we'll take a look at some techniques to start automating this process. One other thing that I do want to talk about: in this case, we have used the time-domain signal to take a look, find the features, and label them. But if you are looking for features that are easier to find in the spectral domain or the time-frequency domain, then you could make use of those different views up here. So you can take a look at the power spectrum or the time-frequency map, and directly draw labels in the time-frequency map as well. So that's another really interesting way to make use of the Labeler to find the features and then label them.
So then the next thing that I'm going to show is automating the labeling process by using algorithms. Here, we're going to use the Pan-Tompkins algorithm, which is a popular algorithm in the biomedical space used to identify R peaks. Basically, how this algorithm works is it takes the raw ECG signal and applies some filtering to it to remove the noise and emphasize key features. Then we perform squaring and averaging on the signal. And finally, we identify the R peaks and the QRS points.
And once we've done this, we can identify the entire QRS region. So we will see how this is being used, and we will use it directly inside the app. So we're using the Pan-Tompkins algorithm. Now the way to integrate the algorithm into the Labeler is by adding custom functions. So we can add a custom function like the Pan-Tompkins algorithm, and over here, we can set the label type. Of course, the custom functions we add have to follow a particular syntax. But they can be brought into the Signal Labeler, and we can use them directly.
So before I bring the function into the app, let me show you what it looks like. This is the Pan-Tompkins function. As you can see, we are returning the label locations and label values, which is the format required by the Signal Labeler app. And here we are performing all of the same steps that we just saw: we are performing the filtering, doing the squaring and integration, finding the R peaks and the QRS points, and then returning the locations of the entire region as well as the corresponding values.
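To give a concrete feel for that format, here is a heavily simplified sketch of such a function. It follows the documented Signal Labeler autolabeling signature, but it replaces the adaptive thresholding of Pan-Tompkins with a crude fixed threshold, and the band edges and window length are assumptions:

```matlab
function [labelVals, labelLocs] = simpleQRSLabeler(x, t, parentLabelVal, parentLabelLoc, varargin)
% Simplified QRS region finder (a sketch, not the webinar's exact code)
fs = 1/mean(diff(t));                        % sample rate from the time vector
t  = t(:);                                   % ensure column orientation
y  = bandpass(x(:), [5 26], fs);             % emphasize QRS energy (assumed band)
y  = movmean(y.^2, round(0.15*fs));          % squaring + moving-window integration
mask  = y > 0.3*max(y);                      % crude fixed threshold
edges = diff([0; mask; 0]);                  % rising/falling edges of the mask
iOn   = find(edges == 1);                    % region onsets (sample indices)
iOff  = find(edges == -1) - 1;               % region offsets
labelLocs = [t(iOn) t(iOff)];                % n-by-2 ROI limits in time units
labelVals = categorical(repmat("QRS", numel(iOn), 1), ["P" "QRS" "T"]);
end
```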
So that's the Pan-Tompkins function we are using. Now let's go into the app and add that custom function. You put in the name of the function over here, and you add the label type. Here we're making it an ROI. So we do that, and-- OK. Now the Pan-Tompkins algorithm is available for us to label all of these signals. We can auto-label all of the signals at one time, or we can do a single signal at a time and then check the labels.
Here, I'm going to auto-label all of these signals. So we have 25 signals, and we'll apply the Pan-Tompkins algorithm to all of them. Basically, in a few seconds, the QRS regions in all of these 25 signals were labeled automatically. Then we can go in and quickly label the P and T regions by hand in all of these signals. And once the entire labeling task is done, we can export the labeled signal set into the workspace.
The labels that are added by algorithms can also be verified in the app. And like I mentioned earlier, verification is an important step because verification of the labels basically determines how well the AI model will be able to learn. So once all of this is done, you end up with a labeled signal set. Now an obvious question over here is: we used an algorithm to find and label the QRS regions. Why didn't we do the same thing for the P and T regions? That would automate the entire process.
And the reason we didn't do that is that the P and T regions are more challenging to find using traditional signal processing techniques. The Pan-Tompkins algorithm works really well and finds the QRS regions pretty accurately in all of the signals. However, there isn't an equivalent algorithm that works as well and as accurately to find the P and T regions, which is why I did not use an algorithm for those. But if you do have an algorithm that can identify all of the different features in your signals, then you can bring that algorithm into the Labeler app and apply it to your signals.
In that case, the Labeler app is still really useful because, like I said, you are able to verify your labels after that point and correct them if necessary. So this brings us to the end of our second technique, which is labeling using algorithms. Now let's go to the last technique, which uses a form of automated labeling based on iterative learning.
Here we are going to use deep learning to label the signals. Now this seems a little circular: we are performing labeling in order to do deep learning, and we are using deep learning to do the labeling. But let me show you how we can do this.
Now if you had, let's assume, a significant amount of data that was already labeled, you could use your labeled data to train a network. Then once you have a trained network, you can put new, unlabeled data through this network, and you will end up with labeled data. But of course, the question is, what do I do when I'm starting with completely unlabeled data? What if I don't have any data that has been labeled at all?
So in this case, you can still use deep learning. You can use it with the iterative learning process. Let me tell you how this works. Let's assume you have a very large amount of unlabeled data. And the red here just indicates that the data is completely unlabeled. You can take a very small subset of this data and label it by hand. Verify all the labels. And once this is done, you can use this data to train a deep learning model.
Once your model is trained, you take a larger subset of the unlabeled data, pass it through the trained network, and label your data. Now because we are using very little data in the training step, the network's performance is going to be coarse, and the network may not perform the labeling very well. However, it will still give you a starting point.
Now once you have this data, you may still have some features that are not labeled, and you may have to go in and correct these labels. And that brings us to the next step. Once the data has been labeled by the network, it needs to be verified manually and corrected. Once this is done, you can add the data into your training data set, and you can now use this new training data set with the added data to retrain the network.
And this completes one iteration. You can do multiple such iterations. Also, the human effort changes from having to label your data from scratch to correcting the labels that the network provides. And as you go through more and more iterations, and more and more data is added to the training data set, the amount of correction required reduces. And this is really important. We will see how impactful this is and how much it reduces the human effort.
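In pseudocode form, the loop looks roughly like this. Every helper name here is illustrative; in practice the labeling and correction steps happen interactively in the Signal Labeler app:

```matlab
% Skeleton of the iterative labeling loop (all helper names are illustrative)
trainSet = labelByHand(unlabeledData(1:25));     % small hand-labeled, verified seed set
for iter = 1:numIterations
    net   = trainModel(trainSet);                % train on everything labeled so far
    batch = takeNextBatch(unlabeledData, 200);   % larger unlabeled subset
    auto  = labelWithNetwork(net, batch);        % coarse labels from the model
    fixed = verifyAndCorrect(auto);              % human verifies and fixes the labels
    trainSet = [trainSet; fixed];                % grow the training set and repeat
end
```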
Now before we go into that, let me talk about one other step, which is very important for deep learning, and that is preprocessing signals. Raw signals can be fed directly into deep networks. However, with signal data this doesn't work very well, because the data has high variability and dimensionality. In this case, it helps to have some form of preprocessing, whether that is a time-frequency transformation or some form of feature extraction. The features can be anything: spectral measurements, peaks, or time-frequency maps. Feeding these into the deep network makes a significant improvement in the accuracy of the network, because it means the network learns only the relevant features.
And we've reduced the dimensionality and variability of the signals, which is really powerful. So in our updated workflow, the preprocessing step comes in before we feed the training data into the network. Now let's get to our final demo.
Here we will start by loading in the entire data set. I have resized the original signals into frames of 5,000 sample points each. This is because the original signals were very long, and when we input very long signals into deep networks, they do not train as well, which is why I resized them. I've also divided the data into a test data set and a training data set.
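As an aside, one simple way to do that kind of framing is with the buffer function from Signal Processing Toolbox; this is my assumption about the mechanics, not necessarily how the webinar data was prepared:

```matlab
% Split a long recording into 5,000-sample frames (buffer zero-pads the last one)
frames = buffer(longECG, 5000);   % each column of frames is one 5,000-sample frame
```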
So the training data set has over 6,500 signals, and the test data set has a little over 2,800 signals. Now the training data set is what we will use for the iterative process. If we were doing deep learning without the iterative process, we would start by labeling all of these 6,500 signals. But by using the iterative process, we will see how we can significantly reduce the number of signals that we need to label.
And we will be using a subset of the test data set to check the accuracy of the trained AI model as we go through the iterative process. Before we start with the iterative learning, let's take a closer look at the actual ECG data itself. We'll use the Signal Analyzer to analyze the data and figure out what the best preprocessing and feature extraction techniques might be for our data set.
The Signal Analyzer is a really useful tool to get started with basic signal exploration. You can view your data in the time domain, spectral domain, or time-frequency domain. You can apply a few different preprocessing techniques and try to figure out what works best for your particular data. Now over here, we will take a look at the power spectrum of the signal, and you can see that there appears to be a spike at this very low frequency.
The signal, in general, appears to be a little noisy. And there is a drop-off in information after about 50 hertz. But in general, the signal is noisy, and biomedical signals can be quite noisy. This may even be true for signals in your application, where the SNR might be really poor and there might be a lot of external noise as you are capturing your signals. So what we want to do is figure out where the noise is and where the important information is, so that we can eliminate the noise and make sure that, as we use this data to train our model in the iterative process, the model is learning the relevant features.
So in order to figure out where the noise and information is, we need to take a closer look at the power spectrum of an ECG signal. And this is where domain knowledge is really useful to quickly figure out what techniques will work for your data. So in purple you can see the power spectrum of the ECG signal, which looks similar to what we are seeing with the test signal. And the QRS complex and P and T waves typically lie below 35 hertz. And at this very low frequency of maybe 0.5 or 1 hertz, we see the spike of low frequency noise.
This comes in due to oscillatory motion that happens as the patient is breathing when the ECG is being captured. And then at these higher frequencies you have muscle noise that comes in because the patient may move when the ECG is being captured. And we want to make sure that we remove both the low frequency and the muscle noise, and keep only the relevant information.
So this information is going to help us design a filter and apply it to our signal. Now I can apply a bandpass filter from within the app, so I'm going to create a bandpass filter from 0.5 to 35 hertz. Instead of having my original signal overwritten, I'm going to quickly create a duplicate, create another display, display the duplicate there, and create the power spectrum for the signal. Now I'll apply the bandpass to this signal so I can compare with the original.
And you can see that there appear to be spikes at the end, and this particular filter doesn't do a great job at filtering the signal. So I'm going to undo the processing. Instead, I am going to apply a custom function over here, which is a bandpass filter that I designed using the designfilt function, and I'm going to use that to filter my signal.
And now you can see the signal looks really good. The slow drift from the original signal is removed, and the high-frequency noise has been eliminated as well. So this is how you can use the Signal Analyzer to try out a few different techniques, or bring in your own custom functions and algorithms, and apply those for preprocessing and extracting features from your signal.
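A minimal sketch of that kind of custom filter, built with designfilt; the filter order and the use of zero-phase filtering are my assumptions:

```matlab
% Bandpass IIR filter passing roughly 0.5-35 Hz (order is an assumed choice)
fs = 250;                                   % assumed ECG sample rate
bpFilt = designfilt("bandpassiir", ...
    "FilterOrder", 20, ...
    "HalfPowerFrequency1", 0.5, ...
    "HalfPowerFrequency2", 35, ...
    "SampleRate", fs);
ecgClean = filtfilt(bpFilt, ecgNoisy);      % zero-phase filtering preserves wave shape
```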
Another great way to represent data is to use time-frequency maps. Time-frequency maps basically provide information on how the frequency content of the signal changes with respect to time, and this gives more insight into the signal than using the time or frequency view alone. What you see is a spectrogram that's created using the short-time Fourier transform. Or you can use the scalogram, which is a time-frequency map created using the continuous wavelet transform.
Over here, in this case, we will use the Fourier synchrosqueezed transform to create the time-frequency map. We are using the Fourier synchrosqueezed transform because it provides one spectral estimate per sample point. Because we are performing sequence-to-sequence classification, we want to ensure that the signal going into the deep network is the same size as the original signal, and the Fourier synchrosqueezed transform allows us to do that.
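In code, the transform itself is a one-liner; the band selection and per-row normalization below are common choices I'm assuming, not necessarily the webinar's exact preprocessing:

```matlab
% Per-sample features from the Fourier synchrosqueezed transform
[s, f] = fsst(ecgClean, fs);                % s has one column per signal sample
feat = abs(s(f > 0.5 & f <= 40, :));        % keep the band of interest (assumed)
feat = (feat - mean(feat, 2)) ./ std(feat, 0, 2);  % normalize each frequency row
% size(feat, 2) equals numel(ecgClean), so per-sample labels stay aligned
```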
Now let's get into the first step of the iterative process. And we will manually label 25 signals. As we've seen above, our total data set has about 6,500 signals. And I'm going to take just 25 of those signals, and label them using the Signal Labeler app. Now I'm going to import the signals over here into the Labeler. And I have also saved the label definition from earlier, and I'm going to put that in.
Now that we have our signals and the label definition, we can actually get started with labeling. It is possible to manually label all of these 25 signals. Or if you have an algorithm that can help you get a starting point, it's great to use that as well. Here we already have the Pan-Tompkins algorithm, and we know that it will work to find all of the QRS regions. So let's use it to find the QRS regions in all of these signals.
And once we find the QRS regions and those are labeled, we can go in and label the P and T regions manually. Once that entire process is done, we can export the labeled signal set into the workspace and use it for training. Here, I'm not going to label all of the regions because I have done that beforehand to save some time. And now we have the labeled training data available.
So we move to the next step, where we will perform the preprocessing. Like we saw, we are going to use the Fourier synchrosqueezed transform, and we will train the initial network. So here we are performing the feature extraction, and then we will create a deep network. Here we've created a BiLSTM network. BiLSTM and LSTM networks work really well for time-series data like the signal data over here, which is why we're using one. And it's a pretty simple network that I've created over here.
I'm not going to go into a lot of detail on the layers, the hyperparameters, and the training options, but you can refer to the deep learning documentation to get a better idea about these different parameters. And I have done the training beforehand as well to save us some time. So I'm loading in the results as well as the training progress chart over here.
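For reference, a network along these lines can be defined in just a few lines. The layer sizes, the number of classes, and the training options here are assumptions rather than the exact values used in the webinar:

```matlab
% Minimal BiLSTM for sequence-to-sequence region labeling (sizes assumed)
layers = [
    sequenceInputLayer(40)                   % one channel per FSST frequency row
    bilstmLayer(200, "OutputMode", "sequence")
    fullyConnectedLayer(4)                   % P, QRS, T, plus a background class
    softmaxLayer
    classificationLayer];
options = trainingOptions("adam", ...
    "MaxEpochs", 10, ...
    "MiniBatchSize", 50, ...
    "Plots", "training-progress");
% net = trainNetwork(XTrain, YTrain, layers, options);
```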
In the training progress chart, you can see that the network accuracy improves as the network trains. On the bottom of the chart, we have loss, which can be thought of as a penalty for misclassification, and that drops as the network accuracy improves. The prediction accuracy of this initial trained network comes out to about 82%, which is a really good accuracy to get as a starting point, because at this point we've labeled only 25 signals from the entire data set.
So now we will move on to the next step, where we will take a larger subset of our unlabeled training data, pass it through our initial network, and then manually verify and correct the labels wherever required so that we can add it to our training data. So first, we will create a subset from the training data. I'm going to be taking 200 signals for this next step. And I'll open the Signal Labeler again. I'm going to clear all of the members that were loaded in previously, and I'm going to load in this new training set, which has the 200 signals.
Once it's open, I can label these signals using the initial trained network. The way I'm doing that is by using a custom function that I have written. I'm going to open this function up, and you will see again that it has the same format required by the Signal Labeler app. This function takes in the signals and uses the initial network that we trained to find the QRS, P, and T regions and label the signals.
And then it returns the label values and label locations back to the Signal Labeler. So this is the function that we are using. I can now add my custom function, which is find ECG regions. And again, I am labeling ROIs, so I can click OK. Now, here, instead of labeling all of the signals at one time, I'm going to label one signal, then go through it, verify the labels, correct them wherever required, and then move on to the next one.
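A rough sketch of what such a wrapper can look like; the MAT-file name and the two helper functions are hypothetical stand-ins for the preprocessing and mask-to-ROI conversion steps:

```matlab
function [labelVals, labelLocs] = findECGRegionsSketch(x, t, parentLabelVal, parentLabelLoc, varargin)
% Sketch: autolabeling function wrapping a trained network (helpers hypothetical)
persistent net
if isempty(net)
    S = load("trainedECGNet.mat");            % assumed MAT-file holding the network
    net = S.net;
end
feat = extractFSSTFeatures(x, t);             % hypothetical preprocessing helper
pred = classify(net, feat);                   % one categorical label per sample
[labelVals, labelLocs] = mask2rois(pred, t);  % hypothetical mask-to-ROI helper
end
```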
So, labeling one signal at a time, I'm going to click on auto-label and then run, and apply the algorithm to this one signal. We can see over here that 62 regions are found in our entire signal; that is, 62 regions of all three types. And we will again use the panner to zoom in. Now we can correct the labels wherever they are incorrect. We can see over here that the P label is not really correct, so we can use the app to manually correct those labels.
Now, a metric that we are going to be using to check on the effort required to make all of these changes is the number of samples corrected. What I mean by that is: we had seen earlier that we've created frames of 5,000 sample points each. So when I corrected this particular P region label, I could have been correcting the labels for 50 or 60 sample points. So even though I'm correcting one region, I am correcting maybe 50 or 60 sample points.
So the intention is not to use the number of sample points as an absolute representation of effort, but to give a general idea. To put that another way, I checked how many regions I needed to correct in this one signal, and out of the 62 regions that were labeled automatically, I had to go in and correct maybe 25 or 30 regions. When I did that for an entire signal, it took me about a couple of minutes. So the number of samples is not an absolute value, but it gives a sense of the overall effort that is going in.
And the reason we use the number of samples is that it is easy to calculate as we go through the entire iterative process. So once we've corrected all of the labels over here, we can save the labels and go back, and we can continue to do that for all of the different signals in our data set, and then export the data back into the workspace. And again, over here, you can see that the task changes from having to label each individual region from scratch to correcting the labels that are already there, which reduces the human effort.
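As an aside, that metric is cheap to compute once you have per-sample label masks; a tiny sketch, where the mask variables are assumptions:

```matlab
% Count samples whose label changed between the network output and the
% human-verified version (autoMask/verifiedMask are assumed categorical masks)
samplesCorrected = nnz(autoMask ~= verifiedMask);
avgPerFrame = mean(correctedCounts);  % averaged over the 200 frames (assumed vector)
```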
Now we'll quickly see what the average number of samples corrected per frame was in our data set of 200 signals. The average number of samples corrected was 1,029. OK. So that completes our first iteration. Now we go on and perform multiple such iterations. The number of iterations you need to perform depends on the accuracy that you're looking for in your particular application. In this case, I performed 15 such iterations. And again, I'm just loading in the final result of the entire iterative process over here. The accuracy of the network that comes out of this iterative process is 93.4%.
Now a question that might come to mind is: would we have gotten a better result if we had labeled all of the data to begin with and then used that for training? I tried that as well, and the accuracy of that network came out to 94.1%. As you can see, there is hardly any difference in accuracy, and the amount of training data required in the iterative process was half of the complete training data set. So of the 6,500 signals in the training data, I used only half in the iterative process, and in the second case, I used all of it.
So in terms of accuracy and reduction in human effort, the iterative process really pays off. Now, a couple of other charts that I would like to show you. The first is the accuracy as we move along in the iterative process. Here we've marked the accuracy of the network after each iteration, and you can see that we started at around 82% but had a steep increase in accuracy. With only 700 signals in our training data set, we moved up to over 90% accuracy. So we really saw a fantastic improvement in accuracy through the iterative process.
Now the other chart that I want to talk about is the number of human corrections that were required at each iteration. Again, you can see that we started at 1,029 corrected samples per frame, but by the end of the process, the number of corrections required was down to only 460 per frame. Now here, I have the final results, and we will use our network that was trained through the iterative process to label a signal. You can see that it does a really good job at labeling the signal overall.
So that brings us to the end of our presentation. I'm going to quickly recap all of the different techniques we've covered in the session today. We took a look at the Signal Labeler app. We saw how to visualize our data, add label definitions, and use custom functions to automate the labeling process. And finally, we saw how to use an iterative process to automate labeling with deep learning.
We had seen at the beginning the different challenges that you face with labeling, and now we've seen different ways that MATLAB can make this easier. You can use AI and apps in MATLAB to reduce the manual effort and address the lack of labeling tools. And you can use a lot of application-specific algorithms that can make both the labeling and the network training parts easier.