Predictive Maintenance with MATLAB: A Data-Based Approach
Overview
Do you work with operational equipment that collects sensor data? In this seminar, you will learn how you can utilize that data for Predictive Maintenance, the intelligent health monitoring of systems to avoid future equipment failure. Rather than following a traditional maintenance timeline, predictive maintenance schedules are determined by analytic algorithms and data from sensors. With predictive maintenance, organizations can identify issues before equipment fails, pinpoint the root cause of the failure, and schedule maintenance as soon as it’s needed.
Highlights
- Accessing and preprocessing data from a variety of sources
- Using machine learning to develop predictive models
- Creating dashboards for visualizing and interacting with model results
- Deploying predictive algorithms in production systems and embedded devices
- Using simulation to generate data for expensive or hard-to-reproduce failures
About the Presenters
Russell Graves is an Application Engineer at MathWorks focused on machine learning and systems engineering. Prior to joining MathWorks, Russell worked with the University of Tennessee and Oak Ridge National Laboratory in intelligent transportation systems research with a focus on multi-agent machine learning and complex systems controls. Russell holds a B.S. and M.S. in Mechanical Engineering from The University of Tennessee and is a late-stage mechanical engineering doctoral candidate.
Recorded: 26 Oct 2021
Hello. My name is Russell Graves. I'm an application engineer with MathWorks. I have a little over a decade of experience working in and around our tools through two mechanical engineering degrees at the University of Tennessee, with a third in the works. Throughout this time, I've worked with several machine learning and pattern recognition techniques, which really lend themselves to predictive maintenance workflows.
With all that said, I want to start today by answering two questions. First, why would we perform maintenance in general? And second, what would make that maintenance predictive?
To answer the first question, take a look at a quick example. The braking system in this wind turbine has degraded to the point of failure. In the video, high winds are causing the blades to spin much faster than the designers ever intended or anticipated, and the result is a catastrophic failure.
Running to failure segues right into the question of what we would define as predictive maintenance. To get there, we're going to break maintenance into three broad categories, the first of which is letting equipment run to failure.
So reactive maintenance-- we just witnessed a pretty poor example of it, where we wait until something breaks before we fix it. To avoid these big, cascading failures-- where one component, like the braking system in that turbine, lets go and the whole machine tears itself to pieces beyond repair-- we might adopt a maintenance schedule. This is probably pretty common: a crew climbs the turbine every few weeks or months and services it-- replacing parts, inspecting things, and so forth.
In all honesty, the turbine we saw explode probably had a maintenance schedule. That raises the question: could we have done something different? Could we have chosen more intelligently when to repair that braking system? How could we have made that maintenance decision more accurately? These sorts of questions are where we start getting into predictive maintenance.
Maintenance becomes predictive when we're using data to predict, as the name suggests, when things are going wrong, and to make maintenance decisions more intelligently. The follow-up question is: what could this look like? I can show you an example where we used MATLAB to build out a predictive maintenance dashboard and then deploy it as a web application. Here, we're going to pivot from the turbine example to pumps.
Now we have a series of 10 pumps sitting somewhere out in the field, moving fluid back and forth. On this panel, I can easily tell what MATLAB predicts the remaining useful life for each pump to be. I can also see several pumps highlighted in red, indicating that they're in need of more immediate attention.
I can go a little further than this. If I click on the By Pump tab and select pump 3 from the dropdown, I can see that MATLAB has detected that this particular pump is operating with a bearing fault.
Left unattended, in a very similar cascading effect, that bearing could shed tiny metal shavings into the inner workings of the pump, causing a catastrophic failure of the rest of the pump and rendering it completely irreparable-- in need of replacement to keep our facility moving whatever these pumps are pumping.
So that's an example of what such a solution could look like-- or at least part of it, the dashboard interface. We're going to spend the rest of today going through some basic predictive maintenance concepts. Then we'll dive into an example, looking at the pumps behind the dashboard, and put together the underpinnings of two of those dashboard components: remaining useful life, and which fault we're experiencing-- in that case, the bearing fault.
Then I'll point you to some resources meant to aid you in any predictive maintenance endeavors you take on in MATLAB. Finally, we'll have a Q&A session where my colleagues and I will field some of the questions you have, so feel free to put those in the Q&A panel in the meantime.
Now, before we get too far into the example, I want to take a step back and cover exactly what a predictive maintenance algorithm is doing. I mentioned at a very high level earlier that we're making more intelligent decisions with the data. In between those two endpoints, we're answering three distinct questions. Is the machine operating normally? If not, what is wrong with it? And third, how much longer can I expect the machine to operate? We define these tasks as anomaly detection, condition monitoring, and remaining useful life estimation.
To answer these three questions, this is the workflow we look to. The first two boxes here are the data acquisition and preprocessing steps. Those are the most common steps across many data analysis and engineering workflows, and they also happen to be the most arduous part of the process.
Fortunately, MATLAB and its supporting toolboxes offer a ton of functionality to help you get through these two steps. It's not magic, but it will significantly reduce the pain and let us reach the third and fourth steps much, much faster-- with less time spent and less of your hair pulled out.
The middle area here is where we get to the fun part. This is where we interrogate the data to find interesting relationships and see what we can do from a predictive maintenance standpoint. Today, and in many cases when we're talking about training models, we're going to talk about machine learning, because machine learning techniques are great at picking out subtle nuances between healthy data and not-so-healthy data. Just as we saw on the dashboard earlier, MATLAB was able to tell us that pump 3, even with an estimated 40 days of remaining useful life, was operating with a bearing fault-- something we caught very early on.
This is an iterative process. If we fail to meet our requirements the first time-- if the model isn't accurate enough or good enough-- we go back, identify some new features, add those in, train a new model, and rinse and repeat until we get something we're happy with. At that point, we can deploy it. And the final loop creates a living workflow, where we continue to acquire data from our device over time and use it to improve the accuracy of the model.
With that said, let's dive back into the pump example. In the case of the pump, we have some labeled fault data, and what we want to do is differentiate between different combinations of three possible faults-- a total of eight fault configurations.
We're only going to use the pressure and flow data recorded by sensors embedded in the output of the pump. As an example, this is the pressure data for a single operational cycle of one of these pumps. There is some transient behavior, which we removed during the preprocessing step, leaving us with steady-state behavior that we can analyze.
Now let's take a look at answering that second question-- classifying which fault, or combination of faults, we're encountering. This is the condition monitoring task. Let's see how you might do this inside MATLAB.
So we're in MATLAB, and we have our data here. Looking at it, we can see that it's simply a table of timetables. Each row corresponds to a single work cycle for the pump. For example, row 4 shows the fault code that cycle was operating under-- we've codified the three faults: the bearing failure, the seal leak, and the inlet blockage.
That fault code is stored here as our ground truth variable, the one we'll look at when we train the classification model. In the first two columns, we have the pressure and flow data, each stored as a timetable-- just a time series of data inside MATLAB.
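As a rough sketch of how a data set organized this way might be loaded and inspected-- the file and variable names here are assumptions for illustration, not the exact ones from the demo:

    % Assumed layout: one table row per work cycle; flow and pressure hold
    % timetables, and faultCode is the ground-truth label.
    load pumpData.mat                 % hypothetical MAT-file containing the table
    head(pumpData)                    % preview the first few work cycles

    cycle = pumpData.flow{4};         % timetable of flow data for work cycle 4
    plot(cycle.Time, cycle{:, 1})     % plot its single data variable over time
    title("Flow, cycle 4, fault code " + string(pumpData.faultCode(4)))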
So our data is in and ready to go. The next thing we want to do is start digging through features and identifying condition indicators. We have an app called the Diagnostic Feature Designer, which helps us quickly and easily explore the data, differentiate features, and select the condition indicators to feed into our machine learning model.
We can go up to the Apps browser, find the Diagnostic Feature Designer, and open it up. We start a new session, and this is where we select our data. In this case, we have the pump data in memory, and we can see that we're importing the flow data, the pressure data, and our fault code, which has been recognized as the condition variable-- the variable of interest.
Once everything is set, we click Import, and that brings our data in. Over on the left-hand side, you can see all the available signals in this little browser-- both our flow data and our pressure data, ready to go. Then we can go up and start investigating.
If we click on the flow data, we have some options: we can either get a summary of the signal or a trace of all the work cycles. The trace plots all the different cycles in our data set, and we can group them by fault code.
And really, this is a mess. This tool will help us parse through some of it. We can zero in on any area we think might be of interest using the panner down here, but even zoomed in, this is still pretty hard to interpret, at least to my eyes. The only thing we might glean is that there appears to be some periodicity within the data set.
So instead of going cross-eyed staring at this any longer, let's start generating some features. I'll select the flow data and get some time-domain features out of the way first; we'll come back to the periodicity we saw.
We click on the Time-Domain Features dropdown, where there are a few domain-specific options, but we just want Signal Features. In this pane, we can go through and select any features we want, then click Apply, and the app calculates all of them.
Either way-- whether you already have an idea about which features might be of use to you, or you aren't really sure and you just want to throw some spaghetti at the wall and see what sticks-- we provide options for both here. And it's really nice to be able to browse through all the possible features.
When they're done generating, you can go back and click on the feature table view. Similar to the data we saw previously, each of these rows is a different work cycle, and we can see the fault code that was associated with it, all the features generated for that specific cycle of data, and the associated scores. We'll come back to the scores in a minute.
We did notice some periodicity, but I have time-domain data here. To move to the frequency domain, where I can actually investigate the frequency-domain features that might be present, I have to do a power spectrum estimation.
In this case, I'm going to use an autoregressive model with a model order of 20 to estimate the power spectrum of the flow data. I hit Apply, and that calculates the power spectrum and lets us dip into the frequency domain and do some spectral analysis.
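Outside the app, you could get a similar estimate with Signal Processing Toolbox. Here is a minimal sketch, assuming flowSignal holds one cycle's flow data sampled at rate fs; the app's exact estimator settings may differ:

    % Burg-method autoregressive power spectral density, model order 20.
    order = 20;
    [pxx, f] = pburg(flowSignal, order, [], fs);

    plot(f, 10*log10(pxx))            % dB scale makes spectral peaks easier to compare
    xlabel("Frequency (Hz)"), ylabel("Power/frequency (dB/Hz)")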
Once this is done, we have a nice plot of our power spectrum for the flow data. I can change the scale and, again, group by fault code, which may let me glean some information from the plot.
In this case, I'm once more not sure exactly where to look. So I'll go back-- we now have this power spectrum data in our signals and spectra-- and go up to the Frequency-Domain Features dropdown.
Again, you're greeted with some domain-specific options, but I'm just going to take the top one. Same workflow-- hit Apply, and the app generates all those features for us. At this point, we have a ton of features; they all just went into the table.
So the next question-- some of you may have already noticed the Rank Features button at the top right. That's the next stop. We have all these features now. How do they stack up? How well do they actually describe the differences between the fault codes? And which ones should we use, given that some features will be better at explaining certain faults than others?
In cases like these, there are often diminishing returns, to the point where you really don't want to keep all of the features-- you want to isolate the features that matter most. This tool enables you to do that in a much quicker, more interactive way. In this case, we're ranking the features by fault code using a one-way ANOVA ranking.
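For the curious, here is a minimal sketch of what a one-way ANOVA ranking amounts to, assuming featureTable holds numeric feature columns plus a faultCode label-- the names are hypothetical:

    % Rank each feature by how well it separates the fault-code groups.
    featNames = setdiff(featureTable.Properties.VariableNames, "faultCode");
    score = zeros(numel(featNames), 1);
    for k = 1:numel(featNames)
        % p-value of a one-way ANOVA of this feature across the groups
        p = anova1(featureTable.(featNames{k}), featureTable.faultCode, "off");
        score(k) = -log10(p);         % larger score = stronger separation
    end
    [~, idx] = sort(score, "descend");
    featNames(idx)                    % features ranked from best to worst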
Once we have all the features we're interested in and we're happy with the list, we can go up to Export and choose which export option we want. In this case, I'm going to generate a function so that I can repeatedly extract these same features from a similar data set if I collect more data from these pumps later down the road.
So I click Generate Function for Features, and I can select the ranking algorithm I want to use and the number of features. You often don't want to take all the features because, again, there are diminishing returns. So today I'm just going to take the top 10.
Hit OK, and that gives us this function, as yet unsaved. I can name it generateFeatures, or give it some more descriptive title, and then use it in a script.
I already have a script where we've done exactly that. Right here we have the generateFeatures function, automatically created by the Diagnostic Feature Designer. All we want to do at this point is partition our data for testing.
We're going to withhold some of the data so that later, when we evaluate the performance of the classifier we train next, we don't give it the same questions on both the practice exam and the actual final. Just like a student, your algorithm could cheat.
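A hold-out split like that takes only a few lines with cvpartition-- a sketch with the same assumed featureTable:

    rng("default")                    % make the random split reproducible
    cv = cvpartition(featureTable.faultCode, "HoldOut", 0.2);   % stratified 80/20
    trainingData = featureTable(training(cv), :);
    testingData  = featureTable(test(cv), :);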
The next thing we want to do is open up the Classification Learner app. This app lets us quickly and easily work through classification techniques-- even ones we may not be familiar with-- and explore which one provides the best results for our specific needs.
Same thing-- we hit New Session. Our data is in the workspace, so we click on that and select our data.
We've got our training features and label-- we've reserved the testing set for later. We're greeted with a scatterplot and a couple of options. Again, I'm not sure which classification technique is going to be best for us, so in this case I'm going to turn on parallel processing to make this go a little quicker.
Then I can just go over here and hit All-- basically throwing everything at the data-- and see which of these classification techniques best fits it, at least at a glance.
In this step, we're exploring. We train all these different models and see which one performs best. That may guide our selection down the road, or guide where we put our effort in building this classification algorithm.
On the left-hand side-- I'll zoom in briefly-- we have the name of each classifier type. On the right, we have the accuracy and the number of features used to achieve it. This is validation accuracy, so it's computed on the training set only, not the testing set.
We can sort the models by validation accuracy, and we see right off the bat that the top option is an ensemble-type learner, Bagged Trees, with a validation accuracy of 81.8%.
Depending on your application, that may be great, or it may be not so great. But we can dig a little deeper and see exactly what we're talking about when we say 81% accuracy. We click the model, then hit Confusion Matrix over here. The confusion matrix isn't meant to confuse you or me-- it shows where the classifier itself was actually confused.
If everything were perfect, we'd have 100s down the diagonal. That's not the case. In fact, in this fourth row we can see that while the true class was failure mode 11-- again, a tokenized version of combinations of our three failure modes-- the model predicted it correctly only about 35% of the time and misclassified it as just failure mode 1 about a quarter of the time. From here, you can start to dig in and tease apart-- going back to that feedback loop-- which features you could include, or what little tweaks you might make, to try to improve that accuracy.
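The command-line analog of that Bagged Trees model and its confusion matrix might look like the following sketch-- this reuses the assumed trainingData table from earlier and is not the app's generated code:

    % Bootstrap-aggregated ("bagged") decision trees, as in the app.
    mdl = fitcensemble(trainingData, "faultCode", "Method", "Bag");

    cvmdl = crossval(mdl, "KFold", 5);      % cross-validate within the training set
    valAccuracy = 1 - kfoldLoss(cvmdl)      % validation accuracy

    % Where does the model get confused? Compare true vs. predicted classes.
    confusionchart(trainingData.faultCode, kfoldPredict(cvmdl))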
That's if we're not happy with the results. If we are happy with this accuracy, the next thing we want to do is double-check, because we don't want to just trust the validation numbers-- they all come from the training set, and the algorithm could be cheating.
So we go to our testing data. Up here, we have a Test Data button; we go to From Workspace, select the testing set we reserved, and import it.
And in fact, rather brutally, I'm just going to test all of the models, because I want to know how each of them performed on the testing set. On the left-hand side, you can now see that the accuracy type has switched from validation to test accuracy. That same Bagged Trees model performs at 83.3%-- the highest accuracy of any of these learner types.
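Programmatically, that final check is just a prediction on the held-out rows-- a sketch with the same assumed names, where the fault codes are numeric or categorical so they compare with ==:

    predicted = predict(mdl, testingData);
    testAccuracy = mean(predicted == testingData.faultCode)   % fraction correct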
If 83.3% is right on the money for our accuracy requirements, we can go up to the export options. There are a couple of things we can do. We can generate a function-- going back to repeatability, in this case it's the training function, so it will train the same model over and over again. That ties into the living workflow we talked about: continually refining your model and making it better and better.
We can also export the model itself. Among the export options, we can export the currently trained model, a compact version, or a version for deployment.
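The exported struct-- commonly named trainedModel-- wraps the classifier in a predictFcn field. Scoring future data could then look like this sketch, where generateFeatures is the function we exported from the Diagnostic Feature Designer and newPumpData is an assumed variable:

    newFeatures = generateFeatures(newPumpData);      % same features, new cycles
    predictedFaults = trainedModel.predictFcn(newFeatures);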
We're going to hop back over to PowerPoint now. We just finished designing our fault classification algorithm and training that model, so we've gone through this piece of the feedback loop. The next thing we want to do is look at how we might answer that third question.
Now that we've classified the fault and we know why our machine is failing, we want to know how much longer the pump will continue to operate. So we're going to do some remaining useful life estimation and take a look at what that might look like in MATLAB.
But first, we need to decide exactly what type of model we want to use to estimate the remaining useful life. There are a few options, and the choice really depends on what type of data you have access to.
In the first case, we have complete run-to-failure data: we've set a bunch of pumps out somewhere and run them until they failed. Then we can compare the data from the failed pumps to the data we're currently seeing and say, apples to apples, that our pump looks like it has so long left based on what we've observed in the field.
Often, though, we don't have run-to-failure data, because it's very costly to actually let one of these pumps run to failure-- again, we're usually taking data from equipment on maintenance schedules. We might have data about accumulated degradation but no failure data. We still want to make an intelligent decision about when to service the pumps, so we might assign a safety threshold and use the existing degradation data to train a model.
The third option is when we only have data about survival rates for similar pumps. For example, if I knew that a similar type of pump failed after, say, 100 operational hours, and our pump had run for 70 operational hours, then we'd say it had about 30% of its useful life remaining-- something like that.
There's a much more detailed flowchart on a documentation page I'll specifically call out, which will guide you through selecting the type of remaining useful life model you need for your specific application. In our case, we're using a degradation model, because we don't have run-to-failure data-- we have cumulative degradation data for these pumps over time. With that said, let's dive into MATLAB and see what that looks like.
Now we're back in MATLAB, looking at another script, this time focused on remaining useful life estimation in the case of a valve blockage.
Again, we start by acquiring or loading in our data. Here we have data that was collected continually, specific to the valve blockage: as the blockage became more and more severe, we kept collecting data from the pump.
In addition to collecting the raw data, we've gone through and created a feature set. Just as we did with the Diagnostic Feature Designer, we've captured a number of features from this data pertaining to the degradation over time.
For a remaining useful life model, we want to extract or engineer the one feature from that set which best describes how the pump is degrading-- again, specific to the valve blockage.
We could have selected any of the existing features we initially extracted from this data set. But in our case, we found we were better off engineering a feature: we used principal component analysis to combine several of the extracted features into a new set of principal components-- which are themselves features-- that better described the degradation.
We were able to use the first of those principal components to track the degradation over time, and for us, that was good enough. It will be different depending on your specific application or the data you're working with, but it's one option you can look to.
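A sketch of that feature-engineering step, assuming degradationFeatures is a numeric matrix with one row per measurement over the pump's life:

    % Standardize the features, then keep the first principal component
    % as a single health indicator that tracks degradation over time.
    [~, scores] = pca(zscore(degradationFeatures));
    healthIndicator = scores(:, 1);
    plot(healthIndicator), ylabel("Health indicator (PC1)")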
And again, you could use run-to-failure data for this if you had it. In our case, we didn't, but we can still use this compiled degradation data to extract the same information and apply the model.
Let me show you what this looks like. This is our exponential degradation model. It has used historic data to develop an estimator so that, as new data comes in, the blue line shows our measured health indicator and the red region is where we predict we'll cross our remaining useful life threshold. That threshold, as you can see here, is set to minus 9.
How did we arrive at minus 9? By looking at the data we already had. We ingest the data, and MATLAB generates the health indicator-- in our case, the first principal component.
With that health indicator over time, we were able to look at the historic data and say that at this point in time, the pump had reached a point where we didn't want to let it run any further-- we were confident that if we continued to operate it, we might encounter failure. So we assigned a comfortable threshold based on the data we had.
In your case, you could normalize this and have a threshold of 0, or you might base your threshold on whatever information you have at hand. Either way, you can see that we were able to fit this model, and our estimate of remaining useful life is going to be much better than a guess at the end of the day. It gives us a good idea of when we might need to perform that maintenance.
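With Predictive Maintenance Toolbox, the fit-and-update loop might be sketched like this-- healthHistory and newData are assumed variables holding the health indicator as [time, value] observations, and the exact data formats the model accepts are worth checking in the documentation:

    mdl = exponentialDegradationModel;      % from Predictive Maintenance Toolbox
    fit(mdl, healthHistory)                 % estimate priors from historic data

    threshold = -9;                         % failure threshold chosen from history
    for k = 1:size(newData, 1)
        update(mdl, newData(k, :))          % refine the model as data arrives
        estRUL = predictRUL(mdl, threshold) % current remaining useful life estimate
    end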
So we just finished building the remaining useful life estimator for those pumps. But what if we didn't have access to enough data to actually develop that model?
Well, if we had domain expertise, we could use our physical modeling platform, Simulink, to build up a model of the pump and artificially generate failure data. The way this works: you have a model of the physical system, you inject a simulated failure into that model to get failure data out, and then you use real-world data to help refine and improve the model.
So let's take a look at what that looks like inside Simulink, our modeling and simulation platform. You can see here that I've had a subject matter expert come in and build us a model of this pump.
If we dive under the hood of the pump model, we see that it's assembled from more basic components. Even further inside, each of the plungers and individual pieces is created using some of the physical modeling capabilities built on top of Simulink. And indeed, although they're currently off, all of the failure modes we discussed earlier are modeled too, so we can turn them on and off and artificially create failure data as we need it.
To simulate a healthy work cycle for this pump, we can just leave everything as is and hit Run. Now that it's finished running, we can open the Simulation Data Inspector, which is a nice tool for analyzing individual runs. We'll eventually want to do a comparison here, so I'll set up the appropriate layout, and then we'll look at our flow rate-- qin versus qout.
So this is what a healthy work cycle looks like. The next thing we want to do is inject a failure and simulate that. Built right into the top level of this pump model, we can bring up the pump options, and there's a Fault tab where we can quickly and easily reconfigure any combination of these failures and simulate the model.
I'm going to turn the seal leak on for plunger 1, and now we'll generate some artificial failure data simulating a seal leak. I just hit Run.
Now that the run has concluded, we have some simulated failure data. I'll bring the Data Inspector back up, and you can see that the flow data for that failure mode is already displayed for the current run.
I can go into the archive, pull up one of my healthy runs, and plot it on the bottom chart for a nice comparison. Looking at qin and qout, we can clearly see a marked difference between the two runs. The simulated failure data in the upper chart can give us a leg up on developing a predictive maintenance workflow if we don't already have enough data from the field.
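Sweeping those fault switches from a script is also an option-- a sketch with a hypothetical model name and fault variable, not the demo model's actual parameters:

    in = Simulink.SimulationInput("pumpModel");     % hypothetical model name
    in = setVariable(in, "sealLeakPlunger1", 1);    % hypothetical fault switch
    out = sim(in);                                  % simulate the faulted pump
    % out.logsout then holds the logged qin/qout signals for feature extraction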
Now that we've seen how we could create some artificial data inside Simulink to supplement our existing data sets, I want to move on to the next part of the workflow, which is deployment.
Right now, everything lives in MATLAB, which is fine for development-- it's great. But at the end of the day, we want to make sure the prototype you assemble in MATLAB can be put out into the wild somehow, so that your colleagues, team members, coworkers, and so on can access what you've assembled and use the information to make those intelligent, data-driven maintenance decisions we talked about earlier-- answering those three questions.
For this, we have a couple of different options. One, we could generate C code and directly embed the solution-- the predictive maintenance algorithm-- onto an edge device.
Alternatively, we could go more in the direction of the dashboard I showed you earlier, and compile these predictive maintenance algorithms and deploy them to a web app server or somewhere onto the cloud.
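Two sketches of those deployment paths-- the function and app names here are hypothetical:

    % 1) Edge device: generate C code from a codegen-compatible function
    %    with MATLAB Coder (example input size assumed).
    codegen predictFault -args {zeros(1, 10)} -config:lib

    % 2) Cloud/web: package a dashboard app for MATLAB Web App Server
    %    with MATLAB Compiler.
    compiler.build.webAppArchive("PumpDashboard.mlapp");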
So our full workflow looks a bit like this. Remember that we were trying to use data to make more intelligent decisions, guided by our three questions, and to do that, we leveraged this workflow.
Additionally, when we didn't have access to enough data, we simulated our physical system in Simulink and generated artificial data to supplement what little we did have.
And finally, we took that solution and showed a few options for how we might deploy it, so that our end users could actually use the information to make those intelligent decisions.
Now I'm going to pivot and show you a few of the helpful resources MathWorks provides to aid you in whatever predictive maintenance endeavors you're embarking on. Specifically, I want to call out two things. First is a two-day, instructor-led training course specific to predictive maintenance.
And second, if you need a hand developing or standing up your predictive maintenance solution, we have a great consulting team that's eager to work with you on a solution specific to your requirements.
In conclusion, today I showed you how you might quickly and easily stand up predictive maintenance tasks inside MATLAB. Even in the last 30 minutes, we addressed a basic form of both fault classification and remaining useful life estimation with our tools. Thank you for your time; we're happy to stick around and take questions from the Q&A. Thanks very much.