Cleaning and Preparing Time Series Data - MATLAB
Video Player is loading.
Current Time 0:00
Duration 20:47
Loaded: 0.79%
Stream Type LIVE
Remaining Time 20:47
 
1x
  • Chapters
  • descriptions off, selected
  • en (Main), selected
    Video length is 20:47

    Cleaning and Preparing Time Series Data

    Time series data are everywhere. Whether it is from sensors on automated vehicles and manufacturing equipment, meteorological data, or financial data from the equities market, it helps us understand the behavior of a system over time. However, real-world time series data can have many issues like missing data, outliers, noise, etc. The data needs to be cleaned and prepped first before it can be analyzed or used for model development. Unfortunately, it is not always clear how to clean this data. Which algorithm should be used for filling missing values? Should outliers be removed first or noise? How is data that is measured using different sample rates synchronized? The process is iterative and can be very time consuming. In this session, we will show you how to use timetables with the new Data Cleaner app and Live Editor tasks to identify and fix common issues in time series data. We will cover different data cleaning methods using both code and low-code techniques that can make the data prep process more efficient.

    Published: 31 May 2022

    Hi, everyone. Thank you so much for joining our MATLAB Expo talk on cleaning and preparing time series data. My name's Heather Gorr and I'm a product marketing manager for the MathWorks in the area of MATLAB, specifically, working on a lot of data science and AI applications. And I'm joined by Ono.

    Hello, everyone. My name is Onomitra Ghosh. I'm also a product manager here at MathWorks. My area of focus is primarily the data science and the data analysis capabilities that you would find in MATLAB. And today, I'm very excited to be here to talk about some of the techniques in which we can make cleaning and preparation of time series data simpler and easier in MATLAB.

    Thanks, Ono. And don't forget to follow us on social media and share your experiences using the hashtag MATLAB Expo. And we've also provided our own information. If you want to keep in touch, we tend to-- we actually love to hear from you, and like to share things on our own channels. So stay in touch, please.

    And let's start with a poll. We're going to start out with a quiz. Go ahead and put it into the poll your answer. Which of these have time series data? Lot of good answers already. Seeing some great ones. The good news is that you're all right. Exactly. Right, Ono?

    Absolutely correct. Whether it's climate data or finance data or data from autonomous vehicles or machines, jet engines, we are often presented with time series data to work with. But many times, these kind of data are not ready to be used for modeling, when they're in their raw form.

    Exactly. And that's exactly what we're talking about today, the kinds of things that happen commonly with time series data, like sensor dropouts which cause missing data and outliers. And then these sensors tend to be really noisy, where you need to smooth or detrend.

    And you have different kinds of sensors with different time steps between the measurements. So we need to figure out how to combine those things together. And we're going to see all of these today with a real world example.

    And so I'm working on an example where I'm trying to predict the true airspeed of a flight, based on a bunch of sensor measurements. But specifically, while it's cruising, using machine learning.

    There's a bunch of different data. There's 4,600 different flights, a lot of sensors. And it exactly has that problem where we have 1 Hertz samples and 4 Hertz samples. And we need to figure out how to pull those together.

    And so I've recruited my friend, or my colleague Ono here, as he works with the teams working on these things, to help me use the latest and greatest in MATLAB to clean this thing up.

    Absolutely. Happy to help, Heather. So let's get started. And let's first load this data in MATLAB. What I am doing here is loading this data into a special MATLAB data type called timetable, which is specifically designed for time series data.

    What readtimetable did was it identified the time values in the data files for the 1 Hertz data. That's the one we are starting with. And it made it the index of the timetable. So when we want a particular value in this timetable, we can just look it up based on the time index. There are many other advantages also for using timetable, and we'll cover some of those later in this talk.

    Awesome. I'd like to see the data values there that you have in the display. For me, it's a little easier to visualize them in a plot instead of just looking at those values as they are now.

    So definitely we'd like to visualize them. But since this is actually a flight data, and there are latitude and longitude and I thought there's a better way to visualize them if we actually plot them up in a map, so we can use the geoplot command to plot the latitude and longitude positions. And wait a minute, this looks strange.

    No, I don't think that's right. I don't think it landed in the middle of the ocean. We would have a whole different set of problems to solve for this one. So that must be some kind of outlier or problem in the sensor. Any ideas how best to start figuring that one out?

    Yes. So [INAUDIBLE] we need to clean this data. And a quick and easy way to get started with cleaning data in MATLAB is by using the data cleaner app. This is a new app that came out in 22A specifically designed to get us started with data cleaning. We can bring up this app by running the command data cleaner, or going to the Home tab and clicking on clean data button.

    All right. So the data cleaner app is up. Let us load the timetable in the data cleaner app. We can load it from the workspace, since we already have the timetable in there. When the data cleaner first loads timetable, it automatically plots the first non-time variable in that timetable, which in this case is the FuelQuantity. But we can check the other variable check boxes, and that will plot all the other variable values for us to look at and understand.

    Nice. I like those statistics on the side too. I noticed right away, there's a missing data count there in the first plot for 48. But I never would have guessed that from the plot. So it's really nice to have that information. We need to explore that further, I guess.

    I think so. I think that's a very good catch. So we need to do something about the missing values. But let's take a look at the other values. There are some weird things going on with pressure and temperature too.

    Yeah, that doesn't seem right. There's some drop outs, like we're talking about, where it's going to zero. That doesn't seem right.

    Yeah. So we need to-- yeah, these are outliers that we need to probably take care of. But let's also look at latitude and longitude. That's why we were here in the first place.

    Ah, there it is. And so that also goes to 0, which was 0, 0, latitude, longitude. So yeah, that's another kind of outlier, I guess.

    It is. Most likely, what happened was something got switched off, and even though the flight landed over here, it automatically went to 0. So we need to take care of these outliers and missing values.

    But notice that they are not all of the same types. So we probably need different strategies to handle them. So let's get started. Let's first take care of the missing values of FuelQuantity. So I have FuelQuantity selected over here.

    And we need to remove these missing values. And we can very easily do that by clicking on Clean Missing Data button over here.

    When we click on the Clean Missing Data button, what happens is it brings up the Clean Missing Data pane towards the right corner over here. And it's doing a bunch of different things.

    First of all, it's trying to identify where are the missing values, and it's automatically filling in those missing values using a default linear interpolation method. In addition, it's also showing us exactly where those missing values were. Remember, you said you couldn't see where those missing values were in the chart. Now we know.

    Now I can, exactly.

    All right. So I think I'm good with the linear interpolation. If we want, we can choose some other methods. But for now, let's accept it.

    Yeah, makes sense.

    All right. So now that we have cleaned the missing values, let's now take care of the outliers of pressure and temperature. And just like before, we'll click on the Clean Outlier button on the top tool strip. So what happened?

    Doesn't seem like it's getting all of them in this case. It's also a little bit busy. The plot's kind of busy. But it doesn't seem like it's getting some of those low values.

    You're correct. I think-- let's try to zoom in a little bit. So we can use the zoom tool to zoom in on the plot. And you're right, it's pretty busy. So one way to actually make it a little less busier is to only plot the things that we need.

    And we can do that by unchecking these checkboxes in the legend. And you are right. I don't think it actually removed all of these outliers. So let's try to understand what's going on.

    Just like missing values, it's trying to figure out what the outliers are, and trying to use some default algorithms to remove them. In this case, it's detecting based on moving medium, and it's detecting them with a window length of 4.

    And I think that might be a little small for these values. So why don't we change that? We change that a little bit, let's say 5. And let's see what happens.

    OK, it did change. But I don't think it still completely removed it. Maybe the window length needs to be longer than 5 seconds. So maybe change it to 8. I think this may have done the trick. But if we quickly check the clean data, ah, there you go.

    There it is, yeah.

    I think we were able to remove all of the outliers over here, so why don't you hit accept. Now let's do something about the latitude and the longitude.

    So this is a different kind of outlier. But we'll still click on the Clean Outlier method, and let's see how to address this outlier. So just like before, I don't think the default way it removes the outlier is working. So let's see what it did.

    And linear interpolation, as far as I can tell, that doesn't seem like it makes a lot of sense with latitude, longitude. Especially if something's dropping out, we want that to just kind of be the same point, right?

    Well, in that case, let's use this previous value. And that way, it will stick to the previous value, which is where the plane has landed, and it stays over there.

    But I also think that moving median might not be the best way to actually remove outliers for latitude and longitude locational values, based on how the outliers are. So we can change it to the simple median, and it actually was able to remove the outliers and put the imputed values in the right place.

    Perfect.

    So I think we were able to clean our outliers and remove our missing data. So let's go back to our script. And there are two ways we can actually move forward.

    One is we can export our variables in the workspace, but we probably need to apply this same kind of cleaning algorithm for all the rest of the 1 HZ data sets you have.

    So why don't we generate a function that we can reuse? So we just generated a function, which contains the exact cleaning commands that we use to clean this data. We renamed this function to call it Clean1HZData. And we can call it in our script over here to clean the 1 Hertz data timetable that we had.

    So this way, what we did was we used a data cleaner app to interactively figure out what the problems are in our data, figure out the right strategies in which we can clean it, and then generate out code so that we can use it for other similar data sets.

    Nice. Can we look at the visualization again, to see if we fixed it properly?

    Absolutely, and looks like that worked. I think this is a much more reasonable flight path for this plane.

    Definitely. So now we need to do the same thing with the 4 Hertz data. And the same process.

    It's really no different. It's the same steps. We did timetable. And we also used the data cleaner app. But obviously, this is a different data set. So we use different cleaning steps, and we clean those data sets.

    Awesome. So now we have two different totally cleaned data sets. But we still need to bring them together because I ultimately want to train that model with one single timetable with all of those features. So we need to sort of join them, essentially.

    You're right. We need to definitely combine these two data sets. But join is probably not the way to do it. Joining works well when we are combining any regular data sets. But these are time series data. So they are at a particular point of time.

    What we want to do is actually synchronize. And fortunately for us , there is a synchronized command in MATLAB that we can use. But the thing is, I often do not remember how exactly to use the synchronize command.

    So what I'll do is, instead of directly trying to figure out how to use the synchronize command. I will bring up a live editor task to help me with this.

    There's just so many options. That seems like a great idea.

    Right, and so what you see in front of you is basically a small app. live editor tasks are a small app that presents all the different parameters in an interactive form for us to iterate over and figure out the right way to use the function.

    So in this case, since we are synchronizing these two timetables, let's select them first. And by default, it synchronizes them to form a new timetable. You can rename them to-- rename it to set t. And look at the results. The results show us--

    A lot of NaNs. A lot of NaNs. Maybe there's a better way to fill those out. I mean, I know there's a lot of spaces whenever we're trying to combine those two in particular. Do we have any more options to choose from?

    Absolutely. So we can actually fill in the missing values, using any of the fill missing value methods that we have in MATLAB. So in this case, we can use linear interpolation again to fill in the missing values, and it automatically updates the results for me.

    Perfect. And you said something about a function, right? I tend to like the coding part too. Can you talk more about the function?

    Sure. So what the live editor does is doing under the hood, it's actually writing out this synchronized function for me. Remember, I said that I do not know how to exactly use the function, maybe the parameters.

    This is how it's helping me to figure that out. And now, moving forward, if I want to use this function for other similar timetables, I can just copy this over. I do not need to open the live editor task and use it every time.

    That's really convenient. Are there other tasks like this that the team have been working on to help with data cleaning?

    Absolutely. So if we go to the Insert tab over here, there's an option over here for task, and this shows all the different live editor that are available in MATLAB. And we are continuously increasing this set to help with specific operations or functions in MATLAB.

    That's awesome. So I want to visualize all that synchronized data. Shall we go back to the data cleaner app and try to look at all of it-- look at all of it together at once?

    Well, that will be a little bit of overkill. So what we can do is use something called a stackedplot here in MATLAB. A stackedplot is a special type of plot.

    What the stackedplot does is it plots all of the variables on the same x-axis-- in this case, since we are using a timetable, it's a time axis. In addition, it actually has all of them connected together. So if we zoom and pan, it will zoom and pan together. Also, if we hover around, it shows this line connecting all the variables. That tells us their individual values at any point of time.

    This is awesome, especially since I really only need to keep that cruise altitude, or the cruise portion of the data. And that just hinted at the fact that I could see that the altitude, if you look at all those plots together, they kind of move like that. And the altitude is a good correlation to the true airspeed. So could we use that to kind of pull out those values, and just get that data that we need?

    Right, I think when the flight is cruising, the altitude has reached the cruising altitude, and it's not changing anymore. So maybe what we need to do is figure out when does it enter the cruising mode, or remains at a constant cruising altitude, and when does it come out.

    So let's go back to the script and figure out how to do that. So we'll use another live editor task called Find Change Points. This is already configured to find the change points in the altitude values. And it will find the two change points that we are looking for, where it enters the cruising altitude, and comes out of it. So there you go. It has found these two change points. These are the points where it entered the cruising altitude.

    Perfect. And then I could get the data from those points.

    Right. So let us then find out the exact times when the flight entered the cruising altitude and came out. So looks like it started cruising at 6:40, and came out at 8:15.

    Awesome. And then we can use that to get that data from the timetable?

    Exactly. And that's where one of the advantages come in of using timetable. There's a function called time range. Time range basically takes into time and creates a range of time indexes that we can use to index into the timetable.

    And so here, we use to start and end time. And we sliced off the exact cruising part of this journey. And we can look at the stacked plot again, with that sliced off part, and now there are none of those transient states where the flight is going up or coming down. And so this is your data that you can now use for your modeling.

    That's exactly what I need for a machine learning now. So this is awesome Thank you so much, Ono. We went from raw data. We visualized in several different ways. We explored, we use the data cleaner app, which is awesome. We then combined all the data sets together, and again, cleaned by removing those transient pieces.

    And we also did this by using a couple of different ways, right? Using functions, using those live tasks, the data cleaner app. Do you have any advice of when to use what technique? They all seem great. Any advice?

    Sure. And the advice is, really, it depends on you. It depends on how familiar you are with the data, and how familiar you are with the cleaning process. What MATLAB does, it provides a set of [INAUDIBLE] command line functions that can clean your data, as well as low code capabilities, like the data cleaner app, or the live editor tasks that can help you to get started.

    Now, if you're completely new to the data set, or don't know what functions to use, start with the data cleaner, just like we did today. Or if you know that, oh, we need to maybe synchronize, but don't know how to use the synchronized command, you're already in your live script. Use the live data task.

    Or if you know that you are the person who likes to code, and you just want to go ahead and code it, feel free to just go ahead and use the command line functions.

    And I love coding. But I also don't know what I'm looking at at first. So a combination of all these things seem like they could really work for me.

    Absolutely.

    Awesome. Well, thank you so much. This is so wonderful. We've gone through all of this, and would really love to hear from you all. Please put into the chat what kind of data cleaning challenges do you run into?

    We're here to help and here to share advice. And we've got the expert here to give you a hand. So thank you very much again for your attention. And we really appreciate it. We'll share all the handouts, and look forward to more Q&A in the cat.

    Thank you, everyone. And please feel free to chime in about what your data cleaning challenges are.

    Thanks.

    View more related videos