Data Preprocessing with MATLAB - MATLAB
    Video length is 9:14

    Data Preprocessing with MATLAB

    Data preprocessing is a necessary step before creating a model, whether it be basic regression or machine learning. Data preprocessing takes the raw data and makes it analysis-ready through a variety of different processes depending on the issues with the original data set.

    Published: 5 Apr 2024

    So you want to do data analysis. If only it were as simple as taking the data, putting it in the computer, and getting a model. Unfortunately, more often than not our data is a mess. It's not in the format we want, and there's noise and extraneous or missing observations. If you try to build a model around this dirty data, it's just not going to work.

    Before we can get started on any analysis, we have to clean our data. This video is going to cover some basic ways to go about pre-processing data. But it's obviously not exhaustive, or I would be here all day. It would also be an incredibly boring video full of very domain-specific methods. But that doesn't mean that we don't have resources if you're looking for more specific information.

    There are links in the description to some of our data cleaning landing pages that can help you get started with whatever data cleaning needs you have, no matter how niche they are. The major topics we're talking about today are missing data, outliers, normalizing the data, and smoothing the data. If you're not sure which method is relevant to your problem, don't worry: we'll go over when to use each method before getting to how to do it in MATLAB.

    For ease of viewing, this video is divided into chapters. Feel free to skip ahead if there's a particular topic you're interested in. With all that out of the way, let's get started.

    Missing data. Data is never as continuous as we want it to be. Respondents fill out forms incorrectly or leave entries blank. Sometimes a sensor shorts out for a couple of seconds. Sometimes someone doesn't take the notes they're supposed to be taking. Steve. The end result, however, is the same: missing data points. Obviously, before we can do anything about these offending data points, we need to identify them. Luckily, MATLAB has some built-in functions and apps for recognizing different types of missing data, some of which are built into how MATLAB imports data sets.

    When importing a data set, MATLAB converts values of the wrong type, incorrect categoricals, and blank spaces to NaNs, undefined values, and missing values, respectively, all of which can be detected automatically using the ismissing function. Your notation for missing data points doesn't even need to conform to MATLAB's conventions. If you have unusual notations for missing values, those notations can be passed as arguments to the ismissing function.
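As a minimal sketch of the ismissing call described above: the readings and the -99 marker are made-up illustration values, not data from the video.

```matlab
% Hypothetical sensor readings; -99 is an assumed custom "missing" marker.
readings = [21.5 NaN 22.1 -99 23.0];

% Default behavior: only standard missing values (NaN for doubles) count.
stdMask = ismissing(readings);

% Custom indicators replace the defaults, so NaN is listed explicitly
% alongside -99.
customMask = ismissing(readings, [-99 NaN]);

disp(stdMask)
disp(customMask)
```

Both results are logical arrays the same size as the input, so they can be used directly for indexing or counting.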

    But you don't even need to write a line of code. In the Live Editor, you can pull in the Clean Missing Data task, which lets you not only identify missing values but also deal with them using a couple of dropdown menus. Once we've found the missing data points, we have two options: to remove or not to remove. Now, removing missing data points is relatively straightforward. You just remove them.
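For readers who prefer code to the live task, removal can be sketched with the rmmissing function. The input values here are hypothetical.

```matlab
% Hypothetical readings with two missing entries.
readings = [21.5 NaN 22.1 NaN 23.0];

% rmmissing drops the missing entries; the second output marks
% which positions of the original vector were removed.
[cleaned, removedMask] = rmmissing(readings);

disp(cleaned)
disp(removedMask)
```

Keeping the mask around is handy when a second variable (such as a time vector) needs to be trimmed the same way.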

    However, sometimes you don't want to delete the missing data points. Say, for example, that the missing data point actually stands for a specific numeric value like zero, or the data is a continuous curve and the missing data points can be interpolated. We simply select Fill missing from the dropdown, and then choose a fill method from the neighboring dropdown. To replace the missing values with a constant, select constant value. If unspecified, this defaults to zero, but it can be any scalar value.

    To replace the missing values with the nearest value, select nearest value. If you specifically want the next or previous value, specify next or previous value. As you can probably tell from looking at this dropdown, there are a variety of other ways to fill in missing values, different forms of interpolation to choose from, and even the option to create a custom function. For more information about these other fill methods, check out the documentation linked in the description.
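The fill options mentioned above have command-line counterparts in the fillmissing function. The sketch below uses an invented vector, chosen so that no gap is equally close to two neighbors.

```matlab
% Hypothetical series with two gaps of two samples each.
x = [1 NaN NaN 4 NaN NaN 7];

filledConst   = fillmissing(x, 'constant', 0);  % 1 0 0 4 0 0 7
filledPrev    = fillmissing(x, 'previous');     % 1 1 1 4 4 4 7
filledNext    = fillmissing(x, 'next');         % 1 4 4 4 7 7 7
filledNearest = fillmissing(x, 'nearest');      % 1 1 4 4 4 7 7

% Linear interpolation between the surrounding known values.
filledLinear  = fillmissing(x, 'linear');       % 1 2 3 4 5 6 7

disp(filledLinear)
```

Which method is appropriate depends on what the gap represents: a constant suits values that really mean "zero", while interpolation suits a continuous curve.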

    Outliers. Outliers are singular observations that are so outside the norm that they actually change the norm. A classic example is the factoid that the average person eats three spiders a year. The average person eats zero spiders a year. But a particular individual who lives in a cave and eats over 10,000 spiders a day is an outlier and skews the general average up. He should not have been counted.

    Now, obviously, you can't just throw out every observation that doesn't meet the pattern you want to see. But keeping outliers can seriously throw off your findings. So how do you determine which observations are outliers? In MATLAB, you can just use the isoutlier function. You then get a logical array of outliers that you can use to prune your data set. But let's do this a little bit more interactively.
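A minimal sketch of that pruning step, using invented measurements with one obviously extreme value:

```matlab
% Hypothetical measurements; 100 is far from the rest.
data = [10 11 9 10 12 100 11 10];

% Logical mask marking the detected outliers.
outlierMask = isoutlier(data);

% Keep only the observations that were not flagged.
pruned = data(~outlierMask);

disp(pruned)
```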

    If we pull in the Clean Outlier Data task, we can replace this code with a GUI that not only removes the outliers but can show us our edits in real time. When you call the isoutlier function or use the live task, the default detection method is moving median. This defines the outliers as any value more than three local scaled median absolute deviations from the local median. It's a relatively robust way to detect outliers, as it uses the median distance to the median to determine which values are so far from the center of the data set as to be outliers.
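The moving-median detection can be requested explicitly from code. In this sketch the data and the window length of 5 are assumptions for illustration. As far as the isoutlier documentation goes, calling the function with no method uses the plain 'median' rule over the whole data set, so the moving variant is spelled out here with 'movmedian' and a window.

```matlab
% Hypothetical trending data with one spike at position 5.
data = [1 2 3 4 50 6 7 8 9 10];

% Flag values more than three local scaled MADs from the local median,
% computed over a sliding window of 5 samples (assumed window length).
localMask = isoutlier(data, 'movmedian', 5);

disp(find(localMask))
```

A moving window is useful when the data drifts over time, since a value is judged against its neighbors rather than against the whole series.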

    However, it's not the most common way to determine outliers; that would be quartiles. This method uses the interquartile range, the difference between the 75th and 25th percentiles in the data set, to determine outliers. With this method, outliers are defined as anything above the 75th percentile plus 1.5 times the interquartile range and anything below the 25th percentile minus 1.5 times the interquartile range. As is often the case, there are a variety of other options for determining outliers. Check out the documentation for more details.
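The quartile rule can be sketched by passing 'quartiles' to isoutlier. The data is invented, and the fences in the comments follow the definition given above.

```matlab
% Hypothetical measurements; 100 is far from the rest.
data = [10 11 9 10 12 100 11 10];

% Quartile rule: with Q1 and Q3 the 25th and 75th percentiles and
% IQR = Q3 - Q1, a value is an outlier if it lies
%   above Q3 + 1.5*IQR   or   below Q1 - 1.5*IQR.
quartileMask = isoutlier(data, 'quartiles');

disp(find(quartileMask))
```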

    Once they're identified, we typically want to remove outliers from our analysis. But there may be reasons we want to fill outliers instead. We can choose different methods for filling outliers from the dropdown here. You may recognize some of these fill methods from the previous section on missing data. For more information about these other fill methods, check out the documentation linked in the description.
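Filling rather than removing can be sketched with the filloutliers function, which takes the same kind of fill methods as fillmissing. The data below is hypothetical.

```matlab
% Hypothetical measurements; 100 is far from the rest.
data = [10 11 9 10 12 100 11 10];

% Replace the detected outlier by interpolating between its neighbors.
filledLinear = filloutliers(data, 'linear');

% Replace the detected outlier with the previous non-outlier value.
filledPrev = filloutliers(data, 'previous');

disp(filledLinear)
disp(filledPrev)
```

Filling keeps the series the same length, which matters when the observations are evenly spaced in time.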

    Normalizing the data. Sometimes our data isn't formatted as well as we'd like. And I don't mean when you have a spreadsheet and the text data is in cells that are too small for the text to be actually visible. This is formatting the data more in the sense of making sure the data is all in the same units or in terms of relative units. The name of the game here is scaling the data. We need data that's relative to the norm, which is why it's called normalizing the data.

    This is incredibly important when doing any kind of machine learning or other similar tasks. To train a model, you're looking for patterns in the data. And normalizing the data means you're no longer looking at the original units of the data. Now, as with many MATLAB tasks, there's a myriad of ways to go about doing this. Today, however, we're going to focus on the z-score standardization.

    The z-score of a value is calculated by subtracting the mean from the value and dividing the result by the standard deviation. When calculated for every observation in a data set, the result is a standardized data set that has a mean of 0 and a standard deviation of 1. Similarly, you can compute a z-score that uses the median and the median absolute deviation instead of the mean and the standard deviation by selecting median absolute deviation from the dropdown.

    This is great in cases where your data set has outliers as the standard deviation is more affected by outliers compared to median absolute deviation. Click the link in the description for more information about z-score and other forms of normalization.
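The z-score calculation described above can be sketched with the normalize function and checked against the formula by hand. The sample values are invented.

```matlab
% Hypothetical sample.
x = [2 4 4 4 5 5 7 9];

% Standard z-score: subtract the mean, divide by the standard deviation.
z = normalize(x);
zManual = (x - mean(x)) ./ std(x);

% Robust variant: center on the median and scale by the
% median absolute deviation.
zRobust = normalize(x, 'zscore', 'robust');

fprintf('mean(z) = %.3g, std(z) = %.3g\n', mean(z), std(z));
fprintf('median(zRobust) = %.3g\n', median(zRobust));
```

The manual line is there only to show that normalize matches the formula; in practice the function call alone is enough.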

    Smoothing the data. Statistical observations come with a lot of noise, and smoothing is all about filtering out the noise. Smoothing is basically removing outliers on a greater scale, boiling the data down to just the trends. In this example, we have data about wind speed. Now, if you've ever been outside on a windy day, you know that wind speed is variable. A gust of wind comes in a wave. We're not actually interested in the wind speed at every minute. Those values are too extreme.

    What we actually want to look at is the trend over time, say the mean for five-minute periods. To do that, we call in the live task for smoothing data. Select the smoothing method moving mean and set the moving window to five. The end result is a graph where we can see the gusts of wind a lot more clearly. However, what if we want to preserve some of the features of the data, such as relative maxima, minima, and width?
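A command-line sketch of the same moving-mean step, assuming one wind speed reading per minute. The values are invented, not the data shown in the video.

```matlab
% Hypothetical per-minute wind speeds.
windSpeed = [4 8 6 10 7 12 5 9 6 11];

% Moving mean over a 5-sample window (five minutes at one sample
% per minute).
smoothed = smoothdata(windSpeed, 'movmean', 5);

% Away from the ends, each output is the mean of 5 neighboring samples.
disp(smoothed(3))            % mean of [4 8 6 10 7]
disp(mean(windSpeed(1:5)))
```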

    Then we want to use a method such as the Savitzky-Golay filter. When we switch over to the filter and set the polynomial degree to two, we get a result that, while still smoothed, preserves some of the more extreme values. Again, there are a variety of different filters that can be applied here. Check out the documentation for more information about all the ways to smooth data.
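To see the difference in how peaks are treated, here is a sketch comparing the two methods on a synthetic bump. The signal, the window length of 21, and the variable names are assumptions for illustration. A noise-free bump is used so the effect on peak height is easy to read.

```matlab
% Synthetic narrow peak with height 1 (illustrative only).
t = linspace(-1, 1, 201);
y = exp(-t.^2 / (2 * 0.05^2));

% Moving mean flattens the peak considerably.
yMovmean = smoothdata(y, 'movmean', 21);

% Savitzky-Golay fits a degree-2 polynomial in each window,
% which keeps more of the peak height.
ySgolay = smoothdata(y, 'sgolay', 21, 'Degree', 2);

fprintf('peak: original %.2f, movmean %.2f, sgolay %.2f\n', ...
    max(y), max(yMovmean), max(ySgolay));
```

On real, noisy data the trade-off is that the filter which preserves peaks also lets through somewhat more of the noise.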

    Conclusion. Those are just some of the basic ways to pre-process your data. I can't say this enough: there are so many ways to pre-process data. There are whole live tasks that we didn't cover in this video, such as computing by group, finding change points and local extrema, and removing trends. To say nothing of data-specific pre-processing live tasks like model rate conversion or designing a filter for signal processing.

    Chances are if there's a way to pre-process some data, there's a way to do it in MATLAB. For more information on data pre-processing, check out the links in the description like our data cleaning page or check out some of our other videos about data analysis. If you liked this video, make sure to subscribe. Thanks for watching and happy coding.
