Data Engineering for Engineering Data - MATLAB

    Data Engineering for Engineering Data

    Overview

    Large collections of timeseries data power applications like predictive maintenance, digital twin models, AI with signals, and fleet analytics. In this webinar, we explore options and implications for how to efficiently organize and store large timeseries datasets to support downstream applications.

    Highlights

    • Accessing raw data from different files and sources
    • Organizing data utilizing different table schemas
    • Storing data with the Parquet file format
    • Analyzing large datasets with datastores and tall arrays
    • Building AI models with out-of-memory sensor data
    • Accelerating workflows with parallel and cloud computing

    About the Presenter

    Adam Filion is a Senior Product Marketing Manager at MathWorks where he focuses on building demonstration and teaching materials for the MATLAB platform. You can also find him teaching the Practical Data Science with MATLAB specialization on Coursera and in many other MathWorks videos. He has a BS and MS in Aerospace Engineering from Virginia Tech.

    Recorded: 19 Mar 2024

Hello, everyone, and welcome to today's session on data engineering for engineering data. In today's session, I'll be covering an overview of how best to organize, store, and access large collections of time series sensor data to support downstream applications like predictive maintenance, digital twins, signal-based AI, and fleet analytics.

    To start, I'll give you a little bit of background on myself. My name is Adam Filion and I come from an aerospace engineering background. I've been with MathWorks for about 13 years now in many different roles, most of them related to data science. These days, I spend most of my time building demonstration and teaching materials to help people learn MATLAB. You can find me in many other videos on our website as well as on Coursera, where I was one of the instructors for our practical data science with MATLAB specialization.

To start, I want to talk about the big benefit of data engineering, which is that well-engineered data can speed up downstream analysis. Here I have two sped-up videos of the same analysis: training a simple AI model on a single pass through a large collection of aircraft sensor data. In the first video, we're working directly with the raw data, and we're working with it fairly similarly to how we would work with normal in-memory data. This model training will complete and give the right answer, but it's going to take a while.

If we take the time to engineer a better data set first and then leverage some of the techniques we'll talk about today, like predicate pushdown, we can do all the same analysis much more quickly. In the setup I'll describe later, this cuts the time it takes the model to pass through the data once from over 16 hours down to less than one hour. This is the same data, running on the same hardware, producing the same result. All we're changing is how we organize, store, and access our data.

That speedup is a benefit we get each and every time we go back to our data for another round of analysis. Data engineering is not required to get the job done. Instead, I want you to think about what your time is worth, because that's the big benefit we're talking about today.

    With that example in mind, we'll start today by spending a few minutes on a high level overview of the terms and workflow for data engineering. We'll then move into a MATLAB demo using some aircraft data from NASA, which is where we'll spend most of our time today. And lastly, we'll wrap up with some resources you can use after today's presentation.

    The central challenge we're trying to solve is how to efficiently organize, store, and access huge time series data sets. These data sets are often generated by fleets of equipment. The term fleet is often associated with cars or planes, but really, it refers to any collection of physical assets that are generating sensor data. So you can have a fleet of robots working on a manufacturing line, or a fleet of wind turbines, or a fleet of wearable devices and so on.

    This situation arises in many different industries and applications, and in order to derive value from the data you collect, you need a good way to store and work with it. The raw data is produced in many different forms. It could be data packets sent over the air or small, highly compressed files, or many other options. Before we can use this raw data, we must import and integrate it into some kind of central location. You'll often hear terms like database or data warehouse or data lake, so something that starts with data and stores stuff.

    The job of building and maintaining the pipelines that import and organize the data in these repositories is handled by people often called data engineers. This data is then accessed and used by data consumers for downstream applications. For smaller scale applications, these two roles may be filled by the same person, and I'm willing to bet all of you have done a little bit of both in your careers.

    On the other extreme, and really the worst case, is when these two roles are filled by separate, siloed organizations who never talk to each other and have no understanding of each other's jobs. So one goal of this presentation is to give a bit better understanding of both roles and how their requirements interact with each other. The main benefit of addressing the challenges of data engineering is you can more efficiently iterate with your data.

    Let's say our raw data is a large cloud of small binary files where each file might be the equivalent of one trip of a car. We want to take this raw data and do something useful with it. Exactly what it is doesn't really matter. It could be some simple statistics or visualizations, or maybe advanced AI. The point is, if we only had to do this once, then the situation might be fine.

    However, throughout the lifecycle of a project, you will need to iterate back to your data hundreds, or thousands of times. And if that raw data is in a painful or inefficient format, that puts a big bottleneck on your ability to make progress. So instead, we can transform our raw data into a new, easier to use format. In this example, we may transform our data into a smaller number of larger Parquet files. Then when we iterate while doing our analysis, we iterate on this new reformatted data rather than the raw data.

The first part is what we'll refer to as data engineering and the second is data analysis. The data engineering workflow follows a certain pattern. We'll borrow a term here from the database world and refer to this as an ETL workflow: Extract, Transform, Load. The first step is to extract the raw data. You need to find a way to read from the original data source in chunks small enough to fit into memory.

    The second step is transform, and this is where most of the work happens. It often starts with something simple like data type conversions, but the big thing we need to do here is enforce a schema. Schema is another term we're borrowing from databases, and it really just refers to, how is your data organized? If it's a table of data, then what are the columns, what are the data types? In order to efficiently work with a large data set, it needs to have a consistent schema throughout. If we ever need to change the schema, we need to reprocess the entire data set into that new format, so changing the schema down the road can be expensive.

Lastly, we may want to perform some standard steps like data cleaning or preprocessing or feature extraction so future users of this data set get that for free. It's because of this step that many people choose MATLAB as their data engineering tool for engineering data, because MATLAB provides best-in-class tools for cleaning and preprocessing time series sensor data. Eventually we'll be happy with the new format of our data, and the last step is to load it into a new location and decide if we want to do a one-to-one conversion or consolidate smaller files into larger files. This new location and writing style will change depending on your project and requirements, but today we'll focus on files stored in S3.

    So far, everything we've said is true of all data, but this session is specifically on data engineering for engineering data. So what's different about engineering data? Two things. Engineering data is ordered and not easily partitionable. Engineering data can come in many different shapes and sizes, but one common form looks something like this. We have a table of data with an ID for something like a trip or a flight, a timestamp for when the data was recorded, and then various sensor values.

The first thing to notice is, much of the analysis we want to do requires looking across a time series. If we want to do something like identify outliers or do frequency domain analysis, we cannot analyze rows individually. In other words, the rows are not atomic. In this example, the flight is the atomic unit of data, and we must preserve the grouping of the flight as well as the order of data within it.

    This also means if we want to partition this table into different groups on disk, we must take care to not split up our atomic units when we do so. These are some of the additional challenges engineering data introduces that we need to tackle. Before jumping into our demo, I wanted to share one example of folks using MATLAB for big data engineering and analysis at scale today.

Ford gave a great presentation at the MathWorks automotive conference on how they engineer and analyze massive amounts of automotive fleet data. They use MATLAB on top of Apache Spark to first organize and then analyze ADAS feature usage, among many other things. A common analysis task is to start by trying to find a relatively small number of interesting events, like when ADAS features engage; you can think of this as a macro operation. Then they run some non-trivial analysis on the result, such as evaluating the performance of ADAS features. You can think of this as a micro operation.

And today, we'll look at an example that is fairly similar to what's happening at Ford. So let's jump in. Our case study today involves public data published by NASA from commercial airline flights. At over 280 gigabytes of MAT files, the full data set is too large to load into RAM. It consists of over 180,000 flights from 35 aircraft and is provided as one MAT file per flight.

    The workflow we'll go through is we'll start with a single raw file and we'll prototype how to organize and store it. Then we'll apply that schema along with some preprocessing to the data set and write the results out to a new location and file format. Once we've completed that ETL workflow, we'll take a look at an example analysis problem, in this case, building a virtual sensor to predict the true airspeed for an aircraft at cruise. And we'll see how understanding the downstream analysis can help us engineer a better data set. And with that, we'll hop over to MATLAB.

    So here we are in MATLAB, and I've got a live script open that I'll use to step through our example today. If we're going to be working with big data, then the first thing we need to think about is leveraging as much parallel computing as we have available to help accelerate our analysis. So I'm going to open up a parallel pool of MATLAB workers. These are just extra MATLAB engines, extra MATLAB.exes running in the background that I can use to help parallelize my work.

    I'm going to ask for 96 of them, and if 96 seems like not a lot for a normal computer, well, that's because this is not a normal computer. This is something that I am running from MathWorks Cloud Center. So Cloud Center is just one of many different ways you can run MATLAB in the Cloud. Cloud Center gives you an easy way to bring up either a single session of MATLAB in AWS or start up an entire MATLAB Parallel Server that spans multiple machines. Working in the Cloud can be great because you can start either locally or with a small cloud machine to prototype your algorithms, and then when you're ready to scale up, just rent a bigger machine, and all the same code will work.

    Today I rented one of the biggest machines out there so we can have a lot of MATLAB workers and get through the data more quickly. I'm also working in AWS because we're storing our data today in S3. So I've already downloaded our data today into an S3 bucket here inside of this MAT file folder. You can see, it's organized with one folder per tail number per physical aircraft, and then a different folder for each year and month. And then all of the individual MAT files. So we have one MAT file per flight, so about 180,000 MAT files in total.

OK, so back over in MATLAB, we've opened up our parallel pool of 96 workers, so that's available to us. I've also used a dir command to call out to our S3 bucket and make sure that we can access it, and yes, we can. So now that we've done some basic setup, our first step is to extract the raw data. We need to figure out how to read from our raw data source. And it's always easier to start with a local copy, so I've already downloaded a local copy of the very first MAT file. And because NASA provides this data in the form of MAT files, figuring out how to extract it is trivial.
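A rough sketch of this setup, assuming S3 credentials are already configured and using a placeholder bucket path:

    % Start a pool of 96 background MATLAB workers for parallel reads and writes.
    parpool(96);

    % Confirm we can list the raw MAT files in the S3 bucket (hypothetical path).
    listing = dir("s3://my-bucket/nasa-flight-data/matfiles/**/*.mat");
    fprintf("Found %d raw files\n", numel(listing));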

    We can just load that MAT file into memory, and when we look at its contents, we can see that there are 186 different variables inside of this MAT file. So that's one variable per sensor. And if we look at the details of what's inside of that variable, each one is a struct. The struct comes with metadata, such as the sample rate and the units, and then the actual sensor values are stored in a nested double array.

    So this is not a terribly convenient format to provide data to users, and at the same time, there is some very important metadata that is stored only in the name of the file. The name of the file is the flight ID, that is the unique identifier for this flight, and it is constructed by concatenating the plane's tail number with the start time of the data logger. And this information right now is available nowhere except in the name of the file.

    So our first step is we want to think about, how can we better organize this data so our users have an easier time working with it? So now we're ready to start thinking about, well, what kind of a schema do we want to use? There are many different ways to organize engineering data. Three simple tabular forms are the wide, narrow, and nested schemas, and each one has its own pros and cons.

    The wide format is the most familiar to engineers. Each timestamp is a row, there's some metadata like the plane's tail number, and then the various sensors. The big pro of this format is having each sensor in its own column makes the analysis code very easy to write. In fact, many of the functions that operate on sensor data expect it to be in this format before you pass it to the function.

    However, there are several big drawbacks. The biggest one is that since each sensor has its own column, changing sensors requires changing the schema, which requires reprocessing the entire data set. Nearly all real world projects will change sensors throughout their lifecycle, so this format isn't commonly used for data storage on big projects. However, if you're confident your project won't be changing sensors over time, then storing data in this format can make a downstream analysis much easier.

    The narrow format pivots the wide format so that each row is an individual sensor reading. Sensors are no longer in their own columns, and instead the sensor name is specified within each row. Importantly, this means you can change sensors without changing the schema. However, you'll now have many duplicate values, vastly increasing the data size.

The nested format is similar to the narrow format with one key change. The sensor values are stored in a nested array. This means that each sensor is contained within one row. This has the benefit of dramatically reducing the number of duplicate values. However, this type of data nesting is not supported on all storage systems, and it does put a maximum size on your sensor data. Since a sensor's entire data must fit in a single row, if you left a high-sample-rate sensor running for a long time, you could record enough data with one sensor that a single row of this table would not fit in memory.

So there is no best format, each one has its uses, but due to its smaller in-memory size and the ability to change sensors without changing the schema, the nested format is the one we'll use today. So now that we've settled on a schema, we need to think about how to take the raw data we have and get it into that format. And fortunately for us, the struct2table command will get us most of the way there. This command will take a struct and convert it into a one-row table, and you can see this already looks a lot like the nested format that we saw in the slide.

For the sake of time, I'm not going to go through all of the details today, but the basic idea is we're just going to run struct2table in a loop for each of the 186 sensors, vertically concatenate the results, do a little bit of massaging to get the table into a nice format, and then wrap that up into a helper function, which I'm calling returnNested, which will take any similarly formatted MAT file and output a table. Inside of this table, we have explicitly written all the most important metadata for us to search by. So we have the flight ID, which is now a column of our table, all of the different metadata values from the struct, and then the sensor values in these nested arrays.
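For reference, a minimal sketch of what such a returnNested helper could look like; the exact column names and massaging steps are assumptions rather than the code from the demo:

    function T = returnNested(filename)
        % Convert one raw flight MAT file into a nested table: one row per sensor.
        [~, flightID] = fileparts(filename);     % the flight ID only lives in the file name
        S = load(filename);
        sensors = fieldnames(S);
        rows = cell(numel(sensors), 1);
        for k = 1:numel(sensors)
            % One-row table per sensor: metadata columns plus the nested data array.
            row = struct2table(S.(sensors{k}), "AsArray", true);
            row.Sensor   = string(sensors{k});
            row.FlightID = string(flightID);
            rows{k} = row;
        end
        T = vertcat(rows{:});
    end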

    So now we've seen how to extract the raw data from the MAT file, we've applied a schema to it, but we've just done this with one file. Next we need to think about, well, how am I going to do this with a large collection of files? And the way we're going to do that is using something called a data store.

    So data stores are the front door to big data in MATLAB. There are many different types of data stores for working with different types of raw data such as CSV, Excel, MF4, SQL databases and so on. Today, I'm going to use the file data store, which is a very flexible general purpose data store that can work with any kind of file as long as you can provide the function on how to read that file.

    So I'm going to point it to our S3 bucket, and since today is just a demonstration and we want things to work quickly, I'm just going to point it to one of the 35 aircraft. But all of the same code works exactly the same way if you point it to the whole data set, you'll just wind up waiting a little bit longer because you've got 35 more planes. We can also pass in our read function, so this is just the return nested function we just looked at. And if we look inside of this data store, we can see this does not contain the data itself. It contains the information we need to know about how to read the data, such as all the various files, and we can see there's about 4,400 of them for this one aircraft.
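A sketch of that datastore setup, with a hypothetical bucket path and folder name:

    % General-purpose datastore over the raw MAT files for one aircraft,
    % read through our custom function.
    fds = fileDatastore("s3://my-bucket/nasa-flight-data/matfiles/Tail_652/", ...
        "ReadFcn", @returnNested, ...
        "IncludeSubfolders", true, ...
        "FileExtensions", ".mat");
    numel(fds.Files)   % roughly 4,400 files for this one aircraft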

There are a number of different ways we can work with and interact with data stores. The simplest one is to simply read from it. When we read from a data store, that will read the next chunk of data into memory. And for a file data store, that means just read the next file. And so the first time we read from the data store, we'll get the exact same result that we saw earlier, because we're applying the same function to the same file. So at this point, we can access our large collection of data and get it into a good schema.

    Now we need to think about, well, what kind of preprocessing do we want to standardize on? And just as one simple example, let's say we want to take a look at the units column here. So you can see that some of these say undefined. So undefined is the standard missing data token in MATLAB for categorical data. But some of these entries also just say units. This actually also means missing. So it would be really helpful to our downstream data consumers if we would replace it with the standard missing token at this point so they don't have to worry about that.

And we're going to incorporate that preprocessing using something called a data store transform. So data store transforms are very simple. They accept two things as an input: another data store and then a function that we want to apply to it. And then when we read from the transformed data store, it will first read from the input data store, and then take that result and pass it to whatever arbitrary function we give it.

In this case, we're going to use MATLAB's built-in standardizeMissing function to tell it: look at the data we just read, and whenever you see this string inside of the units column, that actually means it's missing, so go ahead and replace it. And if we look over at the units column, you can see now those first four rows have been replaced with the standard missing token. So data store transforms make it very easy to introduce, really, any kind of arbitrary computation that you want into this data store workflow, and you can chain however many of these transforms you need together to do all of the analysis that is necessary.
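As a sketch, assuming the column is named Units and the placeholder entries are literally the text "units":

    % Wrap the raw datastore in a transform that standardizes missing units.
    tds = transform(fds, @(T) standardizeMissing(T, "units", "DataVariables", "Units"));
    t = read(tds);     % same first file, now with <undefined> instead of "units"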

At this point, let's say that's all the preprocessing that we need to do and we're ready to move to the last step in the ETL workflow, which is to write it out to a new storage location and file format. And I'm going to do that using writeall. So writeall is one of several different ways of working with data stores. This tells the data store: I'm done with my workflow. I want you to go through the entire data set, one read at a time, parallelize it with MATLAB workers if you can, and write the results out to a new location.

So I'm going to give it my transformed data store. We're going to write it out to a new folder in our S3 bucket. We're going to tell it to duplicate that folder layout, and we're going to output it as Parquet files. This takes about a minute to run, so let me jump back to the slides for just a moment and talk a little bit about those Parquet files.
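That call might look roughly like this, with a hypothetical output path:

    % Walk the whole datastore one read at a time, parallelize across the pool,
    % and write each result out as a Parquet file, mirroring the source folders.
    writeall(tds, "s3://my-bucket/nasa-flight-data/parquet/", ...
        "OutputFormat", "parquet", ...
        "FolderLayout", "duplicate", ...
        "UseParallel", true);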

So Parquet is an open-source file format for storing tabular data. It has become very popular for storing big data for several reasons. It's very fast, very compact, it's great for group-based calculations, and it speeds up analysis through a technique called predicate pushdown. So let's see how that works. At the highest level, you have the Parquet file. Within the file, you have row groups.

    A row group is the smallest amount of data you can read from a Parquet file. It can be as small as a single row or as large as the entire file. You control what goes into each row group, and they don't all have to be the same size. Within each row group is the raw data, along with metadata about that row group, which gets created when the file is written. The most important metadata is the max and min of each column.

    When reading data, we can use a tool called row filter to selectively read row groups based on the metadata. For example, if we only want the physical plane with tail number 678, row filter will inspect the min and max of the tail number column for each row group and identify row groups that could contain tail number 678.

    Then when we read the data, we only read data from the relevant row groups, which can substantially improve performance versus reading the entire data set into memory and then indexing it there. If the row group contains more rows than we want, then row filter will continue filtering in memory after reading the data off of disk. So because of how row groups and row filter interact, deciding how to organize your data on disk in row groups can have a huge impact on the speed and simplicity of your downstream analysis.
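If you want to see how a particular Parquet file has been chunked, something like the following sketch works; the file path is hypothetical:

    % Inspect the row-group layout and metadata of one of the new files.
    info = parquetinfo("s3://my-bucket/nasa-flight-data/parquet/Tail_652/652-1.parquet");
    info.NumRowGroups      % here: one row group per flight file
    info.RowGroupHeights   % number of rows in each row group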

OK, so we're back in MATLAB. That writeall call took about one minute to run. We're only working with one of the 35 aircraft here, so that transformed about 4,400 MAT files into an equivalent number of Parquet files, which totals about six gigabytes on disk. So that's a very brief overview of how you can do an ETL workflow in MATLAB. It always starts with a data store. We can then apply however many transforms we need, and then writeall gives us an easy way to take the results and write them out to a new location and format.

    Let's say at this point that's all the ETL that we need to do and we're ready to start thinking about how to analyze this new data set that we've created. Well, we're talking about consuming big data, so we're again going to go back to a data store. But now that our data is in Parquet, we can use a specialized Parquet data store to work with it. Once again, we can just point it to our bucket in S3. And just like the file data store we saw earlier, the Parquet data store does not contain the data itself, it contains information about how to read the data.

Two of the more important parameters: first is the read size. We can have this read either the entire Parquet file or one row group at a time. We can also tell it which variables, which columns of the table, we are interested in. And because Parquet uses columnar storage, this is a very fast and efficient way to read only the columns off disk that we care about. But that's how we read just the columns that we care about. What about just reading the rows that we care about?

    Well, we talked about row filter back in the slides. And we can add a row filter to our Parquet data store and then set conditions, very similarly to how we would use logical indexing in MATLAB. So if I just wanted data from one particular month, I can tell MATLAB, look at the start time for the data logger, and only give me back data if it is greater than some start point and less than some endpoint, and then assign that row filter back to the data store. And now whenever I read from my Parquet data store, it's only going to read the raw data off disk if the metadata says that this contains data from the month that I care about.
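Pulled together, that might look like the sketch below; the column names, especially StartTime, are assumptions based on the description:

    % Datastore over the engineered Parquet data: read one row group at a time
    % and only the columns we need.
    pds = parquetDatastore("s3://my-bucket/nasa-flight-data/parquet/", ...
        "IncludeSubfolders", true, ...
        "ReadSize", "rowgroup", ...
        "SelectedVariableNames", ["FlightID" "Sensor" "Rate" "Units" "Values" "StartTime"]);

    % Only touch row groups whose metadata says they could contain this month.
    rf = rowfilter("StartTime");
    pds.RowFilter = rf.StartTime >= datetime(2003,1,1) & rf.StartTime < datetime(2003,2,1);
    t = read(pds);     % empty unless this row group overlaps January 2003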

    However, the first time I read from this data store, you'll notice that I do get an empty table back. And that's because, actually, the vast majority of this data was not collected in this particular month. So row filter is a fantastic tool for searching through large collections of data to find an interesting subset when you don't have a good way ahead of time of knowing where that subset lives. So it's really sort of the find the needle in the haystack kind of problem.

    But if you're going to be going back to the same subset over and over again, there are even more efficient ways to do that, and I'll talk about two of them today. And the first one is to just be smart about how you leverage directory structures. So earlier I showed you in our S3 bucket what the folder structure looks like. And you may remember that one of the layers in that directory was the year and month when the data was recorded.

    So if we know that we're going to very often subset our data down to just a certain month or just a certain year, it can be very convenient to build that organization directly into the directory for how this data gets stored. Then I can subset my data store based on some condition, such as when that file path contains the folder that I care about. Now this removes all the files that didn't meet that condition, and so now every single time I read from this Parquet data store, I'm only ever going to touch the data that is relevant for my analysis.
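A sketch of that subsetting, assuming the year-month folder name appears in each file path:

    % Keep only files whose path contains the month folder we care about, so
    % later reads never touch anything else.
    inMonth  = contains(pds.Files, "/2003-01/");   % hypothetical folder name
    pdsMonth = subset(pds, inMonth);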

    So at this point, we figured out how to access our large collection of Parquet data, the new data we just created. We figured out how to subset down to just the month that we care about. Now we're ready to start thinking about how to analyze this data.

    It's important to remember that data engineering does not exist in a vacuum. It exists to serve a purpose, which is to support and enable the downstream analysis of this data. This means it's important to understand ahead of time what kinds of analysis will be done with this data. And we can think of analysis as falling into one of two types, what we call for each and across all analysis.

An example of for each would be: for each flight, extract the cruise section. So this type of analysis operates by group. An example of across all would be finding the fuel efficiency across all flights in our data set. This type of analysis uses the entire data set for a single operation. For each analysis starts with data access using data stores. We can then filter to a subset using row filter or one of the other techniques we'll talk about today and perform a for each computation using data store transforms, which we saw earlier in our demo.

This is the part where an intelligent choice of row groups can make life much simpler. Because our data store will read one row group at a time and because we put each flight into its own row group, we can do per-flight operations essentially for free. We can write our algorithm to extract the cruise section of a flight exactly as if we were working with a single flight in memory, because that's what the algorithm will get when we use our data store transform. As we've seen before, to execute this analysis across the full data set, we can use readall to read all the results into local memory or writeall to write the results out to storage.

    Across all analysis starts the same way, by accessing the data with data store, and if applicable, filtering and applying for each analysis. When we're ready to do an operation across the entire data set, we have two different options depending on what we want to do. If we're interested in data analysis tasks like statistics, data cleaning, or visualizations, we can use tall arrays. A tall array treats our entire data set as a single, vertically concatenated array, and allows us to write normal MATLAB code to operate across the entire data set even when it doesn't fit into memory.

Similar to data stores, tall arrays do not run immediately across the entire data set. Instead, they wait for a trigger and then figure out the most efficient way to process the whole data set. One such trigger is the gather command, which gathers the results into local memory, similar to readall.
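As a small illustration of that across-all path, a tall-array statistic over the Parquet data might look like this (the column name is an assumption):

    % Treat the entire data set as one tall table; nothing executes until gather.
    tt = tall(pds);
    avgRate = gather(mean(tt.Rate));   % triggers one optimized pass over the data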

In addition to tall arrays, there are many other functions that accept data stores as inputs, especially in the AI space, where we often want to train models on more data than fits into RAM. One easy way of doing that is using trainNetwork, which we'll use today, so let's hop back to MATLAB and see how we do that.

So we're going to start by applying some for each analysis with a couple of transforms. And you can see, we're going to chain a couple of them together. We're going to start with a little helper function we wrote, nestedToWide, to convert that nested format into the wide format that makes our analysis code easier to write. So you can see now we have each sensor in its own column. And you can see, some of them have missing data. And this is because some of the sensors have different sample rates. So we need to think about, how do we align these sensors?

And in this example, we'll just go with the simplest way. We'll fill in those missing values with whatever the previous value was. This is kind of the old-school engineering zero-order hold approach. And then lastly, we'll use another helper function to extract the cruise phase of the flight. It does so by looking at the altitude sensor and uses a function in MATLAB called ischange to look for change points in the altitude to try to find a level stretch of altitude in the middle that might represent cruise.
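Chained together, this for-each preprocessing might look like the following sketch, where nestedToWide and extractCruise stand in for the presenter's helper functions and are not shown in full:

    % For-each preprocessing: pivot to wide, zero-order-hold fill across sample
    % rates, then keep only the cruise portion of each flight.
    wideDS   = transform(pdsMonth, @nestedToWide);                 % hypothetical helper: one column per sensor
    filledDS = transform(wideDS, @(T) fillmissing(T, "previous"));
    cruiseDS = transform(filledDS, @extractCruise);                % hypothetical helper built around ischange on altitude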

    Now when we read from this last transformed data store, we're going to get the aircraft data that happened at cruise, so you can see the altitude here, at about 28, 29,000 feet. But if we keep reading again and again from this data store, you'll see sometimes, again, we're getting empty tables back. And this is because not every flight in our data set has a cruise phase. Some of these flights were very short and some of them weren't even really flights. Somebody just turned the data logger on and then turned it back off without the aircraft ever leaving the ground.

So now we're kind of back to the same situation we were in with row filter. There's overhead every time we touch a file, and it would be much faster and more efficient if our data store only ever looked at the files and row groups that are going to give us usable data. So if we've never done this analysis before, the only way to know whether a certain row group or file contains cruise data is to do at least one pass through the data.

So in this case, we'll create one more transform. We'll check if that cruise table is going to be empty or not. This gives us a single logical value, one byte of information, back. And we'll then read all of that into local memory with readall. Now you need to be a little careful with readall, because it will read everything into local MATLAB RAM, but we're just getting one byte of information per row group, so we can actually read quite a lot here.

    And what this will tell us then is at a per row group level, will this row group eventually yield usable data at cruise. And we can then subset our data store using that logical index. And so now with this new data store, every time we read, we're only ever going to get usable data at cruise, and that's the only data our data store will ever touch.
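A sketch of that one-time indexing pass; it assumes one row group per file, so the logical flags line up with the files in the datastore:

    % One pass through the data: one logical per read saying whether that
    % file/row group yields any cruise data.
    flagDS = transform(cruiseDS, @(T) table(~isempty(T), "VariableNames", "HasCruise"));
    flags  = readall(flagDS, "UseParallel", true);    % tiny result: one logical per read

    % Keep only the reads that produce usable cruise data.
    cruiseOnlyDS = subset(cruiseDS, flags.HasCruise);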

    So like I said, if you've never done this before, the only way to generate those indices is to do at least one pass through the data. But if you know you're going to come back to this application over and over again, if you know you're regularly going to be looking at the cruise phase of flight, well, you don't want to do that pass through the data every time, you want to do it once and then save that information somewhere else.

    Now this is one of those places where there's really no industry standard. Some data platforms will do things to help you with this, many of them expect you to manage this on your own. Just one example of what this might look like is something that some platforms call a sidecar file. So you think of a sidecar, sort of the little thing that rides along next to a motorcycle, so a little file that rides along next to our big data.

    And one form the sidecar file could take is another Parquet file that lives at a higher level in our directory. And inside of this Parquet file is one row per file in our data set, and then some indexing metadata. It could be at the file level, so one byte per file, just yes or no, does it contain cruise. Or if we have multiple row groups in this file, we could nest logical arrays telling us at the row group level, does this particular row group contain cruise data?

    And this way, the next time I want to come back and look at cruise data, I don't have to scan through the whole data set. I also don't have to modify the raw data, I can just come back to the sidecar file. And in the future, as we come up with more applications for this data, let's say we want to look at engine fires. There are examples of engine fires in this data set. We can do one pass through the data to figure out where all the engine fires are and then take that indexing data and just append it to this small sidecar file. So the next time we want to look at engine fires, we just have to load up the sidecar file and then we can immediately subset down to only the relevant data.
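One hypothetical shape for such a sidecar file, continuing the earlier sketch:

    % Save the indexing results next to the data so future sessions can skip the scan.
    sidecar = table(string(pdsMonth.Files), flags.HasCruise, ...
        "VariableNames", ["File" "HasCruise"]);
    parquetwrite("s3://my-bucket/nasa-flight-data/sidecar.parquet", sidecar);

    % Later, or after appending flags for a new application like engine fires:
    % reload the sidecar and subset without touching the raw sensor data.
    sidecar = parquetread("s3://my-bucket/nasa-flight-data/sidecar.parquet");
    cruiseOnlyDS = subset(cruiseDS, sidecar.HasCruise);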

    OK, so at this point, we figured out how to access the data we want to analyze, how to subset down to only the relevant data, and how to do some for each analysis to extract just the cruise phase of the flight. Let's say at this point, we're ready to train a very simple AI model to try and predict the true airspeed. In order to do so, there's a couple of things we need to do.

    First, we need to define the layers of this neural network. And since this isn't really a demonstration on neural networks or AI, we're just using sort of the simplest possible network. We'll come back to this window in just a minute. So here, we have just a single layer of 10 fully connected neurons. So about the simplest neural network you can get.

    Then we need to define the options for how to train this network. There are a lot of options and hyperparameters that you can tune, but the only one I'll point out today is the execution environment. Since I have such a small network, today I'm just leveraging my local CPU for training. However, as you grow your networks, you will want to leverage GPUs or even multiple GPUs, and if you want to do so, all you have to do is flip this one parameter from CPU to GPU or multiple GPU and now you're training with that new hardware and everything else stays the same.

Lastly, we need to do one last transform to get our data into the format trainNetwork expects. It wants all the predictors in one column and the output in another column. And then we can train our network using our data store input with the layers and options that we just described, and we'll get a progress monitor. And it's very important that you can monitor the progress of your model as it trains and stop it early if you need to.
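Roughly, those pieces fit together as in the sketch below; the predictor sensor names and the helper that packages each read into that two-column form are assumptions:

    predictorNames = ["ALT" "MACH" "SAT"];   % hypothetical predictor sensor columns

    layers = [
        featureInputLayer(numel(predictorNames))
        fullyConnectedLayer(10)              % a single layer of 10 fully connected neurons
        fullyConnectedLayer(1)
        regressionLayer];

    options = trainingOptions("adam", ...
        "MaxEpochs", 1, ...
        "ExecutionEnvironment", "cpu", ...   % flip to "gpu" or "multi-gpu" to change hardware
        "Plots", "training-progress");

    % Hypothetical helper: package each read as predictors in one column and
    % the true airspeed response in another, the form trainNetwork expects.
    trainDS = transform(cruiseOnlyDS, @toPredictorsAndResponse);

    net = trainNetwork(trainDS, layers, options);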

    The performance of these models always starts poorly because the network weights are randomized. Hopefully it will learn over time and eventually converge to a good result, but that oftentimes doesn't happen. Sometimes the model can't learn at all. Sometimes it will learn at the beginning, but then its performance will plateau at a level that's not really helpful, and sometimes it can even diverge. So it's very important that you can easily monitor how the training is going and then stop it early if you need to.

In this case, since we're just doing a simple example, I just had to do one epoch, one pass through the data, which took about 90 seconds, and it eventually converged at an average error of about 60 knots. So if you're familiar with this kind of application, you probably already know that an error of 60 knots is not a very good result, but you probably also know that we should not expect a good result here.

In our example today, we put no effort into feature engineering. We're just taking raw sensor values and feeding them in one row at a time. We're also working with the simplest possible network. Even so, this is still a good exercise to go through. It's always a good idea to start with the simplest modeling approach so that you have a baseline for comparison as you try out more advanced techniques. And now that we've laid out all the infrastructure with this simple approach, if we want to start adding more complexity, either with feature extraction through data store transforms or by trying other networks, whether we define more advanced ones ourselves with things like convolutions or LSTMs or import networks pre-trained elsewhere, in MATLAB or in other platforms like TensorFlow or PyTorch, all we have to do is swap out this layers variable and everything else stays the same.

    Once we complete the training of our model, it's just a normal in-memory MATLAB variable and we can use it to make predictions on new data. So the next time we get a batch of the relevant sensors, we can just predict with our model on that data sample. And if we look at a histogram of the prediction results, if you think all the way back to that very first slide I showed with those video recordings that ended in a histogram, this is exactly how we generated those histograms. The only difference is in those videos, we let the script run for all 35 aircraft.
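For example, applying the trained model to the next batch read from the datastore; the response column name is an assumption:

    % Predict on a fresh chunk of cruise data and look at the error distribution.
    batch = read(cruiseOnlyDS);
    predicted = predict(net, batch{:, predictorNames});   % numeric matrix of predictors
    histogram(predicted - batch.TrueAirspeed)             % assumed response column name
    xlabel("Prediction error (knots)")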

So that's about it for the MATLAB demo portion. Again, today, we started with a big cloud of raw MAT files. We went through an ETL workflow using data stores, transforms, and writeall to get that into a better format inside of Parquet. Then we used the Parquet data store as well as data store transforms and sidecar files to help us subset down to only the data that we care about and do some for each analysis. And then we were able to train a very simple AI model across all of our data using the trainNetwork command.

    Now big data is a big topic and unfortunately, we don't have time today to get into the details of everything you need to know. But there are two additional advanced topics that I wanted to briefly mention. The first is what is known as the small files problem. The idea is that having data spread across a large number of small files is inefficient, because each time you access a file, there is overhead introduced.

This is the situation we wound up in today when we created our Parquet data using writeall. Each small MAT file was turned into a small Parquet file with one row group. It is more efficient to consolidate smaller files into larger files, as sketched below. This is another place where Parquet row groups make this work easier. If we're operating on a per-row-group basis, then all our code stays exactly the same, but it is more efficient because we're working with fewer files.
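As a simplified illustration of the idea (not the custom datastore approach described next), consolidating one month of small files might look like this; the paths are hypothetical, and the RowGroupHeights option of parquetwrite is assumed to be available in your release:

    % Combine one month of small per-flight Parquet files into a single file,
    % keeping one row group per original flight so per-flight reads stay cheap.
    smallDS = parquetDatastore("s3://my-bucket/nasa-flight-data/parquet/Tail_652/2003-01/", ...
        "ReadSize", "file");
    pieces = cell(numel(smallDS.Files), 1);
    k = 0;
    while hasdata(smallDS)
        k = k + 1;
        pieces{k} = read(smallDS);
    end
    combined = vertcat(pieces{:});
    parquetwrite("s3://my-bucket/nasa-flight-data/parquet-consolidated/Tail_652_2003-01.parquet", ...
        combined, "RowGroupHeights", cellfun(@height, pieces));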

    This is one of those areas where every problem is unique and it's very important to have fine grained control over the details. So the best way to do this in MATLAB is to create your own custom data store. We have some great documentation which walks through every part of a custom data store, so you only need to figure out how to fill in the blanks with your custom behavior.

The second advanced topic is how to use MATLAB with all of the various data platforms and cloud services. Today you saw MATLAB in AWS, working with data in S3. But you can use MATLAB with many different platforms and vendors, either interactively, as we did today, or by deploying MATLAB into a production environment. So let's summarize the main takeaways from today.

Data engineering is a one-time, upfront cost that saves time for every analysis you do in the future. Again, the big thing I want you to think about is what your time is worth. As I hope you saw, MATLAB simplifies data engineering with tools like data stores. And one of the benefits of using MATLAB as your data engineering tool is you can incorporate MATLAB toolboxes and domain-specific techniques into your data engineering workflows so data consumers get those results for free.

And finally, let's wrap up with some additional resources you can use to learn more after today. If you aren't sure how to get started, reach out to MathWorks to see how we can support you. MathWorks provides a broad spectrum of support options to our customers, from guided evaluations to hands-on, instructor-led training, to consulting services and on-call technical support. So if you're looking for assistance with some of the topics from today, like solving the small files problem or getting MATLAB to run in your data platform or cloud vendor of choice, then we have many different ways to support you.

Especially relevant for today's topic is a full-day, hands-on, instructor-led training on processing big data with MATLAB, which goes in depth on many of the topics from today, including data stores, custom data formats, tall arrays, and working with clusters and the cloud. You can also leverage the MATLAB Central community site, where over two million monthly active users engage with Q&A forums, the File Exchange, and other community activities. And that's all I have for you today. Thank you very much for tuning in.
