Verification and Validation for AI Systems - MATLAB & Simulink

    Verification and Validation for AI Systems

    Overview

    Artificial intelligence has successfully been used to solve complex problems where traditional methods have failed. However, there is a trade-off between the predictive power and the explainability of the methods used. While deep neural network (DNN) models usually cannot be explained or interpreted by their users, they can provide better results than classical machine learning and traditional methods for many tasks. This is leading to a desire to put such complex models into production in safety-critical situations through the application of Verification and Validation (V&V) methods.

    By integrating the AI components into models, you can perform system-level simulations to ensure requirements are satisfied and the system is therefore deployable to the desired target platform(s).

    The presentation also outlines how Veoneer (Autoliv) successfully labeled Lidar data for verification of a radar-based automated driving system. The output of the labeling process was used to train deep neural networks that provide a fully automated way to produce vehicle objects of interest, which can be used to find false-negative events. This provided substantial benefit to the process of validating their radar sensors.

    Highlights

    Issues related to V&V for AI discussed in the presentation include:

    • Explainability: Can you explain the workings of the AI system in human-understandable terms?
    • Interpretability: Can you observe and trace cause and effect in an AI system and explain the rationale of its decision making?
    • Robustness: Is the AI system immune to spoofing and other common attacks, so that it processes reliable inputs? Adversarial inputs can help determine how robust an AI algorithm is.
    • Safety Certification: Has the AI system been developed with the safety lifecycle as a key component?

    About the Presenter

    Emmanuel is an application engineer at MathWorks who first joined the company as a training engineer. He taught several MATLAB, Simulink, and Simscape courses as well as specialized topics such as machine learning, statistics, optimization, image processing, and parallel computing. Prior to joining MathWorks, he was a Lecturer in Mechatronic Engineering at the University of Wollongong. He holds a PhD in Mechanical Engineering from Virginia Tech. He also worked as a Systems/Controls Engineer at Cummins Engine Company and as a research assistant at several research institutions in California and Virginia.

    Recorded: 13 Oct 2022

    Today, we have Dr. Emmanuel Blanchard, who has been an application engineer at MathWorks for several years now and originally joined as a training engineer. He's taught several courses on MATLAB, Simulink, and Simscape, as well as specialized topics including machine learning, statistics, optimization, image processing, and parallel computing.

    Prior to joining MathWorks, he was a lecturer in mechatronic engineering at the University of Wollongong. He also holds a PhD in mechanical engineering from Virginia Tech and has worked as a systems and controls engineer for the Cummins Engine Company and as a research assistant in several research institutions in California and Virginia. So with that, I'd like to hand over to Emmanuel. Thanks, Emmanuel.

    Thanks, again. Thanks for the introduction. So yeah, today I'll talk about verification and validation for AI systems. As you just mentioned, Ian, I worked in the automotive industry, so V&V is quite important.

    But what's changed from my days in the automotive industry is that now AI systems are more and more common. So V&V-- verification and validation for AI systems-- what I want you to remember is that it's an emerging and evolving field. It's still developing.

    But what I'd really like you to remember from this presentation is that best practices and methodologies are still being established. We're actively involved in different industries. But if you want to discuss your specific needs, we would be very happy to do so.

    So I think you're all aware that artificial intelligence has been used successfully to solve many complex problems where traditional methods have failed or did not perform well. I mean, sometimes humans can still do it-- doctors can look at a CAT scan and find whether there's cancer or not, et cetera.

    But machines might be able to be more accurate than humans. And in some cases, like self-driving cars, you don't even have a choice. So I mean, this is why you want to work on verification and validation, because well, it's actually critical. Lives depend on the quality of your work here. And you'd like to use the AI.

    So before I really start talking about V&V for AI, I'll talk about the opposite, AI for V&V, because verifying AI systems is the part that's really only emerging these days.

    The current state of AI and verification, what we're seeing from customers so far, is really more about using AI to verify non-AI things. One thing to keep in mind is that you could eventually think about using AI to verify AI systems, too, in the exact same way.

    So a quick example from a user story here. In the world of automated driving, sensing accuracy is very important. You need to prove your sensors can do the job. It's serious business.

    That's why ground truth labeling has an important role, in this case, for Autoliv's validation process. They need to annotate ground truth data. And it's very tedious, manual effort. You need to find many important events of interest.

    And for that, you use the human eye to determine objects from Lidar point cloud images. But humans spend hours and hours diving into recorded data, to analyze all events, and try to find something that makes sense.

    So instead, they developed a tool in MATLAB to alleviate some of these pains in labeling point cloud data from a Lidar sensor. Its capabilities include assisting users in visualizing, navigating, and annotating objects in point cloud data.

    So they track these objects through time over multiple frames. And then they use the labeled data for developing machine-learning-based classifiers. The output of the labeling process here is used to train deep neural nets to provide a fully automated way to produce vehicle objects of interest, which can be used to find false-negative events instead of doing all this manually.
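    As a rough illustration of that labeling step, a minimal MATLAB sketch might look like the following. It assumes the Lidar Toolbox functions pcread and pcshow and the lidarLabeler app, plus a hypothetical folder of recorded frames; the tool described here was Veoneer's own MATLAB app, so this is only an approximation of the workflow.

        % Inspect one recorded Lidar frame, then open a folder of frames for annotation.
        ptCloud = pcread("lidarFrame0001.pcd");      % hypothetical recorded frame
        figure
        pcshow(ptCloud)                              % look at the raw data before annotating
        title("Raw Lidar frame to be annotated")

        % Open the Lidar Labeler app on a hypothetical folder of recorded frames;
        % cuboid labels drawn for vehicles of interest can be exported and used
        % as training data for a deep learning detector.
        lidarLabeler("recordedLidarFrames")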

    So again, if you do it with humans, it takes as much time as it would to play back the entire data set. With a fully automated approach, you can run it on several computers.

    At the same time, you can reduce the analysis time a lot. So the slides here show the time savings that are provided, as well as the accuracy of the labels achieved, and how this approach provides substantial benefits to the validation process.

    So the conclusion: Veoneer successfully labeled Lidar data for verification of a radar-based automated driving system. The output of the labeling process was used to train deep neural networks to provide a fully automated way to produce vehicle objects of interest.

    And that can be used to find false-negative events. So again, they could use this validation process to verify their radar sensors. And obviously, there's a lot of time saving and cost saving as well.

    So now I'll talk a bit more about the opposite, V&V for AI systems. This field is still evolving. And what you see here is a trade-off between the predictive power and the explainability of the methods used, which is leading to a desire to put these complex models into production in safety-critical situations through V&V methods.

    You see here on the left that a naive Bayes classifier, for example, would be very simple to explain. But its predictive power, especially for complex applications like self-driving cars, usually isn't enough-- it wouldn't even work anyway.

    So deep neural networks are sometimes the only solution. And sometimes they're not the only solution, but they work better. The problem is that they're a black-box method. It's very difficult to explain, especially to someone without a technical background.

    So yeah, you have to weigh that trade-off. Does it matter whether people understand your methods or not? What accuracy do you get? So that's the main problem here. And putting AI into production is challenging.

    I mean, depending on the task, it could be deep neural networks, traditional machine learning. But obviously, we've reached a point where, in many, many situations, we like to put AI in production, even in safety-critical situations.

    We sometimes have no other choice-- self-driving cars. The problem is we don't really have robust processes or tooling in place to explain, verify, and validate the operational robustness of AI, and even more so if you have safety and regulatory concerns.

    So experts expect to validate the predictions. And regulators won't tolerate bad outcomes from black box AI. If something happens, they want to know the reason why. So they tend to not like this black box method so much. This is why we want to use V&V.

    And I'll start with a bit of model-based design here. But first, I'll talk about AI-driven system design in general. So let's look into the specifics here. This is a simplified workflow. The first stage is data preparation: you need to clean the data, apply human insight, and you can generate data when you don't have enough.

    The modeling part is just one part of it. So you design your models and you tune them. Hardware-accelerated training-- very often you want to use GPUs, the cloud, et cetera. Interoperability as well-- very often your model might come from different frameworks. You can import that into MATLAB, let's say.

    And then simulation and tests-- so we're getting into V&V here. The AI model is just part of a bigger system. You usually need to integrate it with complex systems. I mean, a self-driving car depends on the environment, obviously.

    Yeah, and then V&V, and finally, deployment. After you simulate and test all that, and you think it's ready to go, you need to generate code for the device. Maybe you want to have that in an app, or on an embedded system or a smart sensor.

    So what's important to note is that this workflow is not linear. In many cases, it's iterative, especially when testing uncovers improvements that need to be made. So you simulate, run some tests, you're not so happy with the results, you tune your model. You repeat the cycle many times until you're happy with it. And then you deploy.

    So let me talk about the development workflow with model-based design. And you see some similarities with AI workflow. So this MBD workflow starts with doing your research and gathering your requirements. You see that here at the top. And then it's time to move to the design phase, so a bit below, where the algorithms and components are developed.

    Then if you go further down here, you see the implementation step, where you aim to obtain C or C++ code, HDL code, whatever you're using. And finally, the integration step at the end, where it becomes part of the bigger system.

    What we want to do during all these steps is perform continuous testing to identify problems early. This is the key idea behind model-based design. If you find an error late, it's going to cost you a lot more.

    Working in the automotive industry, you test everything on computers with models first. When everything is fine, you go to a test cell. When you find new errors-- of course, your models are not perfect-- you fine-tune things, and then you go to a real test track on site and see how your engines actually work.

    So looking back at the two workflows we've seen, AI and MBD, we can easily map them together. This helps us first get a better understanding of how AI will fit into model-based design.

    So you see that here. If you look at the AI workflow, we had the data preparation, and we had an AI modeling step as well. Here, in MBD, the equivalent would be the physical components and algorithms, then simulation and test, and finally, deployment. So you see the similarities between the two.

    So if you think of the elements you can have in a Simulink model where AI can play a relevant role, we can use AI for component modeling, especially for those very high-fidelity models that take a long time to simulate. Eventually, we can speed up desktop and hardware-in-the-loop simulation. You can use a data-driven approach. You may also use this approach when first-principles models cannot be obtained. So AI can be used for modeling.

    And the other thing is you might also use AI for the collection of algorithms that are under development. In many cases, they are difficult to implement with other methods. So eventually, we might want to deploy these algorithms for, let's say, sensor fusion, object detection, et cetera.

    OK. So why model-based design? Because, again, you want early detection of design errors-- errors cost money. Typically, this is where the errors are introduced, but you detect them later. And what we want with model-based design is to shift detection towards the left: detect the errors early so there's less of a gap between when an error is introduced and when it is detected.

    So as you might know, Simulink is a de facto standard for model-based design, especially in domains such as controls engineering or signal processing. Here is a great new feature: you can now import your trained deep-learning models into Simulink, which is great because it feeds into Simulink's model-based design workflow for V&V.

    So you can have your AI networks imported directly from other tools, such as TensorFlow. We have interoperability with, like I said, TensorFlow, PyTorch, all the frameworks. Or you could develop your models in MATLAB.

    Or you could import them from TensorFlow or PyTorch and fine-tune them, like I said, bring them into MATLAB, and make a .mat file that you can reuse and simulate. And then this whole model-based design workflow for V&V is available in Simulink.

    So let me play a quick video here that's an example in which a deep-learning model is used to estimate a battery's state of charge. So let me play that. If you look inside the battery management system ECU block, we see that there's a state of charge estimation.

    You get three different methods here. One of them is using AI. You see it at the bottom as NN-- NN for neural network. And you see here, there's your neural network right here, imported from TensorFlow. You could import from MATLAB too if you wanted to. And yeah, you just import the TensorFlow model and convert it to a .mat file. You can easily do that.

    And then, for TensorFlow, we have a direct importer. Then you just import that into Simulink. And you're ready to simulate. So you can perform all your tests here. You can deploy to hardware, et cetera-- so the whole V&V workflow.
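    As a rough sketch of that import step (the model folder and file names below are hypothetical; importTensorFlowNetwork and analyzeNetwork are the Deep Learning Toolbox functions I would expect to use for this):

        % Import a TensorFlow SavedModel into MATLAB and save it as a .mat file
        % that a Simulink deep learning block can then reference.
        net = importTensorFlowNetwork("socEstimatorTF", ...  % hypothetical SavedModel folder
            OutputLayerType="regression");                   % SOC estimation is a regression task

        analyzeNetwork(net)                  % optional: inspect the layers before simulating
        save("socEstimatorNet.mat","net")    % point the Simulink Predict block at this file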

    Again, I think this is a big game changer that we've had recently. You can tune your model in MATLAB or import a model from other frameworks, import them into Simulink, which is the environment you want to use for V&V.

    So a great new feature. And again, we have a unique code generation framework after that, which allows models developed in MATLAB or Simulink to be deployed anywhere. You don't have to rewrite the original model. Automatic code generation eliminates coding errors. It has enormous value for the organizations that adopt it.

    Ian mentioned I worked for Cummins. And back in the day, we had thousands of people writing C code by hand. More and more, there's a switch towards automatic code generation, obviously. With millions of lines of code, they can't do that by hand anymore. So the power and flexibility of our code generation and deployment frameworks is really unmatched-- that's the key.
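    To make the code generation idea concrete, here is a minimal, hedged sketch of how a network might be wrapped for C code generation; the file and variable names are hypothetical, and it assumes MATLAB Coder with coder.loadDeepLearningNetwork:

        % Entry-point function (save as predictSOC.m); the network file is hypothetical.
        function y = predictSOC(x) %#codegen
        persistent net
        if isempty(net)
            net = coder.loadDeepLearningNetwork("socEstimatorNet.mat");
        end
        y = predict(net, x);   % run inference inside the generated code
        end

        % From the MATLAB prompt, generate C library code for a 1-by-5 single input:
        % codegen predictSOC -args {ones(1,5,'single')} -config coder.config('lib')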

    So now, it's actually slightly different from the V-model diagrams that you had, when you integrate AI components. But what you can then do is perform system-level simulations to ensure requirements are satisfied and the system is therefore deployable to the desired target platforms.

    So again, I want to show you V diagrams. Don't worry about the details; hopefully you're familiar with them. On the left, it's all the requirements. Towards the right, it's the testing side of it. So this is a V diagram with no AI.

    And here, what we have-- we developed that with the European Union Aviation Safety Agency (EASA). So again, we actively engage with different committees, such as the EASA committee on the W-shaped cycle for machine-learning development and V&V. So instead, what we have here is a W diagram. It's slightly different. You can see here, you go back up a little bit in the middle.

    We engage with many partners on this. So the W-shaped development cycle for machine learning is shown here. What you see, this component right here, I'll talk about it a bit later. It can include neuron coverage for model verification, prediction explainers. I'll get into these topics soon.

    All right. So what is V&V without AI? Verification and validation-- so validation-- are my requirements correct? Verification-- does the implementation satisfy the requirements? So that's certification.

    But for AI, we can have different meanings for V&V. It can be more complex than just certification. One thing that comes to mind is explainability. Why? Because, again, when we use a black box, it's not like using explicit equations or simple logic.

    So can you explain the model? Can you explain the workings of an AI model or system in human-understandable terms to other people who are not AI specialists? Closely related is interpretability: can you observe and trace cause and effect in an AI system and explain the rationale of the decision making?

    Another topic here is robustness. When you test a classical system, you can use some margins and things like that. But it can be more complex for AI-- think images, object detection, image recognition, et cetera.

    So is your AI system immune to spoofing and other common attacks, so that it processes reliable inputs? Because you could trick an AI system by adding a bit of noise if you know exactly how this type of system works. You think it can recognize an object, and then you add a few features out of nowhere, a bit of noise, and it's not going to work anymore. How do you test that with AI?

    Some other topics are data privacy-- can an attacker deduce sensitive training data from the output of an AI model or system?-- and rigor and trust, which is related to V&V here again: has it been developed through a traceable and rigorous process? And finally, safety certification-- sometimes you need to deal with it, like ISO standards, et cetera.

    So let's talk about some of these. I'll start with interpretability and explainability. Both terms describe the process of making a black-box model understandable. Typically, interpretability is used for classic machine learning and is about the causality of specific model decisions, whereas explainable AI often refers to deep learning and how the network actually works.

    So you see here the trade-off again between predictive power and interpretability. Why interpretability? First, we want to overcome the black-box nature of a model. Second, you might have regulatory requirements-- in Europe, for instance, they have the General Data Protection Regulation.

    Now, if I get back to point number one, black-box models are problematic because sometimes they're not acceptable under company guidelines, or it could be user preference. You also want to build trust with users who are not familiar with machine learning. And all things being equal, you want to pick the model that looks at the right evidence.

    And finally, the third point was debugging models. Do you have biases in the data? How do you deal with that? Where are the predictions wrong? Why are they wrong when they're wrong? And also exploring "what if" scenarios: is it still working if I have this very unusual case, et cetera.

    The goal of explainable AI, again, is: I want to understand why. I want to understand why not. I want to understand why it succeeds and why it fails, and to know when I can trust it. You can give me some explanation of the cases that don't work versus the cases that work.

    And interpretability is closely related. We have a wide range of techniques. They depend on the type of data and AI, so machine learning versus deep learning. There's a lot here, using machine learning and deep learning as well. You see tabular data, time series, images. I'm going to go through some of them here.

    So one thing I'd like to mention, if you're familiar with our machine learning apps-- it's kind of new-- is that you actually have feature ranking algorithms included in the apps.

    So here, you don't need to understand much about machine learning. You try all methods. You can use hyperparameter tuning until you get the best results according to your needs. You pick that model.

    But at the same time, you want to be able to explain it. So you can actually now, within the app-- you don't have to write code anymore-- use different feature ranking algorithms to see which features are actually important. That helps you understand which features the model bases its decisions on.
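    Here is a small programmatic equivalent of what the app does, as a hedged sketch; the data file, the response name, and the choice of fscmrmr (one of the Statistics and Machine Learning Toolbox ranking functions) are assumptions:

        % Rank predictors for a classification problem and visualize their importance.
        tbl = readtable("sensorData.csv");               % hypothetical tabular data set
        [idx, scores] = fscmrmr(tbl, "FaultLabel");      % rank predictors for the response "FaultLabel"

        predictorNames = setdiff(tbl.Properties.VariableNames, "FaultLabel", "stable");
        bar(scores(idx))                                 % most informative features first
        xticklabels(predictorNames(idx))
        xtickangle(45)
        ylabel("Predictor importance score")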

    For deep learning, I want to show another example-- deep learning instead of machine learning here; it can be quite different. Let's look at images because that's visual, easier to explain. So here, you see the coffee mug. But the algorithm says, oh, it's a buckle. So it failed.

    But the thing is we can use techniques such as Grad-CAM and occlusion sensitivity to understand what went wrong. And here, I see it's highlighting in red where it finds the features it uses to make decisions. And suddenly, it makes sense. I mean, it's detecting a watch here. And this is why it's saying buckle.

    Another fun thing you can do, shown here with occlusion sensitivity, is basically hide parts of the picture and see if you get different answers. That would help you make sense of it as well.
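    A minimal sketch of those two techniques, assuming a pretrained network and a hypothetical image file (gradCAM and occlusionSensitivity are the Deep Learning Toolbox functions I'd reach for here):

        % Explain a (mis)classification with Grad-CAM and occlusion sensitivity.
        net = squeezenet;                                            % any small pretrained classifier
        img = imresize(imread("coffeeMug.jpg"), net.Layers(1).InputSize(1:2));  % hypothetical image
        label = classify(net, img);                                  % e.g. "buckle" instead of "coffee mug"

        map = gradCAM(net, img, label);                              % where did the network look?
        figure, imshow(img), hold on
        imagesc(map, "AlphaData", 0.5), colormap jet                 % overlay the evidence heat map

        occMap = occlusionSensitivity(net, img, label);              % hide patches, watch the score change
        figure, imshow(img), hold on
        imagesc(occMap, "AlphaData", 0.5), colormap jet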

    So again, if you show that to a regulator who doesn't really understand AI much, at least we can show that we understand what's going on: with just one object in a picture, this is what the network does; with several objects, this is why it might go wrong. So you can actually explain your failures.

    Let's have another example. Here is the visual: why are my satellite pictures misclassified as pizza right here? They want to understand that. The hypothesis is that the network is focused on the curving edges of the pizza, because I can visualize that here with my color map. That's what it's doing.

    And then I'm like, well, does this happen for data other than pizza images? And yeah, it does. So the fix is to add more training images of pizza slices and of salads on plates with curved edges. That will solve your problem. So again, having an explanation for your problems will help you use the algorithm in a way that works better. So humans are happy. They can understand it. And they know what to do.

    Now, let's move on to the next topic, which is robustness. So one of the techniques people use is neuron coverage. So it attempts to create a metric with characteristics similar to code coverage for neural networks.

    So if you're familiar with traditional V&V, code coverage is a very important metric. We basically want to test every part of the code. And here, it's going to be totally different. We say that a neuron is activated if its output is higher than a user-specified threshold.

    So let's look at it more visually here. It helps you ensure adequate test data. In traditional V&V, this is your code coverage: oh, this part is not covered by tests, so maybe I need to create a few other tests to make sure my system is validated and verified.

    Here, you see a deep neural network now. What we're going to do is find out which neurons are not covered by test data. And then it's going to tell us where these neurons have not been activated. The output is never exactly zero.

    So you need a threshold. This is where the analogy between neuron coverage and code coverage breaks down a bit-- it's not exactly the same, because here we say the neurons have not been activated enough. The output is usually above zero, so you need to specify a threshold.

    Code coverage does not need a threshold: a line is tested or it's not. So based on that, you're saying, well, it looks like my test data, the data I'm looking at, is not good enough. Maybe I need more training images.

    There are some cases here that are not represented yet. So it's a bit more complex, because it's not like code, where you can say, oh, what do I need to do to cover this line? But again, one thing you could do-- and this is why I showed the example at the beginning-- is use AI to find scenarios that are going to activate these neurons.

    And again, that first AI part is used to generate the right data. That one doesn't need to be verified. So it's fine. You actually use it to make sure you trigger all the neurons, make sure they reach a certain threshold, which verifies your actual system that has an AI component in it.
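    The metric itself is simple enough to sketch by hand. The code below is not a shipped MathWorks metric; it just illustrates the idea using the Deep Learning Toolbox activations function, and the network, layer name, test data, and threshold are all placeholders:

        % Crude neuron-coverage-style metric: what fraction of neurons in one layer
        % exceed the threshold for at least one test input?
        net       = squeezenet;
        layerName = "fire9-concat";                    % an internal layer to inspect
        XTest     = rand(227, 227, 3, 100, "single");  % stand-in for real test images
        threshold = 0.5;                               % user-specified activation threshold

        A = activations(net, XTest, layerName);        % H-by-W-by-C-by-N activations
        activated = any(A > threshold, 4);             % activated by at least one input?
        coverage  = nnz(activated) / numel(activated);
        fprintf("Neuron coverage at threshold %.2f: %.1f%%\n", threshold, 100*coverage)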

    So some new sketches here. I want to investigate a network and test data limitations. Let's say I'm developing a visual landing guidance system for a plane. The neural network is going to be integrated into a control system to control the landing of the plane. The neural network will detect if there is a runway in the image and find its position.

    But the user is concerned about using a neural network, because it's really a safety-critical domain here. So they would like to use neuron coverage, in this case, to get more insight into what parts of the network are doing. Why? Because you want to make sure you've covered every case.

    So basically, you want to make sure every neuron has been used with the data you used to train all this. For instance, how do parts of the network behave if there's no snow cover, or if the plane is landing at dusk, or at night? These are some examples of things you might want to look at.

    Another thing about robustness is using adversarial inputs to help you determine if an AI algorithm is robust. So one concern when developing neural networks is robustness, again. And it has been shown that neural networks can misclassify inputs due to small imperceptible changes.

    If you have a runway here, and you just add a bit of noise-- and I've seen some very interesting presentations on that, actually-- you only have to create some tiny features to make it go wrong. It's not that complicated.

    So you really want to make sure that if you add a bit of noise to your signal or your images, you're still going to classify this as a runway. You'd be very surprised. I mean, I don't have time to go through that in detail.

    But with just the right type of noise, it would really look to the human eye like there's almost no difference between the two. Yet these tiny changes might make an AI system that has been trained on good images go completely wrong. So you want to test for robustness: generate some noise and make sure it's still working.

    If you want, we have some examples of that. I'm showing you here a very simple one, detecting the digits 0 through 9, and how to train networks to make them less susceptible to adversarial inputs. We have some examples in the docs.
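    For reference, a minimal fast-gradient-sign-style sketch is shown below: nudge an image in the direction that increases the classification loss and check whether the prediction flips. This follows the general technique rather than any specific shipped example; the image file and the epsilon value are assumptions, and the pretrained network is only a stand-in.

        % FGSM-style robustness probe using Deep Learning Toolbox building blocks.
        pretrained = squeezenet;                                       % any small pretrained classifier
        classNames = pretrained.Layers(end).Classes;
        lgraph     = removeLayers(layerGraph(pretrained), pretrained.Layers(end).Name);
        net        = dlnetwork(lgraph);                                % differentiable version of the network

        raw = imresize(imread("runway.jpg"), [227 227]);               % hypothetical test image
        X   = dlarray(single(raw), "SSCB");
        [~, idx] = max(extractdata(squeeze(predict(net, X))));         % currently predicted class
        T   = single(onehotencode(classNames(idx), 1));                % one-hot target for that class

        grad = dlfeval(@inputGradient, net, X, T);                     % loss gradient w.r.t. the pixels
        Xadv = X + 0.02 .* sign(grad);                                 % adversarial copy (epsilon = 0.02)
        % Compare predict(net, X) with predict(net, Xadv) to see how easily it is fooled.

        function g = inputGradient(net, X, T)
            Y    = squeeze(predict(net, X));                           % softmax scores for this image
            loss = crossentropy(stripdims(Y), T, DataFormat="CB");
            g    = dlgradient(loss, X);
        end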

    Finally, I get back to safety certification, which relates, again, to the beginning of my presentation on model-based design. Experts expect to validate predictions. And regulators won't tolerate bad outcomes from black-box AI. The way it's working at the moment, there are very few products in this area yet, because best practices and methodologies are still being established at this point.

    So instead, you have standardization efforts and white papers. We're actively involved in that. Different industries are making progress on verifying AI in systems through white papers, standards, and planning. For instance, in automotive, you have the standards; you can go look at them.

    Some examples of white papers in this area: in aerospace, again, I talked about the European Union Aviation Safety Agency-- some concept papers, with standards expected in 2023. For medical devices, an FDA action plan would be an example.

    So what I'd like you to remember here is that we're working with partners around the world in different industries. And what do they want? When we work with you on V&V standardization, if there's something important, it should be in our tools.

    So my message here is just know that these things are going to change a lot in the future. And I mean, please work with us. Contact us. And let us know what your needs are when you're interested in that, because it depends on different industries, different AI models. It's very dependent on each case.

    I'll show you an example of where it's been used. Khawaja Medical Technologies-- they have to deal with IEC 62304, a standard for medical devices. Their challenge: they're analyzing ECG, or electrocardiogram, signals to detect cardiac abnormalities.

    Again, this is critical. If you don't find one, someone might die, so important stuff. And so obviously, regulators want that to be certified. So there's ISO and IEC certification.

    And they used model-based design to model, simulate, and generate production code for ECG analysis software. So they did all the modeling-- you see here the model diagram-- they used AI, and they could generate production code. The development time was reduced by 40%.

    And the certification was accelerated. The prototype could be built in months, not years. So you're going to see more and more of these examples in the future in different industries. For some of them, it's more critical, and they put more effort in early on. For some of them, it's still too difficult. But again, the message is: get involved. Talk to us.

    So now the conclusion: there's a trade-off between the predictive power and the explainability of the methods that are used, and as a result, there's a desire to use V&V to put complex models into production when it's safety-critical. Putting AI into production is challenging, especially when you have to deal with safety and regulatory concerns.

    And what you can do is integrate the AI components into your model-- you see the W diagram here at the end-- to ensure requirements are satisfied and, therefore, that the system is deployable to the desired target platform. But again, this one is specific to one industry.

    So we'd really like to talk to you, see what your concerns are, what your issues are. This is an important topic for us. We're putting a lot of effort into it. It is going to change a lot in the future.

    So I just wanted today to make sure you are aware of some examples, and some of the issues people are looking at, and where we're headed. And thank you very much for listening. I'll take your questions now.
