Fault Injection Testing and Simulation-Based FMEA
Dr. Marc Segelken, Principal Application Engineer, MathWorks
This talk will demonstrate an approach to fault effect and safety analysis conducted through simulation. The focus will be on a methodology that allows for the injection of faults into a system model without necessitating any changes to the initial design. This technique is applicable to a variety of modeling environments and is particularly adept at handling faults that are either timed or conditionally triggered by the system's behavior.
Participants will gain insights into how to examine the impact of faults using simulation inspection tools to assess the robustness of their systems. The talk will also address the execution of comprehensive safety analyses, including the industry-standard failure mode and effects analysis (FMEA), by leveraging the detailed insights that simulation provides.
The session will further reveal strategies for establishing clear and formalized connections between system faults, associated hazards, and the logic for fault detection and mitigation. These strategies are crucial for creating a thorough safety analysis framework that can be integrated into the overall system design process.
Published: 3 Jun 2024
Today, I'm excited to talk about our latest possibilities in the area of fault injection testing and simulation-based FMEA. FMEA is an abbreviation for Failure Mode and Effects Analysis. But having this in the title was too long, I was told. And it would have gotten even longer if we had added the term criticality to it, which then gives Failure Mode, Effects, and Criticality Analysis, usually abbreviated as FMECA.
And this is exactly the first part of what we are talking about today: those safety analyses, specifically in those tables. And the other part is fault injection testing. This is about modeling faults for your system, for your architecture, without modifying the architecture, that is, without committing any changes, and being able to simulate it without burning any hardware. So you do this completely virtually to check whether your system is robust against faults that would otherwise cause maybe severe reactions.
So since in the overall process, safety analysis comes first and fault injection testing only follows after it, this is exactly the order of things we're talking about today. Most important, you will see that everything is connected: the different artifacts, the different tables, the models, and the faults. So this is the other main part, which is not listed as an agenda item because it's part of both agenda points that we have listed here.
So today, you will learn how to conduct or improve safety analysis, considering hazards and failure modes, modeling faults, simulating faults, and having full traceability over the whole process with a detailed connection between all the artifacts, which we sometimes also call a digital thread if everything is connected in such a way.
So first of all, maybe some overview, the big picture. How are safety analysis and fault injection testing connected in the overall process? So we have a strongly simplified process being shown here, which starts at the point of systems engineering.
And in systems engineering, safety analysis is an important part to identify hazardous events and define mitigation strategies. Typically, this is covered in the Hazard Analysis and Risk Assessment table, usually abbreviated as HARA, where all this information goes in.
Once the architecture is defined, including the safety concept that contains all the mitigation strategies, we can refine the safety analysis by going to the next level, the implementation level, to come up with the FMEA or FMECA tables that define the details about the failure modes and the effects based on the different components or subsystems of the architecture level.
A key requirement here is always, especially whenever we have mitigation strategies in place, some countermeasures that the implementation should take care of, that we need to verify that the detection mechanism is working once the implementation is finished. And this is exactly what fault injection testing is all about: double-checking whether the strategies from the safety analysis part, from the FMECA, are working properly.
As you notice, the safety analysis is in parallel with the architecture development and with the implementation. And this just emphasizes, again, that this is going hand in hand. These are not isolated processes. It's really working together with developing the architecture and the safety concept or the implementation with detection and mitigation strategies.
It is not uncommon that, for example, during the fault injection testing phase, you notice some gaps, things that haven't been taken care of. So in that case, you would need to iterate back to the safety analysis phase that was done before. So this is very typical.
So this overall process, this simplified process for safety analysis, looks very similar if we talk about security analysis, just based on threats and possible attacks that we would need to take care of. But the basic structure would look the same.
So let's compare the situation that we want to have with how it's done today. Typically, today, engineers are using Excel tables to manually fill out most of the content. Some content is sometimes semiautomatically filled in, but most of it is manual work. And typically, we have some loosely maintained links to, for example, the architecture based on file links or some string path definitions and stuff like this. And the same thing for linking to the detailed design logic and to the requirements.
So the problem with this is, of course, that these are just loosely maintained links, so you can easily run into inconsistency issues, for example. So typically, nowadays engineers are automating part of the work, at least for some semantic analysis of the table by, for example, coming up with Visual Basic or Python scripts to do some checking on those tables.
The problems with this approach are that it's totally decoupled from the design work and it's very complex. And of course, as is typical for things like this, it is very prone to manual error. So how can we improve the situation? What would model-based safety analysis look like?
So here's a quick picture that we will build up incrementally in the subsequent steps in the strongly simplified workflow. The point is that with model-based safety analysis, we have a fully integrated solution with the design, which is fully traceable between all the different artifacts, including the models, faults, and that stuff.
It's easy to provide some consistency and validation scripts to check whether the tables that we are creating here are meeting some consistency and completeness guidelines. And later on, you will see that this even allows us to reference the faults, the modeling of the faults, and even the simulation results in this table here as well. So everything comes together.
So let's see how we create this overall picture here. Let's start with, where do the tables come from? Here, we have the possibility to start from some templates that are already defining the structure, the different columns that, of course, you can extend, you can fill in your own content here. You can provide links to any artifact that is relevant. This is all possible here.
And we're also shipping some validation checks already to make sure that your tables are consistent and complete once you've filled out all the content that is required here. Let's take a simple example of a fault tolerant fuel control system and see how this looks in a simplified workflow. Let's take an FMEA table as a template.
And you see here on the top, the table is already defined. We have already some predefined data types being used in the table. Plus, we also have predefined analysis functions, again, checking consistency, completeness, and whatever else you want to check here at this point.
By the way, the terminology we're using here right now comes typically from the aerospace environment. Templates for the automotive area will be provided as well. And, of course, you can define your own templates already nowadays. So whatever needs you have here, you specify your own template according to your needs.
So once we have come up with a table, now the next step is to fill it out. So let's do the hazard assessment. So this works like this.
For all the different system functionalities, we're identifying the possible hazardous events. And depending on the operational scenario, we are describing the effects of it. And this way we're able to classify the risk category here and add any other information to this table that is helpful in this context.
Most parts of this table have to be filled out manually, of course, because it's manual work, but not all of it. Because we're in a model-based safety analysis environment, we want to have things automatically being filled out as far as possible. Meaning, for the linking of functionalities and the architecture, for example, we can just use the links and have the possibility that it's always synchronized with the model structure that we have here.
So this will be the first part of our safety assessment, safety analysis. The next step would be to refine the safety analysis with this FMECA approach, Failure Mode, Effects, and Criticality Analysis, now referring to specific components on the architecture level, being much more precise. This is the FMECA table now that we are also, again, creating from a template, which is linked to the architecture, as well as to the hazardous events we specified in the other table already.
So later on, we can populate this table with all the details. So what's the difference between both tables? Again, HARA, the hazard assessment, is referring to system functionality and what can go wrong there, while FMECA is a more detailed look into certain components, analyzing the causes and the effects they have, also, of course, assessing the criticality, and later on allowing for prioritization and defining the strategies for mitigating or avoiding the risks that are involved with this.
So next thing we need here is to come up with faults that are connected to the failure modes. So let me talk about how to model faults in such a way that we can not just link it here to the table, but even later on be able to simulate it.
In former times, it would have been done like this, because there was no other way. In former times, you had to modify your Simulink model, your architecture, to explicitly put in what's called a fault adapter, which realizes the fault injection based on, for example, a control parameter that decides whether it propagates the nominal signal or replaces it with some faulty values.
So this was how it was done in former times. But, of course, this approach has a lot of downsides. First of all, it modifies the design, especially if you want to generate code out of it. That's not what you want to see in the code.
Second, it can accidentally change the simulation behavior without you noticing it. Next, it's difficult to analyze the effects because it's not connected to anything, so you don't know whether this mechanism here, for example, in the end, results in some successful application of a detection and mitigation strategy. And next, of course, how do we relate this to the hazard tables that we have already created? So there's a lot of information to keep track of, which this approach doesn't help you with, of course.
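The older fault-adapter pattern described above can be illustrated outside of Simulink with a minimal Python sketch. The function names and the toy control law are assumptions made up for this illustration and do not come from the talk or from any product. The point is only that the switch between nominal and faulty values sits inside the design itself.

```python
def fault_adapter(nominal, fault_enabled, faulty_value):
    """Pass the nominal signal through, or replace it with a faulty value."""
    return faulty_value if fault_enabled else nominal


def controller_step(o2_reading, fault_enabled=False, faulty_value=0.0):
    # The adapter is part of the design: the fault_enabled parameter and the
    # switch would also show up in any code generated from this design.
    o2 = fault_adapter(o2_reading, fault_enabled, faulty_value)
    return 1.0 - o2  # hypothetical control law, for illustration only


print(controller_step(0.5))                      # nominal path
print(controller_step(0.5, fault_enabled=True))  # faulty value injected
```

Every consumer of `controller_step` now carries the extra fault parameters, which is exactly the intrusion into the design that the talk lists as a downside.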
But this was the past. As of R2023b, we're shipping a new product called Simulink Fault Analyzer that helps you a lot with this. And especially the way to model faults has now become much more elegant, because the advantage here is that you don't have to modify your architecture or model at all anymore.
The faults live in a completely different location: on the right-hand side, they are modeled in a fault model library. So here we model what exactly the fault behavior is like. And it's just linked to the signals or to the inputs or outputs of a component on the left-hand side.
And it's different files. And also the link information is not part of the model, but part of the faulty behavior. So no changes to the model needed, no new timestamp, no nothing.
Someone who is modeling the faults doesn't have to touch the model. On top of that, everything is done in the user interface, so it's still very convenient to do this. Let me show you the steps for how to do this in the subsequent slides to give you a better understanding of how this works.
So we choose one of the signals, for example here, the EGO input of this component. We want to introduce another fault, marked by this flash symbol, by the way. And then the graphical user interface with the dialog comes up, which allows you to specify the three main aspects of the fault. First, where exactly the fault is supposed to happen. So that's typically connected to a port, an inport or outport. You still have a choice if you have multiple connected components here.
Then how exactly the behavior is being injected. So here we have a number of shipped defaults that you can use if you like, but you can also specify your own behavior. We will see this soon in some video. And third, when this fault is to become active, triggered, based on time or conditions.
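The three parts of a fault definition (where, what, when) can be captured in a small data structure. The following Python sketch is only an illustration under assumed names (`Fault`, `is_active`, `apply`, `ego_stuck`). It is not the data model or API of Simulink Fault Analyzer.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Fault:
    name: str
    port: str                            # where: the port the fault attaches to
    behavior: Callable[[float], float]   # what: maps nominal value to faulty value
    trigger: str                         # when: "timed", "conditional", or "manual"
    start_time: Optional[float] = None
    condition: Optional[Callable[[Dict[str, float]], bool]] = None

    def is_active(self, t, signals, manual_on=False):
        if self.trigger == "timed":
            return t >= self.start_time
        if self.trigger == "conditional":
            return bool(self.condition(signals))
        if self.trigger == "manual":
            return manual_on
        raise ValueError(f"unknown trigger kind: {self.trigger}")

    def apply(self, t, signals, manual_on=False):
        """Return the value seen at the port, faulty if the fault is active."""
        value = signals[self.port]
        return self.behavior(value) if self.is_active(t, signals, manual_on) else value


# Hypothetical example: the sensor input reads zero from t = 5 s onward.
ego_stuck = Fault("ego_stuck", port="ego", behavior=lambda v: 0.0,
                  trigger="timed", start_time=5.0)
print(ego_stuck.apply(4.0, {"ego": 0.7}), ego_stuck.apply(5.0, {"ego": 0.7}))
```

Keeping the three parts as separate fields mirrors the dialog described in the talk, where each part can be filled in independently.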
So let's see some small video outtakes of how this works in practice. Here, you'll see how we can, first of all, create another fault, for example, for this specific inport. So there's already one there. We're adding one.
You can have any number of faults. We're, again, specifying which port along the signal connections. We can give it a name, of course. And this is already the "where."
We can specify the behavior later, if we like. We're doing it now already. The fault has been registered. And it would also be displayed in another table that's not part of this video now. You will see it later, where all the faults are being captured and you can selectively activate or deactivate them.
So next thing is the behavior. As I said, we have a number of defaults that you can choose from or, say, you want to create it on your own. In that case, you can use a normal, small Simulink subsystem that is part of the fault model library where you can specify what exactly should happen once this fault is active.
And, again, it's a separate fault model library. We're not changing the model that we're talking about here. So this is how a fault model, for example, could look like, that we could also reuse in different parts of our architecture if we like.
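The idea of keeping fault behavior and link information outside the design can be sketched in Python as follows. All names here (`design_step`, `FAULT_LIBRARY`, `fault_links`, `step_with_faults`) are hypothetical and chosen for the illustration. The real tool works on Simulink models and a fault model library, not on Python functions.

```python
def design_step(inputs):
    """The design under test. It is never edited for fault injection."""
    return {"fuel_rate": 2.0 * inputs["throttle"] + inputs["ego"]}


# Separate "fault model library": reusable behaviors, independent of the design.
def stuck_at(value):
    return lambda nominal: value


def offset(delta):
    return lambda nominal: nominal + delta


FAULT_LIBRARY = {
    "stuck_at_zero": stuck_at(0.0),
    "offset_plus_one": offset(1.0),
}

# Link information is stored with the faults, not inside the design.
fault_links = {"ego": "stuck_at_zero"}


def step_with_faults(inputs, active_links):
    """Overlay the linked fault behaviors on a copy of the inputs, then run the design."""
    perturbed = dict(inputs)
    for port, behavior_name in active_links.items():
        perturbed[port] = FAULT_LIBRARY[behavior_name](perturbed[port])
    return design_step(perturbed)


nominal_inputs = {"throttle": 1.0, "ego": 0.5}
print(step_with_faults(nominal_inputs, {}))           # no fault active
print(step_with_faults(nominal_inputs, fault_links))  # ego stuck at zero
```

Because `design_step` stays untouched, the same behavior from the library can be linked to other ports without any change to the design.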
So coming back to our tables here. That's exactly the point now where we have defined the fault, including the fault semantics. And we're able now to add it to our tables, link it in that case to the FMECA table, to the failure mode, because this fault is causing this failure mode, in that case.
Of course, we have to fill in all the other parts of the FMECA table, as I said, criticality and priorities, to come up with strategies that define how to either mitigate or avoid the risk we're talking about here. In case we're talking about mitigation strategies, we always also need a detection mechanism, which is very important as part of our implementation, because we first need to detect the situation in our system in order to activate mitigation strategies, countermeasures, and react in some reasonable way to handle this situation.
And, of course, we also want to link exactly this detection mechanism, because we have an implementation at this point already. In that case, here is a Stateflow chart that has the task of identifying that the fault occurred, and it also has to identify the failure mode connected to it and kick in the mitigation strategy by going to a fail-safe procedure, for example.
So in case you're a little bit confused by so many tables and links in between everything, there is an easy solution to keep the overview, by simply using the power of MATLAB Graph, for example, to create such a graphical representation of all the different artifacts and where the links are going to. So this way you can easily navigate in between all your different artifacts that are linked somewhere in your tables. This you can always do apart from all the other links that you can use to navigate back and forth in between the tables themselves. So this way you're not losing the overview.
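The talk does this with MATLAB graph functionality. As a language-neutral illustration of the same idea, the sketch below stores artifact links in a plain Python dictionary and walks them breadth-first. The artifact names are invented for the example.

```python
from collections import deque

# Hypothetical traceability links: each artifact points to the artifacts it links to.
links = {
    "HARA: engine stalls": ["FMECA: O2 sensor stuck"],
    "FMECA: O2 sensor stuck": [
        "Component: O2 sensor",
        "Fault: ego_stuck",
        "Detection: chart state O2_fail",
    ],
    "Fault: ego_stuck": ["Fault model: stuck_at"],
}


def reachable(start, links):
    """Return every artifact reachable from start, in breadth-first order."""
    seen = [start]
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for target in links.get(node, []):
            if target not in seen:
                seen.append(target)
                queue.append(target)
    return seen


for artifact in reachable("HARA: engine stalls", links):
    print(artifact)
```

Starting from a hazard, the traversal lists the failure mode, the component, the fault, the detection logic, and the fault model that are connected to it.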
So another thing I would like to point out here. I mentioned already that we have these validation checks that are even coming with the templates. And you're, of course, able to extend them or to define your own validation checks for your table.
The purpose is to make sure your table is consistent and complete with all the contents you've provided. That, for example, nothing is simply missing because you have empty cells, and nothing is contradicting some other information in your table. So think of this as kind of like modeling guideline checks, not for the model, but for the tables, to make sure that you really capture everything that's needed in those tables.
So going back to the fault definition. We saw already how to specify the "where," where the fault is to be injected, and the "what," the behavior of the fault. Now, we're talking about the "when," when the fault is becoming active. And this is the bottom part of this dialog that we saw already.
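A completeness and consistency check of this kind can be sketched in a few lines of Python. The column names, the row layout, and the `validate` function are assumptions for this illustration and are not the checks shipped with the templates.

```python
def validate(rows, required_columns, known_hazards):
    """Report empty required cells and references to hazards that do not exist."""
    issues = []
    for number, row in enumerate(rows, start=1):
        for column in required_columns:
            value = row.get(column)
            if value is None or not str(value).strip():
                issues.append(f"row {number}: '{column}' is empty")
        hazard = row.get("hazard")
        if hazard and hazard not in known_hazards:
            issues.append(f"row {number}: unknown hazard '{hazard}'")
    return issues


# Hypothetical FMECA rows referring to hazards from a HARA table.
hara_hazards = {"H1", "H2"}
fmeca_rows = [
    {"failure_mode": "O2 sensor stuck", "hazard": "H1", "detection": "O2_fail"},
    {"failure_mode": "Speed sensor dropout", "hazard": "H7", "detection": ""},
]

for issue in validate(fmeca_rows, ["failure_mode", "hazard", "detection"], hara_hazards):
    print(issue)
```

The second row is flagged twice: its detection cell is empty, and it refers to a hazard that the HARA table does not contain.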
And you see here, for example, in this short video that you can choose between time-based, like at a specific time after simulation starts, conditions, or even manual triggering is possible if you are conducting your simulations in an interactive way, for example. This is also possible.
A nice way to specify this "when" is based on a condition, because it allows more flexibility. In that case, we're referring to certain conditions, like in this example. If the speed sensor gives a signal value that is above 10, we want to activate a fault that occurs exactly in this situation.
So this is working by, first of all, giving the condition a name, in that case "speed high," specifying that some variable has to become greater than 10, and then as the next step linking this expression to the corresponding signal in the architecture. In this case, it's the speed input, where we're just selecting the signal. And this maps the condition onto the speed.
And then in the end, we just need to say, this is our condition, "speed high." That's the name we provided here. So this fault is activated whenever the speed is above 10. So something then breaks or has a strange value or whatever. That's the connection between the timing here and the fault injection.
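The three steps for a conditional trigger (name the condition, write an expression over a variable, map the variable to a signal) can be sketched like this. The `Condition` class and its methods are hypothetical and only mirror the steps described in the talk.

```python
import operator
from dataclasses import dataclass

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "<=": operator.le}


@dataclass
class Condition:
    name: str          # step 1: give the condition a name
    symbol: str        # step 2: expression "symbol op threshold"
    op: str
    threshold: float
    signal: str = ""   # step 3: filled in when the symbol is mapped to a signal

    def map_to(self, signal):
        self.signal = signal
        return self

    def holds(self, signals):
        """Evaluate the expression on the current value of the mapped signal."""
        return _OPS[self.op](signals[self.signal], self.threshold)


speed_high = Condition("speed_high", symbol="x", op=">", threshold=10.0).map_to("speed")

print(speed_high.holds({"speed": 9.0}))
print(speed_high.holds({"speed": 12.0}))
```

A fault that uses `speed_high` as its trigger would be active in exactly those simulation steps where `holds` returns true.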
So once we have defined all this, like the faults, the behavior, exactly the timing, when it's supposed to happen, we can now run the simulation. And let's take an example where the O2 signal is being stuck after some time of 5 seconds here in this example. We can use the normal Simulink Data Inspector to see the outcome of the simulation.
So we have all the different displays here. In the top one, you see the fault: is it active or not active? In the middle, you see the detection mechanism: does it recognize it or not? And at the bottom, you see how the signals are evolving over time.
And you see here that after the 5 seconds, the fault is injected. The control logic is detecting this fault as well. And it's reconfiguring the fuel mode to be rich. So the mitigation strategy is kicking in, solving the problem of the fault that occurs here. So that means in this situation, the system reacted as it should.
And that means especially for the FMECA tables, for example, that we could now check mark this part, because the whole thing here was tested by simulation. And we saw that the implementation is doing what it's supposed to do. And it allows us to say in this corresponding line in the FMECA table, this has been verified.
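The scenario above (sensor stuck after 5 seconds, detection, switch to a rich fuel mode) can be mimicked with a toy discrete-time loop. The signal shape, the stuck-signal detector, and all names are assumptions made for this sketch. The real example runs a Simulink model and is inspected in the Simulink Data Inspector.

```python
def simulate(fault_time=5.0, dt=0.5, t_end=10.0, stuck_samples=3):
    """Toy run: O2 reading freezes at fault_time; detector flags an unchanging signal."""
    log = []
    held = None       # value the sensor is frozen at once the fault is active
    last = None       # previous sample seen by the detector
    unchanged = 0     # consecutive samples without any change
    detected = False
    for k in range(int(round(t_end / dt)) + 1):
        t = k * dt
        nominal = 0.5 + (0.1 if k % 2 == 0 else -0.1)  # toy oscillating reading
        fault_active = t >= fault_time
        if fault_active:
            if held is None:
                held = nominal  # freeze at the value seen when the fault starts
            o2 = held
        else:
            o2 = nominal
        # Hypothetical detection logic: too many identical samples in a row.
        unchanged = unchanged + 1 if o2 == last else 0
        last = o2
        if unchanged >= stuck_samples:
            detected = True
        # Mitigation: fall back to a rich mixture once the fault is detected.
        fuel_mode = "RICH" if detected else "NORMAL"
        log.append({"t": t, "fault": fault_active, "detected": detected,
                    "fuel_mode": fuel_mode, "o2": o2})
    return log


for record in simulate():
    print(record)
```

The three columns `fault`, `detected`, and `o2` correspond to the top, middle, and bottom displays described in the talk. The delay between injection and detection depends on the assumed `stuck_samples` setting.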
Now, we have, of course, bigger tables with lots of entries. And we have the nice button, Run For All Faults. So we can automate the whole thing completely, just based on the linking, because this FMECA table already links the fault, or maybe several faults if we have multiple failures we want to simulate.
So the simulation automatically goes through the table row by row, activates the linked faults, and runs the simulation. The detection mechanism is also linked, for example by linking to a certain state in your Stateflow implementation that recognizes the fault situation.
So after the simulation, we know automatically whether the fault was recognized, detected, or not. And based on this information, we can automatically populate the table with either green or red buttons: green saying the detection and mitigation strategy was verified, or red saying this was not detected by the algorithm, so we maybe need to improve something here in the implementation.
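The row-by-row automation can be sketched as a loop over table rows. The table layout, the `run_for_all_faults` function, and the stand-in simulator `toy_simulate` are all hypothetical. A real run would simulate the model with the linked faults activated instead of using a lookup.

```python
def run_for_all_faults(table, simulate):
    """For each row, simulate with its linked faults and record whether the
    linked detection state was reached."""
    for row in table:
        reached_states = simulate(row["faults"])
        row["status"] = "verified" if row["detection"] in reached_states else "not detected"
    return table


def toy_simulate(faults):
    # Stand-in for a real simulation: a fixed map from fault to the
    # detection state that the (imaginary) control logic reaches.
    detects = {"ego_stuck": "O2_fail", "speed_dropout": "Speed_fail"}
    return {detects[f] for f in faults if f in detects}


fmeca = [
    {"failure_mode": "O2 sensor stuck", "faults": ["ego_stuck"], "detection": "O2_fail"},
    {"failure_mode": "Pressure drift", "faults": ["map_drift"], "detection": "Pressure_fail"},
]

for row in run_for_all_faults(fmeca, toy_simulate):
    print(row["failure_mode"], "->", row["status"])
```

The only manual inputs are the two links per row, the faults and the detection state, which matches the point made in the talk that no further work is needed once those links exist.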
So this way, we can automatically populate the whole FMECA table with all the links inside without any additional work, except for linking faults and detection mechanisms. So with this explanation and the simplified workflow, I'd like to conclude already and give a summary of what we just learned here.
So you saw how you can use Simulink Fault Analyzer, which is a new product that enables everything we just saw in the session. First of all, how to capture the analysis by using templates, linking to the architecture, and performing validation checks. Then, a very important aspect, how to model the faults, again, without modifying the design, and how to specify all the details, the characteristics of the fault behavior, in order to be able to simulate and check whether the fault is correctly detected.
And finally, of course, we can use it to automatically populate the safety tables, like FMECA or FMEA tables, with the results to have one artifact that you can use to show that you've done the whole analysis and already verified that the implementation is working as expected.
So if you want to know more, apart from the Q&A where we have maybe time for a few questions, outside, we also have a booth that has a similar title as this talk. And I'm happy to answer any further questions also there, if we cannot address them here. So thanks for your interest in this topic.
[APPLAUSE]