Does Software have to be Ultra Reliable in Safety Critical Systems?

It is difficult to demonstrate that safety-critical software is completely free of dangerous faults. Prior testing can be used to demonstrate that the unsafe failure rate is below some bound, but in practice, the bound is not low enough to demonstrate the level of safety performance required for critical software-based systems like avionics. This paper argues that higher levels of safety performance can be claimed by taking account of: 1) external mitigation to prevent an accident; 2) the fact that software is corrected once failures are detected in operation. A model based on these concepts is developed to derive an upper bound on the number of expected failures and accidents under different assumptions about fault fixing, diagnosis, repair and accident mitigation. A numerical example is used to illustrate the approach. The implications and potential applications of the theory are discussed.


Introduction
It is difficult to show that software has an acceptably low dangerous failure rate for a safety-critical system. The work of Butler and Finelli [4] and Littlewood and Strigini [21] suggests that there is a limit on the rate that can be demonstrated by testing. Given the difficulty of testing the software under realistic conditions, it is often claimed that this limit is likely to be around 10⁻⁴ to 10⁻⁵ failures per hour [18]. As these tests would typically be completed without any failures, we do not know what proportion of the failures are likely to be dangerous, so we have to use the bound derived from testing as the upper bound for the dangerous failure rate as well.
The safety requirement for a software-based system can be far more stringent than the bound established by testing the software, e.g. a target of 10⁻⁹ per hour for catastrophic failures is required for an individual avionics function [12,13]. The magnitude of the gap between the demonstrable failure rate and such targets can be illustrated in the following example. With a demonstrable catastrophic failure rate of 10⁻⁴/hr per system, 100 critical systems per aircraft and 5×10⁷ flight hours per year for civil jet airliners world-wide, around 500,000 fatal aircraft accidents could occur every year. In practice, statistics on aircraft accidents [2] show that there are around 40 airliner accidents per year worldwide from all causes (including pilot error).
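The arithmetic behind this gap can be checked in a few lines of Python; the figures are those quoted above, and the variable names are ours:

```python
# Worked check of the gap between the demonstrable failure rate and observed reality.
failure_rate = 1e-4           # demonstrable catastrophic failure rate per system-hour
systems_per_aircraft = 100    # critical systems per aircraft
fleet_hours_per_year = 5e7    # world-wide flight hours per year

accidents_per_year = failure_rate * systems_per_aircraft * fleet_hours_per_year
# ~500,000 predicted fatal accidents per year, versus ~40 observed from all causes
```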
It is clear that real avionics systems perform far better than we are entitled to expect based on testing alone [25], but the actual performance can only be determined after the system has been deployed. What we need is a credible argument that would convince a regulator that a software-based system is suitable for use in a safety-critical context before it is deployed in actual operation [11].
One alternative to empirical test evidence is the claim that compliance to established standards for safety-critical software will result in software with a tolerable dangerous failure rate. For example, compliance to IEC 61508 Level 4 is linked with dangerous system failure rates as low as 10⁻⁹/hr [9]. Unfortunately, there is very little empirical evidence to support such a claim.
More credibly, it may be possible to support a claim of perfection if the software is proved correct using formal methods [8]. In this case any failure rate target, even a stringent target like 10⁻⁹ per hour, would be achievable and the Probabilistic Safety Assessment (PSA) of the overall system could assume the software had a dangerous failure rate of zero. In practice however, few systems have been constructed using formal proof methods, and even these systems cannot be guaranteed to be fault free (e.g. due to errors in requirements or faults in the proof tools [6,20]).
Another alternative is a risk-informed based design approach [19] which focuses on reducing dangerous failure modes rather than seeking software perfection. Potentially hazardous failure modes are identified and safeguards against these failures are included in the design. However there is no guarantee that a hazard-based approach will identify all potential hazards in the real-world environment, and a convincing safety argument would need to show that the hazard identification is complete.
Safety assurance can also be achieved by the use of fault tolerance techniques [1,14] like design diversity [22] that mitigate failures from individual software components. Software design diversity can reduce the dangerous failure rate of the composite system, as the same failure has to occur in more than one software component before it becomes dangerous. These techniques have been used in a range of safety-critical systems [3,15].
It should be noted that all these strategies for producing safe software are vulnerable to an error in the original specification, i.e. when there is a mismatch between the software requirement and the real world need. This unfortunately also limits the potential for accelerated testing of software against the requirement to reduce the dangerous failure rate bound as the tests will omit the same key features of real-world behaviour.
In practice, systems designers use the defence-in-depth principle to mitigate the impact of dangerous failures in subsystems [7,10,20], for example:
• A nuclear protection system failure is covered by an independently designed secondary protection system, manual shutdown and post-incident control measures.
• A flight control system failure is covered by diverse flight controls and pilot intervention.
As a result of these strategies, the dangerous failure rate of the system function can be lower than that of any individual software component. This can be formalized as:

λ_acc = p_acc · λ  (1)

where λ_acc is the accident rate, p_acc is the probability that a dangerous subsystem failure will cause an accident, and λ is the bound on the dangerous software failure rate established by testing.
The problem lies in achieving and justifying an appropriate value of p_acc. Empirical studies of software fault tolerance techniques like design diversity [16,26] suggest that reductions of no more than two orders of magnitude can be achieved, while human intervention under stress may only result in another order of magnitude reduction in the dangerous failure rate, giving a total reduction of 10⁻³. If a dangerous failure rate of 10⁻⁴/hr can be demonstrated from actual flight testing, it might be argued that the accident rate due to failure of the avionics subsystem is no worse than 10⁻⁷/hr, but this could still be insufficient to demonstrate the required target (e.g. 10⁻⁹/hr for an avionics function).
In this paper we will present a probabilistic software failure model that can be used to claim a lower contribution to the accident rate from dangerous software faults. This approach is novel and potentially controversial as it requires certification bodies to accept an argument based on a low average risk over the system lifetime, but with the possibility of a higher instantaneous risk when the system is first introduced.

Basic Concept
Software failures need to be handled in a different way to hardware failures because a systematic software defect can be fixed: once we know what the problem is, it can be removed. In the best case, each software fault need only fail once if it is successfully fixed in all instances immediately after failure, so we consider that it is inappropriate to use a fixed failure rate for software in a safety justification. We also need to take account of the fact that software failures need not be catastrophic (i.e. cause an accident), because there can be mitigations outside the software-based component.
In the most basic representation of this idea, we consider the failures caused by a single fault in the software (the impact of multiple faults will be considered later).
Clearly the number of failures that occur before the fault is successfully fixed depends on the probability of diagnosing a fault and then repairing it correctly [24]. In the basic model, we make the following assumptions.
• The conditional probability that a fault is diagnosed when a software failure occurs is p_diag.
• The conditional probability that a fault is repaired correctly after diagnosis is p_repair.
• The success of diagnosis and repair is independent of the number of previous failures.
• No further failures can occur in any software instance until the fix attempt has finished.
Given the independence assumptions made on diagnosis and repair, the probability of a fault being successfully fixed after each failure is:

p_fix = p_diag · p_repair  (2)

Given the assumption that no failures can occur during a fix, a simple two-state transition model (a Fail state transitioning to the OK state with probability p_fix) can be used to model the fixing process, as illustrated in Fig. 1.

Fig. 1. Fault correction model
So for p_fix = 0.5, there is a 50% chance of removing a fault after the first failure; 25% after the second failure; and so on. The mean length of this failure sequence, n_fail, is:

n_fail = Σ_{i=1…∞} i · p_fix · (1 − p_fix)^(i−1)  (3)

Using a standard result for this geometric series [27], this reduces to:

n_fail = 1 / p_fix  (4)

The other element of the model is based on the fact that safety-related computer-based systems typically operate within a fault-tolerant architecture (as discussed earlier). We can represent this external mitigation of a dangerous software failure by the probability p_acc that an accident occurs after a dangerous software failure.
For a single fault, the expected number of accidents is bounded by p_acc · n_fail, i.e.:

n_acc ≤ p_acc / p_fix  (5)

It follows that the expected number of accidents over the lifetime of the fleet due to N dangerous faults is bounded by:

n_acc ≤ N · p_acc / p_fix  (6)

This equation implies that if N << p_fix / p_acc, the expected number of accidents n_acc << 1. This represents the case where there is a high probability that all N dangerous faults will be diagnosed and removed before the first accident occurs. A value of n_acc well below unity is effectively equivalent to the probability of an accident over the lifetime of the fleet due to failures of the software component.
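As a sanity check, the basic model can be simulated directly. The sketch below is our illustration, not code from the paper: each dangerous fault fails repeatedly until a fix attempt succeeds (probability p_fix per failure, per the Fig. 1 model), and each failure independently causes an accident with probability p_acc. The sample means should approach n_fail = 1/p_fix and n_acc = N · p_acc / p_fix.

```python
import random

def simulate_fleet(n_faults, p_fix, p_acc, trials=100_000, seed=1):
    """Monte Carlo check of the basic fault-fixing model."""
    rng = random.Random(seed)
    total_failures = 0
    total_accidents = 0
    for _ in range(trials):
        for _ in range(n_faults):
            while True:                      # one failure sequence per fault
                total_failures += 1
                if rng.random() < p_acc:     # external mitigation fails
                    total_accidents += 1
                if rng.random() < p_fix:     # fault diagnosed and repaired
                    break
    return total_failures / trials, total_accidents / trials

mean_failures, mean_accidents = simulate_fleet(n_faults=1, p_fix=0.1, p_acc=1e-3)
# mean_failures should be close to 1/p_fix = 10,
# mean_accidents close to N * p_acc / p_fix = 0.01
```

With the avionics parameters used later in the paper (p_fix = 0.1, p_acc = 10⁻³), the simulated means agree with equations (4) and (6).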
Given the assumption that no further software failures occur during a fix attempt, the failure rate of the software has no impact on the maximum number of accidents in the fleet. This assumption could be satisfied if all failures resulted in an instantaneous fix attempt, or more realistically, it could be met if the fleet was grounded immediately after failure while the fix attempt is made.
This independence between the upper bound on failure rate and the number of accidents is particularly useful in cases where the failure rate bound has not been estimated correctly, e.g. due to a flaw in the specification. Such a flaw would invalidate any failure rate estimate based on testing, but the accident bound derived from equation (6) would still be valid provided N included an estimate for dangerous specification flaws. This differs from hardware where the instantaneous failure rate is often assumed to be constant, so expected accidents always increase with fleet usage.

Impact of Delayed Fixing
The basic model makes a strong assumption that no further failures will occur after a dangerous failure is observed. In many cases however, the fleet containing the software components will continue to operate after the failure has occurred. Clearly, if repair is delayed, further failures could occur within the fault fixing interval. Initially we will consider the case of a single fault (extension to N faults will be addressed later). Let us define:

λ as the upper bound on the software failure rate
Δt_fix as the time needed to perform diagnosis and repair
τ(t) as the total execution time of the software fleet at elapsed time t

We further assume that:
• No new faults are introduced when the software is fixed.
• The failure rate bound λ is unchanged if the fix attempt is not successful.
These assumptions are also quite strong. New faults are known to be introduced occasionally, but if the new fault is in the same defective code section, it can be modeled as the same fault with a reduced p_fix value. The second assumption is conservative if the rate actually decreases after a repair (e.g. due to a partial fix). An increase in failure rate would not be conservative, but it might be argued that the rate is bounded by the execution probability of the faulty code section. The assumption that the failure rate is unchanged by unsuccessful fixes makes this process mathematically equivalent to fixing a fault with failure rate λ with probability p_fix at a time Δt_fix after a failure was observed.
To estimate the impact of delayed fixing, we first define the time needed before an average length failure sequence terminates, t_fail, as the value that satisfies the equation:

λ · τ(t_fail) = n_fail  (7)

From equation (4), this is equivalent to:

λ · τ(t_fail) = 1 / p_fix  (8)

It can be shown using Jensen's Inequality [17] that, if the execution time function τ(t) is convex (i.e. the gradient is constant or increases over calendar time), the total number of failures, n_fixed, when the fix delay is included is bounded by:

n_fixed ≤ λ · τ(t_fail + Δt_fix)  (9)

We now consider a situation where there are N dangerous faults. Most reliability models assume the failures of the individual faults occur independently. If this assumption is made, the failure rates sum to λ, but we will take a worst case scenario where:
• The failure rate of each fault is λ.
• Failures occur simultaneously for N faults.
• Only one fault can be fixed after a failure.
In this worst case scenario there will be N times more failures than for a single fault, and the failures will occur at the same frequency, λ, as in the single fault case. This failure sequence is equivalent to having a single fault with a p_fix probability that is N times worse, i.e. where:

p′_fix = p_fix / N  (10)

From equation (4), the mean failure sequence length becomes:

n′_fail = 1 / p′_fix = N / p_fix  (11)

The bound in equation (9) can therefore be generalised to N faults as:

n_fixed ≤ λ · τ(t′_fail + Δt_fix)  (12)

So from equation (10), t′_fail has to satisfy the relation:

λ · τ(t′_fail) = N / p_fix  (13)

It follows that the worst scale factor k due to delayed fixing is:

k = τ(t′_fail + Δt_fix) / τ(t′_fail)  (14)

Application to Demand-based Systems. Equation (6) is directly applicable to any demand-based system if the fleet is grounded during a fix attempt. The increase k caused by delayed fixing can be calculated for demand-based systems using equations (13) and (14) provided we can derive an equivalent failure rate bound λ.
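To make the delayed-fixing calculation concrete, the scale factor can be computed numerically. The sketch below is our illustration: it solves λ · τ(t′_fail) = N / p_fix by bisection and then forms the ratio τ(t′_fail + Δt_fix) / τ(t′_fail) used as the worst-case scale-up. The quadratic τ(t) and the constant c = 100 are assumed values for illustration, not taken from the paper.

```python
def solve_t_fail(tau, lam, n_faults, p_fix, t_hi=1e6):
    """Find t such that lam * tau(t) = n_faults / p_fix, by bisection.
    Assumes tau is increasing with tau(0) = 0."""
    target = n_faults / p_fix
    lo, hi = 0.0, t_hi
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if lam * tau(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def scale_factor_k(tau, lam, n_faults, p_fix, dt_fix):
    """Worst-case scale-up in expected failures due to a fix delay dt_fix."""
    t = solve_t_fail(tau, lam, n_faults, p_fix)
    return tau(t + dt_fix) / tau(t)

# Convex fleet execution time from linear fleet growth: tau(t) = c * t**2 / 2,
# with assumed c = 100 unit-hours per month^2, lam = 1e-4 failures per hour,
# a single fault, p_fix = 0.1, and a 3 month fix delay.
tau = lambda t: 100.0 * t * t / 2.0
k = scale_factor_k(tau, lam=1e-4, n_faults=1, p_fix=0.1, dt_fix=3.0)
```

Because τ(t) here is quadratic, the result can be checked analytically: t′_fail = √(2 · 10 / (100 · 10⁻⁴)) ≈ 44.7 months, giving k ≈ 1.14 for these assumed parameters.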
For example, if we know there is an average demand rate of d demands per unit of execution time and the upper bound on the probability of failure on demand is f (e.g. from accelerated testing), the effective failure rate per unit of execution time is:

λ = f · d  (17)
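For instance, with an assumed demand rate and per-demand failure bound (illustrative numbers only, not from the paper), the conversion is:

```python
d = 50.0     # assumed demands per hour of execution
f = 2e-6     # assumed upper bound on probability of failure per demand
lam = f * d  # effective failure rate: 1e-4 failures per hour of execution
```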

Theory Applied to an Avionics Example
To illustrate the potential gains achievable by including fault fixing, we will apply the theory to a hypothetical avionics example with the following parameters:

• number of dangerous faults N = 1
• software failure rate bound λ = 10⁻⁴ failures per hour
• probability of an accident after a dangerous failure p_acc = 10⁻³
• fault fix probability p_fix = 0.1

Note that the figures used are only estimates, but are considered to be realistic. The number of dangerous faults N is taken to be one as we assume thorough levels of testing and analysis (especially for the safety-critical portions of the software). The assumed failure rate represents a year of realistic flight testing (e.g. in ground based tests and actual test flights). The probability of an accident p_acc is assumed to be small because an aircraft is engineered to tolerate failures of specific components (via standby systems, pilot override, etc). The p_fix probability is actually the product of the diagnosis and repair probabilities, i.e. p_fix = p_diag · p_repair. For critical software we expect the repair probability achieved by the software support team to be close to unity, so p_fix is largely determined by the diagnosis probability, which is estimated to be around 0.1 as any hazardous incidents will occur in-flight, and diagnosis relies on later reconstruction of events based on in-flight recording data. If the fleet is grounded after a dangerous failure, the basic model applies and we would expect 10 failures (from equation (3)) and 0.01 accidents (from equation (6)) over the fleet lifetime.
If there is delayed fixing, the k value has to be computed using the execution time equation. With a linear growth in the fleet of avionics units at c units per month, the execution time function can be shown to be:

τ(t) = c · t² / 2  (18)

We can use equations (13) and (14) and the execution time function (18) to compute the scale-up factor k. The impact of different fix delay times (Δt_fix) on k and the expected number of accidents is shown in Table 1. The corresponding bound on the mean accident rate over the fleet lifetime is:

λ_acc ≤ k · N · p_acc / (p_fix · T_fleet)  (19)

where T_fleet is the total execution time of all the avionics units.
From equation (18), if unit sales continued for 5 years, the total number of operating hours, T_fleet, is around 1.6×10⁷ hours. So the upper bound on the mean accident rate λ_acc over the fleet lifetime for a 3 month fix delay is:

λ_acc ≤ 1.3×10⁻⁹ accidents per hour

This bound on the mean accident rate is close to the target of 10⁻⁹ accidents per hour required in avionics standards [12,13]. The bound could be reduced to less than 10⁻⁹ accidents per hour if a shorter fix delay is used (e.g. 1 month).
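The quoted figure can be reproduced from the lifetime-average relation λ_acc ≤ k · N · p_acc / (p_fix · T_fleet). In the sketch below (our illustration), the value k ≈ 2.1 for the 3 month delay is inferred from the quoted bound, since Table 1 itself is not reproduced here; the other parameters are those of the example.

```python
def mean_accident_rate(k, n_faults, p_acc, p_fix, t_fleet):
    """Upper bound on the lifetime-average accident rate per hour."""
    return k * n_faults * p_acc / (p_fix * t_fleet)

rate = mean_accident_rate(k=2.1, n_faults=1, p_acc=1e-3, p_fix=0.1, t_fleet=1.6e7)
# ~1.3e-9 accidents per hour, consistent with the bound quoted above
```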
By comparison, if we only relied on external accident mitigation, the bound on the mean accident rate would be the same as the initial rate p_acc · λ. For the avionics example, the expected rate would be 1×10⁻⁷ accidents per hour.
So for this choice of model parameters, the inclusion of a fault removal model has reduced the expected accident rate over the fleet lifetime by two orders of magnitude. Clearly the reduction varies with the parameters used. Table 2 shows the accident reduction r achieved by fault fixing for different failure rates assuming a 3 month delay in fixing and a 5 year operating period. It is apparent that the greatest reduction occurs when the software failure rate λ is high. This is not surprising as the mean accident rate is relatively stable (regardless of λ) when there is fault fixing, but it increases linearly with λ without fault fixing.

Discussion
The fault fixing model predicts an upper bound on the total number of software failures (and associated accidents) over the fleet lifetime. The impact of fault fixing is greatest in large fleets, where the expected number of failures without fixing would greatly exceed the expected number with fault fixing. In the avionics example, the claimed accident rate can be two orders of magnitude less than would be predicted from testing alone. In particular, once a relatively modest (and demonstrable) level of software reliability is achieved, further reductions in failure rate make little difference to the ultimate number of failures and accidents.

This type of probabilistic argument is not currently accepted in safety standards or by certification and regulatory bodies. Early users of the system could be placed at greater risk if the instantaneous failure rate is close to the limit established by testing. However it might be more acceptable as a support to a primary argument such as a claim of zero faults in critical portions of the software. The supporting argument would be that, even if the claim of zero dangerous faults is invalid, there is high probability that a software fault never causes any accident over the lifetime of the fleet (e.g. 98% in our avionics example).
If the theory is valid, equation (6) can also be helpful in choosing design trade-offs. We note that an order of magnitude change in the predicted number of accidents can be achieved by an order of magnitude change in any of N, p_diag or p_acc. Knowing the contribution of these parameters, design trade-offs can be based on cost and technical feasibility. For example, sending extra data to a shared black-box data recorder to improve p_diag might be more cost-effective than additional effort to reduce N. Alternatively, installing a backup system using different technology might double the cost but improve p_acc by orders of magnitude.
The theory also shows that the operational context can affect the accident probability. Obviously the diagnosis probability p_diag directly affects the number of accidents, and we can minimise the scale-up factor k due to delayed fixing by considering equations (13) and (14). For example, k might be reduced by decreasing the fix delay time Δt_fix or by reducing the growth in usage τ(t) for some trial period.
To successfully apply the model, evidence will be needed to show that the model parameter estimates are either realistic or at least conservative. Values like N could be derived from past experience with similar systems (e.g. by analysing FAA directives [5,25]), but further research is needed on quantifying the model parameters.
More generally, the same theory should be applicable to any systematic design fault that is amenable to fault fixing such as software security vulnerabilities, hardware design faults or requirements faults.

Summary and Conclusions
This paper has presented a basic fault-fixing model that shows there is an upper bound on the expected number of dangerous software failures if faults are diagnosed and fixed. If the fault fixing is immediate, this bound is independent of the software failure rate. When this bound on failures is combined with external failure mitigation, there can be a high probability that an accident is never caused by dangerous software failure regardless of the size of the fleet using the software-based component.
We have also presented a refinement of the basic model that allows the bound to be increased to allow for a delay in fixing a detected fault. This revised bound is dependent on the software failure rate, but the increase is typically quite small.
The theory was illustrated by an aircraft avionics example where fault fixing reduced the expected number of accidents by around two orders of magnitude over the fleet lifetime.
If the assumptions behind the theory are valid, it could provide an additional means of arguing that critical software-based systems are safe prior to deployment even though ultra high reliability of the software cannot be demonstrated by prior testing.
The theory might also be helpful in making design and support trade-offs to minimize the probability of an accident.
We also suggest that the theory could be applicable to any systematic fault (e.g. in requirements, hardware, software or mechanical components).
Further empirical research is recommended to validate the model assumptions and quantify the model parameters.