Deriving a frequentist conservative confidence bound for probability of failure per demand for systems with different operational and test profiles

Reliability testing is typically used in demand-based systems (such as protection systems) to derive a confidence bound for a specific operational profile. To be realistic, the number of tests for each class of demand should be proportional to the demand frequency of the class. In practice however, the actual operational profile may differ from that used during testing. This paper provides a means for estimating the confidence bound when the test profile differs from the profile used in actual operation. Based on this analysis the paper examines what bound can be claimed for different types of profile uncertainty and options for dealing with this uncertainty. We also show that the same conservative bound estimation equations can be applied to cases where different measures of software test coverage and operational profile are used.


Introduction
Nuclear protection systems are designed to protect against a range of safety-related plant incidents (known as postulated initiating events or PIE). A PIE can affect one or more plant parameters (such as temperature, pressure and neutron flux). These plant parameters are monitored by the protection system and the reactor is tripped if the plant parameters go outside the safe operational envelope. In the UK, a probabilistic safety assessment (PSA) is required to justify the safety of nuclear plant. As part of this process, the performance of the protection system must be quantified in terms of probability of failure on demand, pfd, where the demand can be any of the PIE events. There are accepted means for estimating the pfd arising from hardware failures, but we also need to include an estimate for the pfd of the software if the protection system is computer-based.
Statistical reliability testing [1,2] is one means of estimating the software pfd of a demand-based system to some confidence bound, and it is recommended in functional safety standards such as IEC 61508 [3]. For example, reliability testing was performed as part of the independent confidence building programme required by the UK Office for Nuclear Regulation (ONR) for the computer-based primary protection system (PPS) at Sizewell B nuclear power station [4]. The PPS was subjected to 5000 simulated demands to support a pfd claim of $10^{-3}$. Reliability testing is also planned for new nuclear power stations to be installed in the UK [5].
The confidence bound derived from statistical reliability testing is based on a number of modelling assumptions. The stated assumptions in IEC 61508 [3] for the low demand rate case are:
1. The test data distribution is equal to the distribution of demands during on-line operation.
2. Test runs are statistically independent from each other, with respect to the cause of a failure.
3. An adequate mechanism exists to detect any failures which may occur.
4. Number of test cases N > 100.
5. No failure occurs during the N test cases.
The second assumption can be met in the protection system context as the protection system is normally reset after a reactor trip (so the software always starts from the same initial state).
The third assumption requires a perfect "oracle" that determines if a failure has occurred. The required response is relatively easy to determine for PIE events in a nuclear plant since each simulated PIE is expected to result in a reactor trip.
The last two assumptions will also be met in a nuclear protection context as many thousands of tests are needed for the required pfd and the software has to be corrected and retested from scratch if a failure is observed.
To satisfy assumption 1, the number of tests for each class of demand (i.e. for each PIE) should be proportional to the demand frequency of that class during operation, so the confidence bound estimate cannot be used if the test and operational profiles differ. This paper presents a means for estimating the confidence bound when the test profile differs from the profile used in actual operation. Based on this analysis, the paper examines what bound can be claimed for different types of profile uncertainty and the options for dealing with this uncertainty.
We also show that the same conservative bound equations can be applied in contexts where the software reliability bound and input profile are characterised in different ways.

Problem Statement
If a system is subjected to N test demands without failure [1], we can follow the approach suggested by Neyman and Pearson [6], Neyman [7], Clopper and Pearson [8] as it is presented by Wang [9] and identify an upper confidence bound, q, on the probability of failure on demand Q to a confidence 1 − α as the largest value such that the hypothesis "H0 : Q = q" is not rejected against the alternative "H1 : Q < q" at the significance level α.
Thus, q must satisfy the following equation:
$$(1 - q)^N = \alpha \qquad (1)$$
However, it is often the case that the system handles different classes of demand, e.g. a protection system that protects against different PIE events. These demand classes are assumed to be disjoint, i.e. only a single demand can occur at any point in time.
Testing over a series of classes can be characterised by a test plan vector:
$$\mathbf{n} = (n_1, \ldots, n_m)$$
where m is the number of demand classes, $n_i$ is the number of tests for demand class i, and the total number of tests is:
$$N = \sum_{i=1}^{m} n_i$$
The distribution of tests over the demand classes can be characterised by a test distribution profile vector:
$$\hat{\mathbf{p}} = (\hat{p}_1, \ldots, \hat{p}_m), \quad \text{where } \hat{p}_i = n_i / N, \; i = 1 \ldots m$$
When this multiple demand class system is used in operation it will be subject to an operational profile:
$$\mathbf{p} = (p_1, \ldots, p_m), \quad \sum_{i=1}^{m} p_i = 1$$
Ideally the operational and test profile distributions will match so that $\mathbf{p} = \hat{\mathbf{p}}$. However, in practice the operational profile $\mathbf{p}$ will vary if the system is used in different environments or there is uncertainty in the likelihood of different external events. So we need some means to determine a bound $q_s$ to some confidence $(1 - \alpha)$ for a different operational profile $\mathbf{p}$ given a prior set of tests $\mathbf{n}$.

Problem Formulation
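For some (unknown) vector of demand class pfds $\mathbf{q} = (q_1, \ldots, q_m)$, the likelihood of observing no failures with test plan $\mathbf{n}$ is:
$$L(\mathbf{q}, \mathbf{n}) = \prod_{i=1}^{m} (1 - q_i)^{n_i} \qquad (7)$$
The $(1 - \alpha)$ confidence area for all possible pfd vectors, $\mathbf{q}$, is
$$A = \{\mathbf{q} \in [0,1]^m : L(\mathbf{q}, \mathbf{n}) \ge \alpha\} \qquad (8)$$
For an arbitrary vector of demand class pfds $\mathbf{q}$ and operational profile $\mathbf{p}$, the system pfd, $Q_S$, is simply the weighted average of the vector of q values, i.e.
$$Q_S(\mathbf{q}, \mathbf{p}) = \sum_{i=1}^{m} p_i q_i \qquad (9)$$
The confidence area (8) constrains the set of permissible $\mathbf{q}$ vectors and induces a confidence interval for $Q_S$ with the upper bound:
$$q_s = \max_{\mathbf{q} \in A} Q_S(\mathbf{q}, \mathbf{p}) \qquad (10)$$
We therefore need a method for solving (10) for an arbitrary demand profile $\mathbf{p}$.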
It is straightforward to solve (10) numerically for any profile p and test vector n. However a numerical analysis does not permit any general conclusions to be drawn about the impact of changes in the operational profile p.
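One way to carry out such a numerical solution is sketched below. It is our own illustration, not the procedure used by the authors. It assumes the problem is "maximise $\sum_i p_i q_i$ subject to $\prod_i (1 - q_i)^{n_i} \ge \alpha$ and $0 \le q_i \le 1$". Because the log-likelihood is concave, the feasible set is convex and the Kuhn-Tucker conditions give $q_i = \max(0, 1 - \mu n_i / p_i)$ for a single multiplier $\mu$, which the code finds by bisection. The five-class test vector in the usage example is hypothetical.

```python
import math


def exact_bound(n, p, alpha, iterations=200):
    """Numerically maximise sum(p_i * q_i) over the exact confidence area."""
    if len(n) != len(p):
        raise ValueError("n and p must have the same length")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")

    # A class with no tests is unconstrained, so its worst-case pfd is 1.
    untested = sum(pi for ni, pi in zip(n, p) if ni == 0)
    tested = [(ni, pi) for ni, pi in zip(n, p) if ni > 0 and pi > 0]
    if not tested:
        return untested

    log_alpha = math.log(alpha)

    def log_likelihood(mu):
        # For an active class, 1 - q_i = mu * n_i / p_i; inactive classes have q_i = 0.
        total = 0.0
        for ni, pi in tested:
            survival = mu * ni / pi
            if survival < 1.0:
                total += ni * math.log(survival)
        return total

    # log_likelihood increases with mu and is 0 at the upper end of the bracket.
    lo, hi = 0.0, max(pi / ni for ni, pi in tested)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if log_likelihood(mid) < log_alpha:
            lo = mid
        else:
            hi = mid

    # hi is on the feasible side of the constraint.
    return untested + sum(pi * max(0.0, 1.0 - hi * ni / pi) for ni, pi in tested)


if __name__ == "__main__":
    n = [600, 300, 6, 594, 1500]          # hypothetical test vector, N = 3000
    p = [0.2, 0.1, 0.102, 0.198, 0.4]     # hypothetical operational profile
    print(exact_bound(n, p, alpha=0.05))
```

As the text notes, a routine like this gives a number for any particular p and n but says nothing general about how the bound responds to changes in the profile.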
With an analytic derivation of the confidence bound, we can model the impact of a mismatch between the test profile and the actual demand profile and identify general strategies for designing test profiles that reduce the sensitivity of the bound to uncertainties in the operational profile.
The next section describes the approach we developed to derive an analytic solution for the confidence bound.

Solution Approach
In Appendix A we use Lagrangian multipliers to identify the stationary points that represent the potential solutions to (10), but the solution space is complex. There are $2^m - 1$ stationary points and the optimal point depends on the specific values used in p and n. As a result, there is no simple analytic solution that can be applied to all operational profiles. So we developed an alternative approach for obtaining an analytic solution by deriving a conservative approximation for (10) that makes the problem easier to solve.
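In this reformulation, the likelihood (7) is approximated as:
$$\tilde{L}(\mathbf{q}, \mathbf{n}) = \prod_{i=1}^{m} e^{-n_i q_i} = \exp\Big(-\sum_{i=1}^{m} n_i q_i\Big) \qquad (11)$$
It is a standard result [10] that
$$1 - x \le e^{-x}$$
Thus
$$L(\mathbf{q}, \mathbf{n}) \le \tilde{L}(\mathbf{q}, \mathbf{n})$$
so any $\mathbf{q}$ in the exact confidence area also satisfies the approximated constraint
$$\tilde{L}(\mathbf{q}, \mathbf{n}) \ge \alpha \qquad (14)$$
Therefore, the approximated confidence area
$$\tilde{A} = \{\mathbf{q} \in [0,1]^m : \tilde{L}(\mathbf{q}, \mathbf{n}) \ge \alpha\} \qquad (15)$$
is a superset of the exact confidence area, i.e.
$$A \subseteq \tilde{A}$$
As a result, the approximate solution $\tilde{q}_s$ (the maximum of $Q_S$ over $\tilde{A}$) will always be conservative relative to the exact solution, i.e. for a given α, n, p
$$q_s \le \tilde{q}_s$$
The log of the approximated likelihood (11) is
$$\ln \tilde{L}(\mathbf{q}, \mathbf{n}) = -\sum_{i=1}^{m} n_i q_i$$
So the log of constraint (14) can be rearranged to become:
$$\sum_{i=1}^{m} n_i q_i \le \ln\frac{1}{\alpha} \qquad (20)$$
If we define $\Delta q_i = q_i p_i$ then constraint (20) can be reformulated as
$$\sum_{i=1}^{m} \frac{n_i}{p_i} \Delta q_i \le \ln\frac{1}{\alpha}, \quad 0 \le \Delta q_i \le p_i \qquad (21)$$
and equation (9) becomes:
$$Q_S(\mathbf{q}, \mathbf{p}) = \sum_{i=1}^{m} \Delta q_i \qquad (22)$$
In this reformulation, the goal is to choose a set of $\Delta q_i$ values that maximise (22) subject to constraint (21). Constraint (21) is a simple linear constraint for $\Delta q_i$. To maximise (22) we need to assign $\Delta q$ values to the demand class that makes the smallest contribution to reaching the upper limit of $\ln(1/\alpha)$. So the procedure for maximising $\sum_{i=1}^{m} \Delta q_i$ is:
• Order the demand classes i in terms of increasing $n_i / p_i$ value.
• Assign $\Delta q_i$ up to the limit of $p_i$ for each demand class in turn until the confidence bound, $\ln(1/\alpha)$, is reached.
• In general, the last class with $\Delta q_i > 0$ can only be "filled" partially, i.e. $\Delta q_i < p_i$.
• The $\Delta q_i$ values for the remaining demand classes are set to zero.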
This assignment strategy is illustrated in Figure 1. It is a variant of the "bin-filling" strategy used in [11] where there is a worst case demand "bin" and this bin should be filled first.
There is an exception to this filling rule when several bins have an identical effect i.e. when the n i /p i values for the bins are the same. For equivalent bins, it does not matter how the ∆q i values are distributed amongst the bins provided the limit p i is respected for each bin.
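The bin-filling procedure described above can be sketched in a few lines. This is an illustration under our own assumptions: the function name `bin_filling_bound` is hypothetical, the linear constraint is taken to be $\sum_i (n_i/p_i)\,\Delta q_i \le \ln(1/\alpha)$ with $0 \le \Delta q_i \le p_i$, and the test vectors in the usage example are invented.

```python
import math


def bin_filling_bound(n, p, alpha):
    """Maximise sum(dq_i) subject to sum((n_i/p_i) * dq_i) <= ln(1/alpha), 0 <= dq_i <= p_i."""
    if len(n) != len(p):
        raise ValueError("n and p must have the same length")
    budget = math.log(1.0 / alpha)

    # Each bin is (gradient n_i/p_i, capacity p_i); classes with p_i = 0 contribute nothing.
    bins = sorted((ni / pi, pi) for ni, pi in zip(n, p) if pi > 0)

    total = 0.0
    for gradient, capacity in bins:
        cost_to_fill = gradient * capacity   # equals n_i
        if cost_to_fill <= budget:
            # The whole bin fits: dq_i = p_i.
            total += capacity
            budget -= cost_to_fill
        else:
            # Last bin: filled only partially, then the budget is exhausted.
            total += budget / gradient
            break
    return total


if __name__ == "__main__":
    # Hypothetical five-class test vector (N = 3000) and a shifted profile.
    n = [600, 300, 6, 594, 1500]
    p = [0.2, 0.1, 0.102, 0.198, 0.4]
    print(bin_filling_bound(n, p, alpha=0.05))

    # Hypothetical case where the first bin fills completely and a second is needed.
    print(bin_filling_bound([1, 1000], [0.01, 0.99], alpha=0.05))
```

Sorting by $n_i/p_i$ means ties are filled in arbitrary order, which is harmless because equivalent bins contribute identically to the total.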
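If we assume that only a single bin k needs to be filled, where:
$$k = \arg\min_i \frac{n_i}{p_i}$$
then equation (22) reduces to $Q_S(\mathbf{q}, \mathbf{p}) = \Delta q_k$ and hence constraint (21) reduces to
$$\frac{n_k}{p_k} \Delta q_k \le \ln\frac{1}{\alpha}$$
So the confidence bound, $\tilde{q}_s$, is:
$$\tilde{q}_s = \frac{p_k}{n_k} \ln\frac{1}{\alpha} \qquad (26)$$
This upper bound still applies if more than one bin needs to be filled, but $\tilde{q}_s$ would be unattainable within the approximated confidence area (15) and hence be too pessimistic. This is illustrated in Figure 2.
In practice, however, the single bin assumption can be fulfilled. Since $\Delta q_k = q_k p_k$, the bound (26) implies:
$$q_k = \frac{1}{n_k} \ln\frac{1}{\alpha} \qquad (27)$$
Since bin k will be at most part-filled if the value given by equation (27) satisfies $q_k \le 1$, it follows that the single bin criterion can be met if:
$$n_k \ge \ln\frac{1}{\alpha} \qquad (28)$$
For example, for 99% confidence, $\ln(1/\alpha) = 4.6$, so a minimum of 5 demands on that bin would ensure that the "single bin" constraint (28) is met.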
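The single-bin result lends itself to a direct calculation. The sketch below assumes the closed form $\tilde{q}_s = (p_k/n_k)\ln(1/\alpha)$ for the class k with the smallest $n_i/p_i$, and checks the criterion $n_k \ge \ln(1/\alpha)$. The function name and the example vectors are hypothetical.

```python
import math


def single_bin_bound(n, p, alpha):
    """Return (k, bound, criterion_met) for the single-bin conservative bound."""
    if len(n) != len(p):
        raise ValueError("n and p must have the same length")
    budget = math.log(1.0 / alpha)

    # Worst-case bin: smallest n_i / p_i among classes that can occur.
    candidates = [i for i in range(len(n)) if p[i] > 0]
    k = min(candidates, key=lambda i: n[i] / p[i])
    if n[k] == 0:
        raise ValueError("worst-case class has no tests; the formula is undefined")

    bound = p[k] * budget / n[k]
    # The bin is at most part-filled when n_k >= ln(1/alpha).
    return k, bound, n[k] >= budget


if __name__ == "__main__":
    # 99% confidence: ln(1/alpha) is about 4.6, so 5 tests on the worst bin suffice.
    print(single_bin_bound([5, 995], [0.01, 0.99], alpha=0.01))
    print(single_bin_bound([4, 996], [0.01, 0.99], alpha=0.01))
```

When the criterion fails, the returned value is still an upper bound but, as noted above, it is more pessimistic than the bound obtained by filling several bins.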

Approximation Accuracy
The analytic solution, $\tilde{q}_s$, is a conservative approximation to the true confidence bound $q_s$.
Appendix A.3 makes a comparison between a Lagrangian solution for the exact non-linear programming model in equations (7)-(10) and the approximate log-linear solution given in equation (26). The Lagrangian analysis derives a set of stationary points that represent potential solutions. For the equivalent of the single bin case, where for some p, n there is a worst case bin k that determines the result, the stationary point gives a lower bound estimate:
$$q^*_s = p_k \left(1 - \alpha^{1/n_k}\right)$$
where bin k is the demand class with the minimum gradient value $n_k / p_k$. So the error relative to the exact bound $q_s$ is constrained by
$$\frac{\tilde{q}_s}{q_s} \le \frac{\tilde{q}_s}{q^*_s} = \frac{\ln(1/\alpha)}{n_k \left(1 - \alpha^{1/n_k}\right)} \qquad (30)$$
It is clear from (30) that the approximation error could be high if the $n_k$ value is small. However, the bound estimation error need not be large if the test strategy is designed to avoid having a worst case bin k where $n_k$ is small. The design of test strategies is discussed in more detail in Section 7. A less conservative bound on the approximation error can be found in Appendix A.4.
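To see how quickly the approximation error shrinks as $n_k$ grows, the ratio can be tabulated. The sketch assumes the single-bin lower estimate $q^*_s = p_k(1 - \alpha^{1/n_k})$ and the conservative bound $\tilde{q}_s = (p_k/n_k)\ln(1/\alpha)$, so that $p_k$ cancels in the ratio. The function name is ours.

```python
import math


def max_relative_error(n_k: int, alpha: float) -> float:
    """Upper limit on (q_tilde - q_exact) / q_exact for a worst-case bin with n_k tests."""
    if n_k <= 0:
        raise ValueError("n_k must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    budget = math.log(1.0 / alpha)
    # 1 - alpha**(1/n_k), computed with expm1 to stay accurate for large n_k.
    lower = -math.expm1(-budget / n_k)
    return budget / (n_k * lower) - 1.0


if __name__ == "__main__":
    for n_k in (3, 6, 30, 300, 3000):
        print(n_k, round(100.0 * max_relative_error(n_k, 0.05), 2), "%")
```

For small arguments the ratio behaves like $1 + \ln(1/\alpha)/(2 n_k)$, so the over-estimate falls roughly in inverse proportion to the number of tests on the worst-case bin.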

Modelling the Impact of a Profile Mismatch
With an analytic approximation to the confidence bound we can examine the impact of a mismatch between the test profile and the actual demand profile.
As noted earlier, if there is an exact match between the test profile $\hat{\mathbf{p}}$ and the demand profile $\mathbf{p}$ then we obtain the standard statistical test result
$$\tilde{q}_s = \frac{1}{N} \ln\frac{1}{\alpha}$$
For a mismatched profile $\mathbf{p} \ne \hat{\mathbf{p}}$
$$\tilde{q}_s = \frac{p_k}{n_k} \ln\frac{1}{\alpha} = \frac{p_k}{\hat{p}_k} \cdot \frac{1}{N} \ln\frac{1}{\alpha} \qquad (32)$$
where bin i = k maximises the ratio $p_i / \hat{p}_i$. As a result, we can define a scale factor S that represents the scale-up in the confidence limit $\tilde{q}_s$ for a mismatched demand profile, where:
$$S = \max_i \frac{p_i}{\hat{p}_i} = \frac{p_k}{\hat{p}_k}$$
This scale factor S remains the same regardless of the choice of confidence value, $1 - \alpha$. Equation (32) can be interpreted as reducing the number of "relevant" test demands to $N' = N/S$, which are distributed over classes $i = 1 \ldots m$ according to the new profile, where:
$$n'_i = p_i N' = \frac{p_i N}{S}$$
For the demand class k that determines S
$$n'_k = \frac{p_k N}{S} = \hat{p}_k N = n_k$$
This means that all $n_k$ test demands for class k are included in the bound calculation for the new profile. For other demand classes $n'_i < n_i$, so effectively some test demands for these classes are "discarded" if they do not fit into the new profile, resulting in a reduced number of tests $N'$ that are considered "relevant" when deriving the revised confidence bound.
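The scale factor and the "relevant" test counts can be computed as follows. This is a sketch under our reading of the text, namely S as the largest ratio of operational to test probability, N/S relevant tests in total, and those tests spread over the classes in proportion to the operational profile. Function names and the example vectors are hypothetical.

```python
import math


def scale_factor(p, p_hat):
    """Return (k, S): the class with the largest p_i / p_hat_i and that ratio."""
    worst_k, worst_ratio = None, 0.0
    for i, (pi, phi) in enumerate(zip(p, p_hat)):
        if pi == 0:
            continue
        # A class that occurs in operation but was never tested gives an unbounded ratio.
        ratio = math.inf if phi == 0 else pi / phi
        if ratio > worst_ratio:
            worst_k, worst_ratio = i, ratio
    return worst_k, worst_ratio


def relevant_tests(n, p):
    """Number of tests per class that remain 'relevant' under operational profile p."""
    total = sum(n)
    p_hat = [ni / total for ni in n]
    _, s = scale_factor(p, p_hat)
    relevant_total = total / s
    return [pi * relevant_total for pi in p]


if __name__ == "__main__":
    n = [600, 300, 6, 594, 1500]          # hypothetical test vector, N = 3000
    p = [0.2, 0.1, 0.102, 0.198, 0.4]     # hypothetical operational profile
    print(scale_factor(p, [ni / sum(n) for ni in n]))
    print(relevant_tests(n, p))
```

In this hypothetical case only about 59 of the 3000 tests remain relevant, all six tests of the worst-case class among them.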

Compensating for Profile Uncertainty
If there is an uncertainty ∆p in the probability $\hat{p}_k$ of a demand class, then for the worst case demand class, the worst possible scale factor would be:
$$S = \frac{\hat{p}_k + \Delta p}{\hat{p}_k}$$
This is a particular concern if $\hat{p}_k$ is very small relative to the uncertainty. For example, if there is a rare demand scenario which is estimated to be $\hat{p}_k = 10^{-6}$ but the uncertainty $\Delta p = 10^{-4}$ then it is possible that the bound would increase by two orders of magnitude in actual operation if the demand class occurs at the maximum possible rate. Sensitivity to uncertainty in the demand probability can be reduced if extra tests $\Delta n_k$ are performed such that:
$$\Delta n_k = N \Delta p$$
This test strategy is illustrated in Figure 4. The extra $\Delta n_k$ tests would not be relevant for the expected profile, so the bound based on N remains unchanged. But if the actual demand probability for the class lies between $\hat{p}_k$ and $\hat{p}_k + \Delta p$, then some or all of the extra $\Delta n_k$ tests are included, and a corresponding number of tests on other demand classes are excluded, so the total number of relevant tests (and hence $\tilde{q}_s$) remains unchanged.
An alternative strategy is to set a lower bound for the number of tests $n_{min}$, so that extra tests are performed if the number of tests for a demand class (under the expected profile) falls below the lower limit, i.e. for all classes $i = 1 \ldots m$:
$$\Delta n_i = n_{min} - n_i, \quad n_i < n_{min}$$
Provided the actual demand probability for all classes i satisfies
$$p_i \le \hat{p}_i + \frac{\Delta n_i}{N}$$
the confidence bound will not increase relative to the bound derived with the original profile $\hat{\mathbf{p}}$ and test vector $\mathbf{n}$. In practice, the extra testing required to accommodate demand profile uncertainty is likely to be fairly modest. For example, if there are 50 very infrequent demand classes and $n_{min} = 5$, no more than 250 extra tests would be needed compared to the 4600 tests needed for a 99% confidence in a pfd of $10^{-3}$ based on the assumption that the test demands and the actual profile match perfectly.
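Both padding strategies are simple to express in code. The sketch below is illustrative only: it assumes padding of $N \Delta p$ extra tests per class for the uncertainty-based strategy and $n_{min} - n_i$ extra tests for the minimum-count strategy, and it reports the largest demand probability each padded class can tolerate, $(n_i + \Delta n_i)/N$, without increasing the original bound. All function names and the example vector are hypothetical.

```python
import math


def padding_for_uncertainty(n, delta_p):
    """Extra tests per class so that a shift of up to delta_p is covered."""
    total = sum(n)
    # The small tolerance guards against floating-point noise before rounding up.
    extra = math.ceil(total * delta_p - 1e-9)
    return [extra] * len(n)


def padding_to_minimum(n, n_min):
    """Extra tests per class so that every class has at least n_min tests."""
    return [max(0, n_min - ni) for ni in n]


def tolerated_profile(n, extra):
    """Largest p_i per class for which the original bound is not exceeded."""
    total = sum(n)  # N is the original total, before padding
    return [(ni + di) / total for ni, di in zip(n, extra)]


if __name__ == "__main__":
    n = [600, 300, 6, 594, 1500]  # hypothetical test vector, N = 3000
    print(padding_for_uncertainty(n, 0.1))
    print(tolerated_profile(n, [0, 0, 300, 0, 0]))
    # 50 infrequent classes with no tests under the expected profile, n_min = 5.
    print(sum(padding_to_minimum([0] * 50, 5)))
```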
If we know very little about the profile and can only specify an upper bound $p_{max}$ for every demand class, then a much greater number of tests will be needed to assure that some target $q_t$ will be met for all possible profiles p subject to this constraint. From equation (26), since $p_k \le p_{max}$,
$$\tilde{q}_s = \frac{p_k}{n_k} \ln\frac{1}{\alpha} \le \frac{p_{max}}{n_k} \ln\frac{1}{\alpha}$$
Hence the number of tests $n_i$ required for every demand class i to have confidence in a target bound $q_t$ is
$$n_i \ge \frac{p_{max}}{q_t} \ln\frac{1}{\alpha}$$

Numerical example
The effect of adding extra tests to compensate for profile uncertainty can be illustrated using the simple test vector shown in Table 1. The approximate and exact 95% confidence (α = 0.05) bounds are shown in Table 2 for the case where the test and operational profiles match, i.e. $\mathbf{p} = \hat{\mathbf{p}}$. All the bound values are close to $10^{-3}$ at the 95% confidence level and the relative over-estimation error for $\tilde{q}_s$ is around 0.05%. The "single bin" Lagrange lower bound estimate $q^*_s$ is 0.4% less than the true bound.
If there is uncertainty in the operational profile of ∆p = 0.1, then from (34) the scaling of the confidence bound, S, is bounded by $\max_i (\hat{p}_i + \Delta p)/\hat{p}_i$. The maximum scale-up of the confidence bound occurs when ∆p is applied to the lowest probability bin (i = 3) where the scale factor is S = (0.002 + 0.1)/0.002 = 51. The bound estimates for the revised profile (where $p_3 = \hat{p}_3 + \Delta p$ and $p_5 = \hat{p}_5 - \Delta p$) are shown in Table 3. It can be seen that, in this case, the true bound $q_s$ is the same as the Lagrange lower bound estimate $q^*_s$. This is likely to occur whenever $p_i \gg \hat{p}_i$, as the Lagrange equivalent to the single bin solution represents the worst case bound. The example also shows that the single bin approximation $\tilde{q}_s$ is a conservative upper estimate. The upper estimate is consistent with the maximum error of 27% predicted in (30).
The confidence bound and its relative error can be decreased dramatically if the test vector n is padded so that demand class i = 3 is no longer the worst case bin. If every class is padded by $\Delta n_i = N \Delta p = 300$, the original confidence bound is guaranteed if the departures from p do not exceed ∆p. Table 4 shows the bound equation results for these two scenarios. We can see that if demand class i = 3 is padded with 300 extra tests (a 10% increase on the original total) the worst case bound only increases by a factor of 2.3 (rather than 40). With 300 extra tests for all classes (a 50% increase on the original total) the original confidence bound will always be met.
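The headline figures for the mismatched profile can be checked by hand. The working below is our own sketch. It assumes $n_3 = 6$ tests in the worst-case class, which we infer from $\hat{p}_3 = 0.002$ and $N = 3000$ (the latter from $N \Delta p = 300$), and it assumes the single-bin forms $\tilde{q}_s = (p_k/n_k)\ln(1/\alpha)$ and $q^*_s = p_k(1 - \alpha^{1/n_k})$.

```latex
\begin{align*}
S &= \frac{\hat{p}_3 + \Delta p}{\hat{p}_3} = \frac{0.002 + 0.1}{0.002} = 51 \\
\tilde{q}_s &= \frac{p_3}{n_3}\ln\frac{1}{\alpha}
             = \frac{0.102}{6} \times \ln 20
             \approx 0.0170 \times 2.996
             \approx 50.9 \times 10^{-3} \\
q^*_s &= p_3\left(1 - \alpha^{1/n_3}\right)
       = 0.102 \times \left(1 - 0.05^{1/6}\right)
       \approx 0.102 \times 0.393
       \approx 40.1 \times 10^{-3} \\
\frac{\tilde{q}_s}{q^*_s} &\approx \frac{50.9}{40.1} \approx 1.27
\end{align*}
```

The ratio of about 1.27 corresponds to the 27% maximum over-estimate quoted above, and the scale-up from roughly $10^{-3}$ to $40.1 \times 10^{-3}$ corresponds to the factor of 40.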

Generalization of the Conservative Bound Method
The approach outlined above is expressed in terms of a profile of "demands" that represent the occurrence of some event external to the system. However the theory can be applied more broadly if different interpretations of a "profile" are used. Some alternative profile definitions which extend the applicability of the conservative reliability bound estimation method for an arbitrary profile are discussed below.
As a result, the strategies used in Section 7 for reducing sensitivity to operational profile changes are equally applicable in these new contexts. In particular, it is desirable to perform extra testing on the worst element of the system k which dominates the bound estimate.

Equivalence Class Coverage
In this definition, each "demand class" i represents a specific equivalence class in the input space of the program. As equivalence classes are disjoint (like demands) the same parameters p, n can be used to characterise the operational profile and the test vector, but the bound $\tilde{q}_s$ relates to the probability of failure per program execution, where the bound $\tilde{q}_s$ at confidence level $(1 - \alpha)$ is
$$\tilde{q}_s = \frac{p_k}{n_k} \ln\frac{1}{\alpha}$$
and k is the equivalence class with the minimum value of $n_i / p_i$.

Structural Coverage
In this definition, each element i represents a different element within the software structure (such as a code segment). In that case the test vector n represents the number of executions of the segments during reliability testing. However, we cannot use an operational profile of probabilities p that sum to unity because the segment executions are not, in general, disjoint. For example, a sequence of code segments connected in series would all be executed at the same time. Furthermore, code segments can be executed inside a program loop and hence be executed multiple times for each invocation of the overall program. As a result, we need to define the profile as a vector of module executions x where $x_i$ represents the number of executions of segment i for each execution of the overall program. With multiple executions of the same segment and a sequence of segments being executed, it is possible that several segments will fail during the same program execution cycle and hence be merged into a single failure at the program execution level. We make the conservative assumption that segment failures will never merge. As a result, equation (9) can be redefined as
$$Q_S(\mathbf{q}, \mathbf{x}) = \sum_{i=1}^{m} x_i q_i$$
As this equation is formally identical to (9), the demand-based analysis and approximations still apply, so the conservative bound on the probability of failure per program execution, $\tilde{q}_s$, at confidence level $(1 - \alpha)$ is
$$\tilde{q}_s = \frac{x_k}{n_k} \ln\frac{1}{\alpha}$$
where k is the segment with the minimum value of $n_i / x_i$.
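A structural-coverage version of the bound can be sketched in the same way as the demand-based one. The code assumes the bound is $\max_i (x_i/n_i)\ln(1/\alpha)$, with $x_i$ the executions of segment i per program run and $n_i$ the executions of that segment during testing. The function name and the segment counts in the usage example are hypothetical.

```python
import math


def structural_bound(test_executions, executions_per_run, alpha):
    """Conservative bound on failure probability per program execution."""
    if len(test_executions) != len(executions_per_run):
        raise ValueError("vectors must have the same length")
    budget = math.log(1.0 / alpha)

    worst_ratio = 0.0
    for ni, xi in zip(test_executions, executions_per_run):
        if xi == 0:
            continue  # segment never executed in operation
        if ni == 0:
            raise ValueError("a segment used in operation was never executed in test")
        worst_ratio = max(worst_ratio, xi / ni)
    return worst_ratio * budget


if __name__ == "__main__":
    # Hypothetical segments: mainline code, a loop body run 4 times per call,
    # and a rarely used branch taken in 5% of program executions.
    n = [3000, 12000, 30]
    x = [1.0, 4.0, 0.05]
    print(structural_bound(n, x, alpha=0.05))
```

In this hypothetical case the rarely used branch dominates the bound, which is the situation where extra testing of the worst element pays off.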

Execution Time
For continuous time where a test vector t represents the test execution times without failure for a set of components, the single-bin exponential model provides an exact bound rather than an approximation (provided the failures for the elements i are disjoint). As a result, the analysis is formally identical to the previous analyses, so the confidence bound for the system failure rate per unit time, $\lambda_s$, given an operational profile p of disjoint execution probabilities per unit time would be:
$$\lambda_s = \frac{p_k}{t_k} \ln\frac{1}{\alpha} \qquad (40)$$
where k is the component with the minimum value of $t_i / p_i$. This model can be extended to a concurrently executing set of components where the profile is expressed as a usage factor u. Note that the individual terms $u_i$ represent the proportion of time that the component is running. The usage factor can be greater than unity if multiple instances of the same component run concurrently. We can construct a normalised operational profile where there is a disjoint execution probability per unit time for component i such that $p_i = u_i / \sum_{j=1}^{m} u_j$. Substituting this into (40) and rescaling by $\sum_{j=1}^{m} u_j$, we obtain:
$$\lambda_s = \frac{u_k}{t_k} \ln\frac{1}{\alpha}$$
The scaling of the operational profile means that different components will be executing at the same time. Since simultaneous (i.e. non-disjoint) component failures would be merged into a single failure at the system level, this effectively reduces the observed failure rate. So this equation remains a conservative upper bound even if the assumption of disjoint failures does not apply.
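The usage-factor form can be illustrated with a short calculation. The sketch assumes the bound is $\max_i (u_i/t_i)\ln(1/\alpha)$ and checks that normalising the usage factors to a disjoint profile and then rescaling by their sum gives the same value. Function names, component times and usage factors are hypothetical.

```python
import math


def failure_rate_bound(test_times, usage, alpha):
    """Conservative bound on system failure rate per unit time."""
    if len(test_times) != len(usage):
        raise ValueError("vectors must have the same length")
    budget = math.log(1.0 / alpha)
    ratios = []
    for ti, ui in zip(test_times, usage):
        if ui == 0:
            continue  # component not used in operation
        if ti <= 0:
            raise ValueError("a component used in operation has no failure-free test time")
        ratios.append(ui / ti)
    return budget * max(ratios) if ratios else 0.0


def failure_rate_bound_normalised(test_times, usage, alpha):
    """Same bound via a normalised disjoint profile, rescaled by total usage."""
    total_usage = sum(usage)
    profile = [ui / total_usage for ui in usage]
    return total_usage * failure_rate_bound(test_times, profile, alpha)


if __name__ == "__main__":
    t = [10000.0, 2000.0, 500.0]   # hypothetical failure-free test hours
    u = [1.0, 2.0, 0.1]            # hypothetical usage factors (two instances of component 2)
    print(failure_rate_bound(t, u, alpha=0.05))
    print(failure_rate_bound_normalised(t, u, alpha=0.05))
```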

Relationship to Earlier Work
There has been extensive research on the use of statistical methods for estimating software reliability using realistic operational scenarios [12,13]. Adaptive testing strategies have been used to estimate confidence intervals (such as [14,15]) but these strategies are designed to adapt the test profiles once failures are observed, so they are not applicable to testing high integrity systems where no failures are expected.
Musa [16] and Crespo et al. [17] have modelled the impact of different operational profiles based on reliability growth during testing, but this is not directly applicable to high integrity systems where we do not expect to observe failures in the final test phase.
Bishop [18] used an operational profile characterised by the execution of code segments within the program to rescale a prior reliability bound, but in [18] the derivation of the reliability bound required an estimate of residual faults and was not explicitly related to a confidence level.
Miller et al. [19] considered the impact of operational profile on reliability estimation and suggested discarding test results that did not conform to the new profile. However, there was no formal justification for discarding tests and this was proposed in the context of preparing input data for a Bayesian reliability analysis. Our analysis formally justifies the use of a "relevant" test subset in the context of a frequentist confidence bound model.
Ehrenberger [20] proposed a frequentist confidence bound model for a new operational profile which asserts that: However, it can be shown that the Ehrenberger model is only valid for cases where $p_i \propto n_i$ for a subset of the demand classes i and the remaining elements in the profile p are zero. For other non-matching profiles, the Ehrenberger model can produce non-conservative results. For example, if the model is applied to the example in Section 8 we obtain $q_s = 2.07 \times 10^{-3}$, which is significantly less than both the true confidence bound ($40.1 \times 10^{-3}$) and our conservative approximation ($50.9 \times 10^{-3}$).

Summary and Conclusions
This paper has presented a conservative analytic method for estimating the reliability bound given a specified confidence level, a set of test demands and an arbitrary operational profile. Based on this model we show that the "scale-up" in failure rate can be highly sensitive to uncertainties in demand probability of infrequent demand classes. We also show that adding some "padding" tests for infrequent demand classes can ensure that the original confidence bound will remain valid for a range of demand probabilities for a given class.
We have also shown that the same conservative bound estimation method can be applied in other contexts, e.g. where testing is defined in terms of time rather than demands and equivalence domains rather than demand classes.
Thus, for every stationary point r , either λ i = 0 or r i = 1, i = 1 : m Therefore, we can identify every stationary point with a binary vector (r i ) ni = α;