Assessing Confidence in Frontier AI Safety Cases

This is the abstract and executive summary of the paper I co-authored with Stephen Barrett, Philip Fox, Joshua Crook, Tuneer Mondal and Alejandro Tlaie as part of the Arcadia Impact AI Governance Taskforce. My contribution focused on defeaters (claims which, if true, invalidate the argument) and, where multiple defeaters exist, how safety teams should prioritise their investigation.

Link to the full paper published on arXiv

Abstract

Powerful new frontier AI technologies are bringing many benefits to society but at the same time bring new risks. AI developers and regulators are therefore seeking ways to assure the safety of such systems, and one promising method under consideration is the use of safety cases. A safety case presents a structured argument in support of a top-level claim about a safety property of the system. Such top-level claims are often presented as binary statements, for example “Deploying the AI system does not pose unacceptable risk”. However, in practice, it is often not possible to make such statements unequivocally. This raises the question of what level of confidence should be associated with a top-level claim. We adopt the Assurance 2.0 safety assurance methodology, and we ground our work by applying it to a frontier AI inability argument that addresses the harm of cyber misuse. We find that numerical quantification of confidence is challenging, though the process of generating such estimates can itself improve the safety case. We introduce a method for improving the reproducibility and transparency of probabilistic confidence assessments of argument leaf nodes through a purely LLM-implemented Delphi method. We also propose a method by which AI developers can prioritise the investigation of argument defeaters, making that investigation more efficient. Finally, proposals are made on how best to communicate confidence information to executive decision-makers.

Executive Summary

What does it mean to have confidence in a safety case and why is such confidence needed?

Safety cases generally make binary claims such as “Deploying the AI system does not pose unacceptable cyber risk”. However, a decision-maker or evaluator is then left asking: what confidence is there that the claim is true?

Part of the answer involves showing that the argumentation is logically sound and complete. Another part is to express the degree of belief in the top-level claim as a probability. Such probabilistic quantification of confidence would be valuable because it could help an AI developer to identify which doubts about the safety case matter most, and to determine when sufficient safety assurance work has been done.

All these aspects are covered by a safety assurance framework called Assurance 2.0, and it is this framework which we have applied to the problem of establishing confidence in a frontier AI ‘inability’ safety argument for the case of cyber misuse.  

How can confidence be assessed in a way that is as reproducible, transparent and objective as possible?

It can be difficult for third parties to have high confidence in an argument produced by an AI developer when the probabilistic evaluations of a claim’s degree of truth are not reproducible, whether because they were made subjectively or because the underlying reasoning is not transparent.

The process of establishing and assessing confidence in safety cases can require assigning probabilities to possible future events or outcomes, often conditioned on existing relevant information or evidence. We describe a method for addressing this problem using a variation of the Delphi method in which the role of the experts is performed by LLMs. This approach to establishing probabilistic confidence in safety cases is reproducible and has the potential to offer more transparency than valuations provided by human experts. We believe the approach is worthy of further investigation.
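To make the idea concrete, the following is a minimal sketch of a Delphi loop with LLM “experts”, not the paper’s implementation: `initial_estimate` and `revise` are stubs standing in for real LLM calls, and the convergence rule (standard deviation of the panel’s estimates below a tolerance) is one simple choice among several.

```python
# Illustrative Delphi sketch: "experts" give probability estimates for a
# claim, see the panel median, and revise; rounds repeat until the panel
# converges. The two functions below are placeholders for LLM calls.
from statistics import median, pstdev

def initial_estimate(expert_id, claim):
    """Placeholder for a first-round LLM call returning P(claim is true).
    A real implementation would prompt a model with the claim and the
    supporting evidence."""
    return {0: 0.90, 1: 0.95, 2: 0.80}[expert_id]

def revise(previous, panel_median):
    """Placeholder for a later-round LLM call: the 'expert' sees the
    panel statistics and moves partway toward the median."""
    return previous + 0.5 * (panel_median - previous)

def delphi(claim, n_experts=3, max_rounds=10, tol=0.01):
    estimates = [initial_estimate(i, claim) for i in range(n_experts)]
    for _ in range(max_rounds):
        m = median(estimates)
        if pstdev(estimates) < tol:  # panel has converged
            break
        estimates = [revise(e, m) for e in estimates]
    return median(estimates)

print(delphi("Model cannot meaningfully assist cyber misuse"))  # prints 0.9
```

Because the stub experts are deterministic, re-running the loop reproduces the same panel estimate, which is the property the LLM-based variant aims to retain (with temperature-zero or seeded sampling) relative to human panels.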

How to identify defeaters (i.e. doubts and challenges about the safety case)?

Given the inevitable blind spots of AI developers in assessing their own systems, a comprehensive search for defeaters requires critical review by both internal and external (third-party) challengers. We suggest following a “dialectical method”, wherein safety case challengers have the explicit goal of finding flaws in the AI developer’s reasoning and are thus well-placed to counter biases that may otherwise exist within the teams developing the AI system and its safety case.

How can it be quantitatively determined that confidence in a safety case is sufficient for the system to be deployed?

Belief in the claims of an argument can be quantified probabilistically, for example by a variation of the Delphi method described above.  These argument leaf-node probabilities can be aggregated (propagated up through the safety case) to obtain a quantified probabilistic assessment of confidence in the top-level claim.

We applied Assurance 2.0’s ‘sum of doubts’ and ‘product’ methods for probabilistic quantification to a small fragment of a cyber misuse safety case.  

We found that, for both methods, achieving a modestly high (95%) confidence in the overall claim for this small safety case fragment (comprising just 7 argument components) required very high (~99.3%) confidence in each component.
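The arithmetic behind these figures is straightforward to check. The snippet below reproduces the ~99.3% per-component requirement under both aggregation rules, assuming independent, conjunctive components as in the fragment discussed:

```python
# Back-of-envelope check: with 7 independent conjunctive argument
# components, what per-component confidence yields 95% confidence in
# the top-level claim?
n = 7
target = 0.95

# Product method: top-level confidence = product of component confidences,
# so each component needs target ** (1/n).
per_component_product = target ** (1 / n)

# Sum-of-doubts method: top-level doubt = sum of component doubts,
# so each component may carry only (1 - target) / n doubt.
per_component_doubts = 1 - (1 - target) / n

print(f"product method: {per_component_product:.4f}")  # 0.9927
print(f"sum of doubts:  {per_component_doubts:.4f}")   # 0.9929
```

Both rules land at roughly 99.3%, matching the figure quoted above; the sum-of-doubts rule is marginally less demanding because it is a first-order approximation of the product.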

We conjecture that it would be very challenging to achieve such high levels of confidence in the argument leaf nodes, particularly where the assessments require judgements about uncertain quantities (e.g. future outcomes). We anticipate that the challenge of achieving high confidence will only grow when considering the safety argument as a whole, rather than just a small fragment of it.

It should be cautioned, however, that our results for the cyber misuse safety case fragment were for a particular argument structure (logically conjunctive and independent claims). It would be worth considering how the picture might change, and potentially improve, where alternative argument structures are possible, for example where multiple diverse sub-arguments can be used to support a single claim.
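As a toy illustration of why diverse support can help (our numbers, not the paper’s): if two independent sub-arguments each give 90% confidence in the same claim, and the claim fails only if both sub-arguments fail, the residual doubts multiply rather than add:

```python
# Two independent, diverse sub-arguments each supporting the same claim
# at 90% confidence. If the claim fails only when both sub-arguments
# fail, residual doubt is the product of the individual doubts.
doubt_a = doubt_b = 0.10
combined_confidence = 1 - doubt_a * doubt_b
print(round(combined_confidence, 4))  # 0.99
```

The independence assumption is doing the work here; correlated failure modes between the sub-arguments would erode the benefit.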

Things might also improve where a probabilistic target forms an intrinsic part of the argument itself. In summary, it remains an open question whether realistic and practical methods can be found to achieve the very high probabilistic confidences that safety cases for the most serious, or even catastrophic, harms would require. Our results certainly point to this being a non-trivial challenge, and one that will require further investigation.

How can AI developers prioritise which defeaters to tackle first?

Defeaters may be eliminated through modification of the system and/or the safety case. The order in which these doubts are tackled affects how efficiently an AI developer’s workforce can converge on an acceptable system design and associated safety case. We describe a methodology that AI developers can follow which takes into account a defeater’s potential impact on probabilistic confidence in the top-level claim, its potential impact on the logical soundness of the argument, the probability of the defeater being sustained, and the expected effort required to resolve it.
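One simple way those four factors could be combined is an expected-damage-per-effort score; the sketch below is our illustration, not the paper’s formula, and the defeater names, probabilities and effort figures are invented for the example.

```python
# Illustrative defeater triage: rank defeaters by expected damage to the
# safety case per unit of resolution effort. All numbers are hypothetical.
from dataclasses import dataclass

@dataclass
class Defeater:
    name: str
    p_sustained: float        # probability the defeater turns out to be valid
    confidence_impact: float  # drop in top-level confidence if sustained (0-1)
    breaks_soundness: bool    # would it invalidate the argument's logic?
    effort_days: float        # expected effort to resolve

def priority(d: Defeater) -> float:
    # Soundness-breaking defeaters are treated as maximal impact, since
    # they undermine the argument regardless of leaf-node confidence.
    impact = 1.0 if d.breaks_soundness else d.confidence_impact
    return d.p_sustained * impact / d.effort_days

defeaters = [
    Defeater("eval dataset unrepresentative", 0.4, 0.30, False, 10),
    Defeater("argument step is a non sequitur", 0.2, 0.00, True, 2),
    Defeater("elicitation may be sandbagged", 0.1, 0.50, False, 20),
]
for d in sorted(defeaters, key=priority, reverse=True):
    print(f"{d.name}: {priority(d):.4f}")
```

Under this scoring, the cheap-to-resolve logical flaw outranks the higher-probability evidential doubt, reflecting the intuition that soundness problems are both severe and often quick to investigate.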

How can the level of confidence in a safety case best be communicated to executive decision-makers?

Decisions on whether to deploy an AI or cease its operation will be made by executives who may have limited technical background.  To the extent possible, we advocate for the use of visual as opposed to textual communication of confidence.

Introduction

‘Frontier AI’ refers to a class of the most advanced, highly capable, general-purpose AI models that can perform a wide variety of tasks, and which currently primarily encompass foundation models consisting of very large neural networks using transformer architectures. These powerful new frontier AI technologies can bring many benefits to society, but at the same time can bring new risks. Companies that are developing frontier AI systems are therefore seeking methods to assure the safety of their systems before and during deployment, and potentially even before commencing training. One promising method for assuring the safety of such systems is through the use of safety cases (Buhl et al. 2024).

Safety cases make use of a structured argument that is supported by evidence in order to make a claim about the safety properties of a system. Whilst many elements of a safety case may be very well supported, in a typical safety case, there will nevertheless remain some sources of uncertainty and residual doubt. These uncertainties may, by way of example, occur due to the inherent nature of the argument that is being made (e.g. including inductive components vs. purely deductive), due to the type and quality of the evidence provided or due to a multitude of other possible factors. Hence, when making a decision on whether or not to deploy an AI model, a decision maker can be expected to benefit from information detailing firstly what level of confidence should be placed in the assurance claim that is being made, and secondly the specifics of the main sources of uncertainty and residual doubt that influence this level of confidence.

This paper applies one of the foremost approaches for building a safety case and determining an associated level of confidence, Assurance 2.0 (Bloomfield and Rushby 2024a), to identify what particular lessons can be drawn when assessing confidence in safety cases for frontier AI systems. To illustrate how our findings can be applied in practice, we show how the concepts apply to the safety case template for an inability argument for the cyber misuse harm described in (Goemans et al. 2024). In an ‘inability’ argument, a case is made that the AI system lacks the ability to cause harm. This contrasts with other types of argument, for example a ‘control’ argument, where the AI system does have the ability to cause harm but is controlled so as not to do so, or a ‘trustworthiness’ argument, where the AI system again has the ability to cause harm but a case is made that it can be trusted not to do so (Clymer et al. 2024). We focus on the cyber misuse harm and the inability safety case not only because of the availability of the safety case template, but also because dealing with cyber misuse and creating good inability safety cases represent near-term, and hence high-priority, challenges for the community. Cyber misuse is also a class of harm for which there is a relatively well-developed understanding of risk.

In Section 2 of the paper the rationale for selecting the Assurance 2.0 methodology is described and a high-level introduction to the technique is provided. Section 3 describes how confidence in the logical validity and soundness of the safety case can be attained. Section 4 describes how to assess probabilistic confidence in the leaf nodes of an argument. Section 5 describes how confidence can be enhanced through processes that enable the case to be challenged such that sources of doubt can be surfaced. Section 6 describes how to gain confidence when residual risks must remain in the safety case. Section 7 addresses the question of how quantified assessments of confidence, made for the various elements comprising the argument, can be propagated up into an overall confidence assessment of the top-level claim of the safety case. Section 8 addresses the issue of how to prioritise the resolution of defeaters (doubts about the safety case). Section 9 identifies the preferred form that confidence level information should take for it to be communicated to, and be most actionable by, executive decision-makers. Finally, the overall conclusions are provided in Section 10.

