Validity and Clustering of AI Safety Incident Causality Analysis

Summary

  • Expanding on a previous project implementing Root Cause Analysis (RCA) of AI Safety Incidents using a Large Language Model, this article identifies and addresses some shortcomings of and potential objections to the original method, including the reliability and epistemology of responses.

  • Improvements are proposed based on LLM parameter optimisation (Temperature and Top-P) and Chain of Thought prompting.

  • Variations on these potential improvements are evaluated using the UK AI Safety Institute’s Inspect framework and the optimal configuration taken forward to produce inputs for Cluster Analysis.

  • Objectives are identified for Root Cause Analysis including specific questions of interest to policy advisers, and a method is defined for conducting Cluster Analysis using Principal Component Analysis to create scatter plots.

  • Cluster Analysis is conducted on a sample of incident reports from the AI Incident Database with some interpretation and description of the identified clusters and a review of correlation between clusters and different types of harm.

  • The article closes by observing some of the shortcomings of Root Cause Analysis in the context of AI Safety and looks at Systems Theoretic Process Analysis as a potential alternative or supplementary framework, proposing future work to explore this further.

AI Safety Fundamentals Capstone Project

This post is submitted as my capstone project for the BlueDot Impact AI Safety Fundamentals Governance course. I intend to build on it going forward so feedback through this form will be greatly appreciated.

From a project perspective, the objectives are focused on the exploration of validity evaluation and potential routes to improvement as well as developing a workflow for cluster analysis. I recognise that this scope is already broad and running cluster analysis on a larger sample to gain more robust insights will fall to future work. As such, the outputs of this current project can feed into a research agenda to understand more about how LLMs can best be used in causality analysis of AI safety incidents.

Introduction

A preliminary framework has been proposed to use Large Language Models to process AI safety incident reports, conducting Root Cause Analysis (RCA) that could be used to inform policy decisions relating to the prioritisation of risk mitigation and societal adaptation measures. The use of LLMs is intended to provide scalability, creating capacity to process the expected significant increase in the number of AI-related safety incidents and allowing more value and learning to be extracted than would be possible by human teams conducting RCA.

The following process flow diagram shows how Causality Analysis could contribute to informing the Prioritisation of Effort on AI Safety Risk Interventions within a Risk Management framework.

PROCESS FLOW: Prioritisation of Effort on AI Safety Risk Interventions

Cluster Analysis of Causes

Whilst understanding individual incidents can help us to prevent or mitigate exactly the same thing occurring in future, there would be significant benefits to studying large datasets of safety incidents in order to understand patterns and clusters. Where commonalities exist, it may be possible to eliminate multiple potential causes by putting in place a single intervention or risk mitigation measure, thereby allowing limited budgets and resources to be used as efficiently as possible.

Objectives of Root Cause Analysis: Validity Requirements

As outlined in the Theory of Change here, Root Cause Analysis could be used to work towards a number of different outcomes. Depending on the objective, the rigour and validity requirements of the analysis will be different:

To rigorously understand causal factors for historic incidents based on known causal pathways, a high degree of confidence in the validity and epistemology of the analysis is required. The next step following this analysis would be to make policy decisions based on the identified causal pathways, so invalid claims could result in sub-optimal actions being taken.

However, if the objective is to propose previously unidentified causal pathways for historic incidents, then it would be acceptable to generate a number of 'possible' causes and causal pathways even if confidence in them is low. This anticipates follow-up analysis and hypothesis testing to validate or eliminate them. Such exploratory work benefits from proposing a larger quantity of novel causes rather than outputting only the few contributing factors that have high certainty, as it is through these novel causal pathways that new mitigation or harm reduction measures might be identified.

For the assessment of potential future scenarios, there is even less certainty because the incident itself will be modelled or simulated, and the identified causal pathways depend on the fidelity of the simulation. RCA in this case will again be used to provide preliminary hypotheses that require further validation.

Validity requirements and acceptance of novel analyses depend on objectives of RCA

This type of RCA does not attempt to understand the causes of AI behaviours within the ML models themselves, which for our purposes are treated as black boxes. That is the domain of interpretability research, a field in itself. Instead, we focus on causes that apply at a higher 'system' level.

Comparison with Human Analysis

The use of LLMs to conduct RCA offers the benefit of scalability over human teams; however, it is important to understand if and how the analysis conducted by models differs from that conducted by humans. This is worthy of an in-depth study in itself, but as a preliminary investigation I selected Incident 74 from the AIID, which covers the wrongful arrest of a man following the use of facial recognition technology by law enforcement in Detroit.

The AIID is curated by a team of human editors who use the CSETv1 and GMF taxonomies to categorise information relating to incidents. For incident 74, the “Potential Technical Failures” are classified as “Dataset Imbalance, Generalization Failure, Underfitting, Covariate Shift”:

Taxonomy Classifications for AIID Incident 74

However, five human reviewers who read the reports unanimously identified a non-technical cause as the most significant contributing factor to the incident: Misuse of facial recognition as sole evidence. As the New York Times article quotes:

“This document is not a positive identification,” the file says in bold capital letters at the top. “It is an investigative lead only and is not probable cause for arrest.” NY Times Report on Incident 74

Baseline Reliability of LLM RCA Analyses

Initial exploration of the LLM-powered RCA used default model parameters and a prompt requesting that the LLM use only data that can be referenced directly in the incident report provided as context:

Initial RCA Prompt:
Use a Systems Engineering approach to extract potential root causes for the AI Safety Incident detailed in the report below.
Please ensure that all elements of the response are supported by content in the incident report. List the identified causes in order of significance with the most significant causal factor first. Each cause should be returned as a short phrase of no more than 6 words.
If there is no evidence in the incident report to make an assertion, just leave that part of the response empty.
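
As a reference point, a minimal sketch of how these repeated analyses might be issued via the OpenAI API is shown below. The gpt-4o model name, the run_rca helper and the report-loading step are illustrative assumptions rather than the exact script used.

# Minimal sketch of the repeated RCA queries, assuming the openai Python client.
# Helper names and the report-loading step are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RCA_PROMPT = """Use a Systems Engineering approach to extract potential root causes
for the AI Safety Incident detailed in the report below.
Please ensure that all elements of the response are supported by content in the
incident report. List the identified causes in order of significance with the most
significant causal factor first. Each cause should be returned as a short phrase of
no more than 6 words.
If there is no evidence in the incident report to make an assertion, just leave that
part of the response empty.

INCIDENT REPORT:
{report}"""

def run_rca(report_text: str) -> str:
    # Default parameters: temperature and top_p are left unspecified
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": RCA_PROMPT.format(report=report_text)}],
    )
    return response.choices[0].message.content

# Repeat the analysis 10 times on the same incident report to inspect variability
report = open("incident_74_reports.txt").read()  # hypothetical local copy of the AIID reports
analyses = [run_rca(report) for _ in range(10)]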

Top 5 significant causal factors as identified by LLM across 10 analyses of Incident 74

Across a sample of 10 analyses, the model always identified 'Lack of corroborating evidence' or 'Overreliance on technology by the police' within the top 5 contributing factors - these are the two causes which most closely relate in meaning to the cause identified by human reviewers ('Misuse of facial recognition as sole evidence'). However, it never ranked either of these as the most significant cause.

The model identified ‘Racial bias in facial recognition technology’ (70%) and ‘Faulty facial recognition technology’ (30%) as the most significant contributing factors.

Perfect agreement between the LLM analysis and human analysis is not essential and, as outlined above, may in some cases be sub-optimal if we hope to use LLM analysis to identify causal relationships that humans had not spotted. However, proceeding with a model that never matched the human-identified primary cause in what appears to be a fairly clear-cut and straightforward case did not seem wise; better to first understand more about the reliability and epistemology of the analysis.

It was apparent that running the analysis with this and similar prompts repeatedly on a single set of incident reports resulted in a wide range of outputs. In some instances the model identified causes that were supported by data in the prompt context but ranked them in different orders - this can be described as a lack of reliability or repeatability.

In other instances, the model claimed in its response that a cause existed for which there was no direct reference in the prompt context. This is a shortcoming in the epistemology of the analysis, particularly where the prompt explicitly requests that the model only respond with causes that are supported by evidence in the context. We can assume that these unsupported claims are based on the LLM's training data and can be described as hallucinations.

Improvements to RCA Reliability and Epistemology

These preliminary evaluations provide some insight into shortfalls in the model’s performance / configuration. We have a few levers that could be pulled in an attempt to reduce the extent of these failures:

  • Parameter Tuning

    • The temperature parameter in ChatGPT controls the randomness of the model's responses, with higher values producing more varied outputs and lower values (closer to 0) yielding more deterministic responses. Hypothesis: reducing the temperature parameter will improve the reliability of the analyses, providing a higher degree of confidence in the output; however, it may also reduce novelty, which may be undesirable if the objective is to identify previously unidentified causal pathways.

    • The Top-P parameter in ChatGPT, also known as nucleus sampling, controls the diversity of the model's responses by limiting token selection to the smallest set of tokens whose cumulative probability reaches the specified threshold (between 0 and 1). Hypothesis: a lower Top-P should yield more consistent and predictable results, focusing on the most common and likely causes identified in the reports, while a higher Top-P could return more varied and potentially insightful results, uncovering less likely but potentially still significant causes.

  • Chain of Thought Reasoning The Chain of Thought (CoT) technique was explored and described by Wei et al. (2022). It prompts the LLM to take a series of intermediate reasoning steps in order to answer questions that require more complex reasoning than can be covered in a single shot. Hypothesis: applying Chain of Thought reasoning should reduce the number of hallucinations, as the intermediate CoT steps are simpler to answer robustly. (A combined sketch of parameter tuning and CoT prompting follows this list.)

  • Retrieval Chain Whereas Chain of Thought focuses on logical reasoning and step-by-step problem-solving, a Retrieval Chain retrieves the specific data elements that contribute to the response in a sequential manner, and can be used in such a way as to retain the references. It can refer to external data sources such as databases or, depending on the model configuration and scaffolding, access online resources. Hypothesis: applying a Retrieval Chain to causality analysis would improve reliability, reducing hallucinations and providing traceability for human teams to review analyses and validate assertions against the input data (incident reports).
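
To make the first two levers concrete, the sketch below layers temperature, Top-P and a simple CoT instruction onto the baseline query from the previous sketch. The parameter values, the COT_SUFFIX wording and the run_rca_config helper are illustrative assumptions, not the exact configuration used later.

# Illustrative sketch: varying temperature/top_p and adding a simple CoT instruction
# to the baseline RCA query. Values and wording are assumptions for demonstration.
from openai import OpenAI

client = OpenAI()

COT_SUFFIX = "\nThink step by step, explaining your reasoning before listing the causes."

def run_rca_config(report_text: str, temperature: float, top_p: float, use_cot: bool) -> str:
    prompt = RCA_PROMPT.format(report=report_text)  # RCA_PROMPT as defined in the sketch above
    if use_cot:
        prompt += COT_SUFFIX
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,  # lower values -> more deterministic output
        top_p=top_p,              # lower values -> sample only from the most probable tokens
    )
    return response.choices[0].message.content

# Example configurations whose outputs could be compared for reliability
configs = [(0.2, 1.0, False), (0.9, 1.0, False), (0.2, 1.0, True), (0.9, 1.0, True)]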

Challenges with Evaluation Methodology

The results presented above were obtained by running a Python script, using the OpenAI API to iteratively prompt for causal analyses. It quickly became evident however that comparing samples poses some challenges:

  • When the model responds in natural language, comparing responses is non-trivial. Minor differences in wording may in some cases result in completely different meanings, and in other cases make no significant difference to the meaning. A subjective assessment of whether responses are 'equivalent' is unscientific, and it would be preferable to have a separate grading function.

  • Attempting to address this by requesting multiple-choice responses rather than natural language introduces limitations on the flexibility of the system: a set of potential causes to choose from for one incident may not be appropriate for another.

  • Introducing Chain of Thought prompting requires interfacing with a separate framework such as LangChain or developing one from scratch, which increases the complexity and maintenance burden.

To address these challenges, I used the Inspect evaluation tool developed by the UK AI Safety Institute (AISI).

Evaluation Framework: AISI Inspect

Inspect is a framework developed by the UK AI Safety Institute for Large Language Model evaluations which addresses some of the above challenges. I was very keen to gain hands-on experience using it, not only to provide insights into how the configuration of causality analysis by LLM affects output, but also to build proficiency with the tool, making the experiments a valuable learning experience.

Inspect is highly versatile and capable of evaluating complex models. However for the purposes of this article, a simple proof of concept analysis is modelled within the framework to identify the most significant contributing factor to Incident 74 using the reports from the AIID, in order to explore how changing parameters and using CoT prompting affects reliability of output and consistency of responses with those of human reviewers.

Walking through the setup and workflow:

Evaluation Task

A function is written, returning a ‘Task’ and defining the following components which make up the evaluation:

  • Dataset

  • Solvers (Plan)

  • Scorer

This example uses the incident_74 dataset that I extracted from the AIID, with chain of thought reasoning and multiple-choice answers, evaluated on the correctness of the choice.

# Imports assumed from the standard Inspect API (not shown in the original listing)
from inspect_ai import Task, task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import choice
from inspect_ai.solver import chain_of_thought, multiple_choice, system_message

@task
def incident74():
    return Task(
        # multiple-choice samples derived from the Incident 74 reports
        dataset=example_dataset("dataset_incident74"),
        # solver plan: system context, then CoT reasoning, then answer selection
        plan=[system_message(SYSTEM_MESSAGE),
            chain_of_thought(),
            multiple_choice()],
        # grade the selected letter against the target answer
        scorer=choice()
        )

Dataset

  • The prompt “Please select the most significant contributing factor” is prepended to the multiple choice list in the dataset.

  • A set of 100 LLM analyses, each producing the top 5 most significant causal factors, was taken as input to the process of defining the multiple-choice options. Duplicates were eliminated from the list, and causes with the same meaning but slight variations in wording were grouped together. The resulting list was:

A) Racial bias in algorithms
B) Inadequate dataset diversity
C) Structural racial bias in law enforcement
D) Another kind of racial bias
E) Flawed training of algorithm
F) Software bug causing unintended behaviour
G) Inadequate image quality
H) Lack of corroborating evidence
I) Lack of human oversight on use of technology
J) Misuse of Facial Recognition as Sole Evidence
K) Inadequate Validation of AI Outputs
L) Failure to Disclose Use of Facial Recognition
M) Poor Police Investigative Work
N) Lack of Training within law enforcement

  • The TARGET (correct answer which corresponds to that given by human reviewers) is: J) Misuse of Facial Recognition as Sole Evidence

dataset_incident74.jsonl:

{"input": "Please select the most significant contributing factor", "choices":["Racial bias in algorithms","Inadequate dataset diversity","Structural racial bias in law enforcement", "Another kind of racial bias", "Flawed training of algorithm", "Software bug causing unintended behaviour", "Inadequate image quality", "Lack of corroborating evidence", "Lack of human oversight on use of technology", "Misuse of Facial Recognition as Sole Evidence", "Inadequate Validation of AI Outputs", "Failure to Disclose Use of Facial Recognition", "Poor Police Investigative Work", "Lack of Training within law enforcement"], "target":"J"}

Plan

The solver task is based on a plan comprising the following components:

  • System Message This comprises the system context and the full text of the incident reports being analysed

SYSTEM_MESSAGE = """
You are a System Safety expert tasked with analysing AI safety incident reports to understand the most likely cause of the incident. Please follow a Systems Engineering approach and assume that the reader is also well versed in AI Safety and Systems Engineering.

INCIDENT REPORTS FOLLOW: 
<Full text from AI Incident Database pasted here: https://incidentdatabase.ai/cite/74/>
"""
  • Chain of thought Instructs the solver to include CoT reasoning. This optional component is omitted/included during different runs in order to compare the accuracy of the output with or without Chain of Thought reasoning.

  • Multiple Choice The ‘multiple_choice’ solver pairs with the ‘choice’ scorer

Scorer

  • Choice For open-ended questions where interpretation of the meaning of responses is required, the Model Graded scorers can assess whether responses contain a fact or content defined in a 'Target'. However, for this simple case with a multiple-choice prompt, the score is a function of the letter contained in the response after the ANSWER string (for example, a response ending with "ANSWER: J").

scorer=choice()

Model

The model is defined in a .env file (along with the API_KEY)

INSPECT_EVAL_MODEL=openai/gpt-4o

Running

The evaluation is run from the terminal using the inspect eval command, passing as arguments the filename for the python script and any model parameters such as Temperature or Top-P. For example:

> inspect eval incident74.py --temperature 0.2

Experiment 1: Chain of Thought Reasoning

To test the hypothesis that CoT would improve the accuracy of responses, I used the AISI Inspect framework to run the above query both with and without Chain of Thought reasoning, and with either default Temperature and Top-P parameters or values of 0.2 or 0.9 for each parameter. OpenAI do not explicitly state what the default values of Temperature or Top-P are if users leave them unspecified.
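
For the 'No Chain of Thought' runs, the chain_of_thought() solver is simply omitted from the plan. One possible way of wiring up the switch is sketched below; the cot task argument and the -T flag usage are assumptions about how this could be implemented, not the exact script used.

# Sketch: parameterising the task so CoT can be toggled per run.
# (imports as in the task definition above; the 'cot' argument is an assumption)
@task
def incident74(cot: bool = True):
    solvers = [system_message(SYSTEM_MESSAGE)]
    if cot:
        solvers.append(chain_of_thought())  # include the CoT solver only when requested
    solvers.append(multiple_choice())
    return Task(
        dataset=example_dataset("dataset_incident74"),
        plan=solvers,
        scorer=choice(),
    )

# e.g. run without CoT from the terminal (task args passed with -T):
#   inspect eval incident74.py -T cot=false --temperature 0.9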

A sample of 20 queries was submitted in each configuration and the responses scored by the evaluation Scorer. A response of (J) Misuse of Facial Recognition as Sole Evidence results in a ‘Correct’ grading. Any other response is graded as Incorrect.

Results

The Inspect framework provides detailed logs that can be explored further to view, for example, the CoT reasoning applied by the model in responding. Examples of these are attached in the Appendix.

The accuracy score represents the proportion of responses that were returned correctly matching the ‘target’: an accuracy score of 1 corresponds to ALL of the responses in that sample of 20 being graded as correct.

Analysis

With higher values of the Temperature or Top-P parameters, applying Chain of Thought reasoning improves the accuracy of the responses. There were no cases where CoT reduced the accuracy of the results.

This suggests that CoT should be applied for this type of analysis, where confidence in epistemology is important.

Experiment 2: Temperature / Top-P Parameter Values

Building on the results of Experiment 1, I wanted to select a Temperature or Top-P value to use for the cluster analysis experiments, assuming that CoT is applied. I therefore ran the query with Temperature and Top-P values of 0.1, 0.3, 0.5, 0.7 and 0.9.
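
A sketch of how such a sweep could be scripted is shown below, assuming that Inspect's Python eval() entry point accepts generation options such as temperature and top_p as keyword arguments; epochs=20 mirrors the 20 queries per configuration, but the exact mechanism used for the experiments is not shown here.

# Sketch: sweeping Temperature and Top-P values one at a time, assuming Inspect's
# Python eval() accepts generation options (temperature, top_p) as keyword arguments.
from inspect_ai import eval

values = [0.1, 0.3, 0.5, 0.7, 0.9]

for v in values:
    # vary temperature, leaving top_p at its default
    eval("incident74.py", model="openai/gpt-4o", temperature=v, epochs=20)

for v in values:
    # vary top_p, leaving temperature at its default
    eval("incident74.py", model="openai/gpt-4o", top_p=v, epochs=20)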

A sample of 20 queries was submitted in each configuration and the responses scored by the evaluation Scorer. A response of (J) Misuse of Facial Recognition as Sole Evidence results in a ‘Correct’ grading. Any other response is graded as Incorrect.

Results

As above, the accuracy score represents the proportion of responses that were returned correctly matching the ‘target’: an accuracy score of 1 corresponds to ALL of the responses in that sample of 20 being graded as correct.

Analysis

As previously noted, appetite for increased novelty in responses at the expense of reduced confidence will vary according to the objectives of analyses. For the purposes of this study, to generate potential causes from incident reports that could be used for cluster analysis, it is desirable to have as much variation as possible whilst still basing assertions only on data that exists in the context and not on similarities or pattern matching with implicit knowledge about incidents or reports in the model’s training data.

As hypothesised, higher values of both Temperature and Top-P result in lower-accuracy responses. The highest parameter value (allowing the most creative responses) before accuracy begins to drop is a Top-P of 0.5.

Therefore, from the data available, it is proposed that the RCA analyses of AI Safety Incidents used as input to Cluster Analysis are generated using a Top-P value of 0.5, with Chain of Thought reasoning applied.

I have chosen to only modify one of the Temperature / Top-P parameters at a time, leaving the other set as default. It will require deeper investigation to gain a fuller insight into the relationship between these parameters (particularly varying the combinations of the two values) and reliability / epistemology. Small sample sizes have been used simply as a matter of practicality for this study, based on cost of API token credits and available time, particularly for the earlier tests not using AISI Inspect, where post-processing of data was not automated and was laborious with larger samples. It is recognised that repeating the experiments in this study with larger samples would improve confidence in the conclusions reached.

Cluster Analysis

The high level objective of the cluster analysis is to identify patterns, correlations and clusters in causes of AI safety incidents in order to prioritise efforts on developing mitigation and harm prevention measures. For example, if it could be established that eliminating cause X would have prevented 10% of safety incidents classified as Severe, then an objective decision could be made as to whether to invest in intervention measures that address cause X.

Specific Questions

As a preliminary investigation, I have chosen to explore the distribution of different Harm Types, to see whether there are correlations between clusters and harm types. I have done so using the full dataset of identified causes, and have also split the causes out by category to understand whether certain categories cluster better than others.

Data Generation for Cluster Analysis

For the purposes of this post, I am focusing on 'potential cause identification'. As such, the cluster analysis uses Ishikawa/Fishbone diagrams rather than, for example, Fault Tree Analyses.

I have used the AISI Inspect framework to create Ishikawa diagrams for a sample of 50 incidents from the AI Incident Database. The AIID incident numbering system is roughly chronological, running from Incident 1 in 2015 through to Incident 746, the latest recorded at the time of writing (2024). In order to include incidents spread across this time period, I selected 50 incidents at random.

The raw data used is:

1. The taxonomic analysis of the incident as created by the AIID editing team.

I used an LLM to convert the textual taxonomic website content for each incident (example) to a JSON object. The 50 JSON objects are then used as one of the data inputs to the cluster analysis.
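
A minimal sketch of this conversion step is shown below, assuming the openai Python client's JSON-mode output; the prompt wording and the taxonomy_to_json helper are illustrative rather than the exact code used.

# Sketch: converting the taxonomy text from an AIID incident page into JSON,
# assuming the openai client's JSON-mode output. Prompt wording is illustrative.
import json
from openai import OpenAI

client = OpenAI()

def taxonomy_to_json(taxonomy_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},  # request a well-formed JSON object
        messages=[
            {"role": "system", "content": "Convert the following AI Incident Database taxonomy text into a JSON object, preserving field names and values."},
            {"role": "user", "content": taxonomy_text},
        ],
    )
    return json.loads(response.choices[0].message.content)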

Example JSON object derived from AIID taxonomic classification:

"incident": 1,
      "Report Count": 14,
      "Incident Date": "2015-05-19",
      "Editors": "Sean McGregor",
      "CSETv1 Taxonomy Classifications": {
        "Harm Distribution Basis": "none",
        "Sector of Deployment": "arts, entertainment and recreation, information and communication"
      },
      "CSETv0 Taxonomy Classifications": {
        "Severity": "Moderate",
        "Harm Distribution Basis": "Age",
        "Harm Type": "Psychological harm",
        "AI System Description": "A content filtering system incorporating machine learning algorithms and human reviewers. The system was meant to screen out videos that were unsuitable for children to view or that violated YouTube's terms of service.",
        "System Developer": "YouTube",
        "Sector of Deployment": "Arts, entertainment and recreation",
        "Relevant AI functions": ["Perception", "Cognition", "Action"],
        "AI Techniques": ["machine learning"],
        "AI Applications": ["content filtering", "decision support", "curation", "recommendation engine"],
        "Location": "Global",
        "Named Entities": ["Google", "YouTube", "YouTube Kids"],
        "Technology Purveyor": ["Google", "YouTube"],
        "Beginning Date": "2015-01-01",
        "Ending Date": "2018-12-31",
        "Near Miss": "Unclear/unknown",
        "Intent": "Accident",
        "Lives Lost": "No",
        "Data Inputs": "Videos"
      }
    }

2. The contents of the reports that the AIID team have associated with each incident.

I used AIID’s downloadable database, which contains all source reports, and then ran causality analysis using the AISI Inspect evaluation tool to maintain consistency with the approach described above (Chain of Thought prompting with a Top-P parameter value of 0.5). A Python script iterated through the 50 selected incidents, creating up to 4 causes in each of the category areas (Technology, Data Inputs, Human Factors, Process and Methods, Regulatory Environment, and Management). This allows the creation of an Ishikawa or Fishbone diagram representing the potential causes identified for each incident.

Example Ishikawa Diagram for AIID Incident 1

The Ishikawa diagrams were saved in JSON format and passed as inputs for Cluster Analysis.

Example JSON object derived using LLM analysis of raw incident reports:

{
  "incident": 1,
  "causes": [
    {
      "Technology": [
        "Algorithm fails to filter content",
        "Lack of effective content moderation",
        "Inadequate machine learning training",
        "Open platform vulnerabilities"
      ]
    },
    {
      "Data Inputs": [
        "Inappropriate content in video uploads",
        "Misleading video thumbnails",
        "Inaccurate video metadata",
        "Keyword manipulation by bad actors"
      ]
    },
    {
      "Human Factors": [
        "Parents not monitoring content",
        "Users exploiting platform for profit",
        "Inadequate human review process",
        "User flagging system dependency"
      ]
    },
    {
      "Process And Methods": [
        "Slow content review process",
        "Inconsistent enforcement of guidelines",
        "Delayed removal of flagged content",
        "Reactive rather than proactive measures"
      ]
    },
    {
      "Regulatory Environment": [
        "Lack of stringent content regulations",
        "Insufficient oversight by authorities",
        "Inadequate penalties for violations",
        ""
      ]
    },
    {
      "Management": [
        "Insufficient investment in moderation",
        "Focus on revenue over safety",
        "Slow response to content issues",
        "Inadequate parental control features"
      ]
    }
  ]
}

Cluster Analysis Methodology

The data is loaded by the Python program, flattened and merged into a dataframe. The causes from the Ishikawa diagrams are then converted into embeddings using Sentence-BERT, and Principal Component Analysis (PCA) is used to reduce the dimensionality of the embeddings, creating a vector for each identified cause. K-Means clustering is then performed to group the vectors. A minimal sketch of this pipeline is shown below.
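
The sketch below illustrates this pipeline, assuming the sentence-transformers and scikit-learn libraries; the embedding model name, the number of PCA components, the cluster count and the df["cause"] column are illustrative assumptions.

# Sketch of the clustering pipeline: Sentence-BERT embeddings -> PCA -> K-Means.
# The embedding model name, PCA dimensionality and column names are assumptions.
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

causes = df["cause"].tolist()  # flattened cause descriptions from the Ishikawa JSON files

# Encode each cause description as a dense sentence embedding
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(causes)

# Reduce dimensionality so that each cause is represented by a compact vector
pca = PCA(n_components=2)
vectors = pca.fit_transform(embeddings)

# Group the vectors into clusters
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(vectors)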

I explored different numbers of clusters (between 3 and 8), and the program calculated a Silhouette score (a measure of how well clustered the vectors are) to select the number of clusters resulting in the highest score (see the sketch below). Earlier attempts using TF-IDF vectorisation showed very strong clustering, including with some higher numbers of clusters, but it became apparent that these clusters were frequently driven by key words in the cause descriptions (such as "Lack of" or "Insufficient") rather than the underlying meaning. This was not insightful and led me to replace TF-IDF with Sentence-BERT.
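
The selection of the cluster count by Silhouette score could look like the following sketch, reusing the PCA-reduced vectors from the previous block.

# Sketch: choosing the number of clusters (3-8) that maximises the Silhouette score,
# reusing the PCA-reduced 'vectors' from the previous sketch.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

best_k, best_score = None, -1.0
for k in range(3, 9):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(vectors)
    score = silhouette_score(vectors, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"Best number of clusters: {best_k} (Silhouette score: {best_score:.2f})")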

The program generates a scatter plot for the 'Overall' dataset and then for each of the cause categories. It also produces a CSV file of the individual causes associated with the vectors in each cluster, and a stacked bar chart visualising the distribution of clusters within each of the Harm Types from the taxonomy data (a sketch of this chart generation is shown below).
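
The Harm Type chart can be produced from a cross-tabulation of Harm Type against cluster label; the sketch below assumes pandas and matplotlib, with the harm_type and cluster column names as illustrative assumptions.

# Sketch: stacked bar chart of cluster distribution within each Harm Type.
# Column names ('harm_type', 'cluster') are illustrative assumptions.
import pandas as pd
import matplotlib.pyplot as plt

df["cluster"] = labels  # cluster labels from the K-Means step above

counts = pd.crosstab(df["harm_type"], df["cluster"])
counts.plot(kind="bar", stacked=True)
plt.xlabel("Harm Type")
plt.ylabel("Number of causes")
plt.title("Distribution of clusters within each Harm Type")
plt.tight_layout()
plt.savefig("harm_type_cluster_distribution.png")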

Results

The full results can be found in the Appendix.

Silhouette Scores

The analysis produced clustering with Silhouette scores between 0.35 and 0.45.

Interpretation of Silhouette Scores

Good: Greater than 0.5 would indicate strong clustering, with good separation between clusters

Moderate: Scores between 0.2-0.5 suggest the points are not ideally clustered but still form some meaningful clusters

Low: Below 0.2 suggests there may not be any meaningful clustering

The scores calculated from this analysis are all at the upper end of 'moderate'. Further work would be required to understand whether improvements to the methodology could provide a rigorous route to observing stronger clustering.

An LLM was used to summarise the common theme of each cluster from the list of individual causes within it. Full results, including Harm Type distributions, are linked in the Appendix below.

Overall - Silhouette Score 0.35

The ‘Overall’ dataset contains all the causes across all 6 categories - a total of 893 identified and vectorised. As such it is hardly surprising that the clustering here is relatively weak with a Silhouette score of just 0.35.

The LLM summarises the themes of each cluster below. As can be seen from the scatter plot, there is no clear separation between the identified clusters; unsurprisingly, the assignment of vectors to one cluster or another is in numerous cases ambiguous.

Harm Types are well distributed across the 'Overall' dataset, so little insight can be gained from analysing the clustering of Harm Types without digging into the more refined samples for the different cause categories below:

Technology - Silhouette Score: 0.40

The 'Technology' dataset shows somewhat clearer clustering, with some variance and separation between clusters visible on the plot.

Of note in the Harm Type distribution is that Cluster 1 is less frequently associated with Economic, Financial and Physical harm but relatively more with discrimination and social/psychological harm.

Data Inputs - Silhouette Score: 0.45

We can see some clustering and separation between clusters looking at Data Inputs:

Cluster 2 is made up of fewer vectors, and therefore is associated with fewer causes. It is also less frequently associated with Physical, Financial or Economic harm than the other Harm Types.

Human Factors - Silhouette Score: 0.37

The relatively low score indicates less clear clustering and separation than for some of the other cause categories.

Process and Methods - Silhouette Score: 0.42

The separation, particularly of cluster 2 is reasonable.

Cluster 2 is associated more frequently with discrimination and psychological harm and is not at all associated with financial or physical harm.

Regulatory Environment - Silhouette Score: 0.45

Regulatory Environment displays reasonable clustering and separation between vectors, as reflected by its score of 0.45.

Cluster 2 is focused on lack of regulations whereas cluster 1 is associated with compliance.

Cluster 0 (General Regulatory Gaps): This cluster focuses on broad regulatory shortcomings and lack of oversight in various aspects of AI and technology.

Cluster 1 (Specific Legal and Compliance Issues): This cluster highlights specific legal, compliance, and enforcement issues related to AI and its applications in different sectors.

Cluster 2 (AI-Specific Regulatory Challenges): This cluster emphasizes the lack of AI-specific regulations and frameworks to govern the development, deployment, and use of AI technologies.

Management: Silhouette Score: 0.39

Although the separation of cluster 0 is reasonable, the score of 0.39 is relatively low.

Cluster 1 is more frequently associated with causes in the Discrimination category than the other clusters. It also happens to be the only cluster with causes associated with Financial harm.

Cluster Analysis Conclusions

This has been a valuable introduction to the application of cluster analysis to RCA of AI Safety Incidents; however, it has also illustrated that there are huge opportunities for further work in at least four areas:

  • Deeper analysis of the results from this study to gain further insights into correlations between clusters and other taxonomic classifications

  • Reviewing the methodology used, identifying shortcomings and opportunities for improvement to add rigour and confidence in the results

  • Increasing the sample size and exploring other datasets.

  • Identifying pertinent and insightful questions on which to focus where knowledge about clustering could have the most impact on policy decisions.

This work has brought to light the inherent ambiguity of Natural Language Processing: it is evident that the exact wording of the causes being used as inputs for the analysis has an arbitrary element if generated by an LLM based on a news report. Implementing a comprehensive taxonomy of causes so that wording is standardised would reduce the potential for things with similar meanings to be interpreted differently and is worthy of further investigation.

Looking at the clustering of the causes (full lists attached in the Appendix), it is apparent that the 'summary classification' for each cluster is in some cases ambiguous or vague. It would be of value to explore ways to make this summary classification more robust, although if more clearly defined clusters were identified in future datasets, there would inherently be less ambiguity.

Future real-world utility

To establish the value of this type of analysis and to shape the direction of this work towards being useful in the real world, I shared the premise with some members of DSIT’s AI Regulation, Policy Advice and Risk Teams and asked them to point to some specific questions where this type of analysis may be useful in providing insights to support current or planned work. The following questions were identified:

- Which causes are most frequently critical to multiple different Severe incidents?

- Which causes are co-dependent?

- Which causes are frequently critical to risks in very different domains (e.g. causes that link harms from malfunction, diffuse harms from widespread use, harms from criminality and misuse)?

This suggests a potential pathway for future work. It would also be of great interest to look for correlations between certain cause clusters and financial cost of incidents, should that data be made available.

Shortcomings of Root Cause Analysis

RCA commonly focuses on trying to find one single or dominant Root Cause. For incidents arising in complex systems, there will frequently be a number of causes, and the event is the result of interactions between them and emergent behaviour. For this reason, Five Whys analysis is unlikely to provide adequate insights into causality, as it is focused on a single dimension. Fault Tree Analysis improves on Five Whys by analysing how multiple base events work together in combination (AND / OR logic gates). It can be useful for uncovering some high-level insights, but it is still reductionist and, owing to the complexity of systems involving AI, will frequently result in oversimplifications that may cause some dependencies to be overlooked. This will be exacerbated as systems increase in complexity and data becomes sparser, particularly as we look ahead to try to understand future events.

System Theoretic Process Analysis (STPA)

STPA is a Systems Safety framework developed by Nancy Leveson at MIT. It addresses some of the challenges associated with RCA and is suitable for analysing extremely complex systems with many unknowns and where risk comes from emergent behaviours. Dobbe (2022) discusses some potential tools and approaches to AI safety, based around STPA.

Also developed by Leveson, STAMP (Systems Theoretic Accident Model and Processes) and CAST (Causal Analysis Based on STAMP) offer systematic methods to model such complex systems. An opportunity exists to apply the CAST framework to real-world historic AI safety incidents in order to increase learning from them.

Looking ahead to modelling potential future hazards with a view to evaluating risk mitigation and harm reduction measures, the STPA approach can be used to identify hazards, define safety constraints and model the control structures of the system, which then informs the identification of unsafe control actions. By following this process, risk analysts would be better placed to understand the causal scenarios that could lead to the unsafe control actions, and from there design and implement systems to prevent them occurring. I am very keen to investigate how Language Models can be used as a tool to support such analyses, in a way that increases scalability and improves objectivity compared with what a purely human research team could achieve.

In any case, the nature of AI risk makes it likely that existing frameworks from other domains will need to be adapted to address the novel idiosyncrasies of AI and the pace at which it is evolving.

Appendix
