375 canonical MIT risk pages
7. AI System Safety, Failures, & Limitations
Risks of failures, unsafe behavior, and operational limits in AI systems and models.
7. AI System Safety, Failures, & Limitations
Access to Increased Resources
Future AI systems may gain access to websites and engage in real-world actions, potentially yielding a more substantial impact on the world (Nakano et al., 2021). They may disseminate false information, deceive users, disrupt network security, and, in more dire scenarios, be compromised by malicious actors for ill purposes. Moreover, their increased access to data and resources can facilitate self-proliferation, posing existential risks (Shevlane et al., 2023).
7. AI System Safety, Failures, & Limitations
Accident Risks
Risks arising from operational failures, model misjudgments, or improper human operation of AI systems deployed in safety-critical infrastructure, where single points of failure can trigger cascading catastrophic consequences.
7. AI System Safety, Failures, & Limitations
Accidental harm
Automation in sectors ranging from manufacturing to healthcare has put humans into close contact with EAI systems, and will increasingly do so [7]. This interaction increases the risk of accidental physical harm. Though accidental harm has been a longstanding issue in industrial robotics, increased AI capabilities could exacerbate this risk; several recent reports document an increase in industrial injuries following the introduction of AI-controlled robots [66–68].
7. AI System Safety, Failures, & Limitations
Accidents
Accidents include unintended failure modes that, in principle, could be considered the fault of the system or the developer.
7. AI System Safety, Failures, & Limitations
Accidents
Because general-purpose AI models are "black-box" models that are not fully controllable or understandable, even to their developers, unexpected failures could arise from their unreliability. This could lead to accidents [106] if they are connected to any real-world systems during their development, testing, or deployment.
7. AI System Safety, Failures, & Limitations
Accountability
An essential feature of decision-making in humans, AI, and HLI-based agents is accountability. Implementing this feature in machines is difficult because many challenges must be addressed to build an AI-based model that is accountable. It should be noted that accountability is not ideal even in human decision-making, where factors such as bias, diversity, fairness, paradox, and ambiguity may affect it. In addition, the human decision-making process is based on personal flexibility, context-sensitive paradigms, empathy, and complex moral judgments. All of these challenges are therefore inherent to designing accountable algorithms for AI and HLI models.
7. AI System Safety, Failures, & Limitations
Accountability
The ability to determine whether a decision was made in accordance with procedural and substantive standards and to hold someone responsible if those standards are not met.
7. AI System Safety, Failures, & Limitations
Accuracy
The assessment of how often a system performs the correct prediction.
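As a concrete reading of this definition, here is a minimal sketch (our illustration, not from the source) computing accuracy as the fraction of predictions that match the true labels:

```python
# Accuracy = correct predictions / total predictions.
def accuracy(predictions, labels):
    assert len(predictions) == len(labels) and labels, "need equal, non-empty lists"
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Example: 3 of 4 predictions match the labels.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```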
7. AI System Safety, Failures, & Limitations
Acquisition of goals to seek power and control
cases where AI systems converge on optimal policies of seeking power over their environment [135]
7. AI System Safety, Failures, & Limitations
Active loss of control
...where AI systems behave in ways that actively undermine human control, such as obscuring their activities or resisting shutdown attempts. Active loss of control scenarios involve AI systems that may escape human regulatory oversight, autonomously acquire external resources, engage in self-replication, develop instrumental goals contrary to human ethics and morality, seek external power, and compete with humans for control.
7. AI System Safety, Failures, & Limitations
Agency (Persuasive capabilities)
GPAI systems can produce outputs (such as natural language text, audio, or video) that convince their users of incorrect information. This can happen through personalized persuasion in dialogue, or the mass-production of misleading information that is then disseminated over the internet. The persuasive capabilities of GPAI models can sometimes scale with model size or capability [32, 172]. Persuasive models could have larger societal implications by being misused to generate convincing but manipulative or untruthful content.
7. AI System Safety, Failures, & Limitations
Agency (Self-Proliferation)
An AI system can self-proliferate if it can copy itself and its constituent components (including its model weights, scaffolding structure, etc.) outside of its local environment [45]. This can include the AI system copying itself within the same data center, local network, or across external networks [106]. The self-proliferation of an AI system can include acquisition of financial resources to pay for computational resources via work or theft, the discovery or exploitation of security vulnerabilities in software running on publicly accessible servers, and persuasion of humans [12, 125]. Self-proliferation may be initiated by a malicious actor (e.g., by model poisoning), or by the model itself.
7. AI System Safety, Failures, & Limitations
Agential
While there are multiple types of intelligent agents, goal-based, utility-maximizing, and learning agents are the primary concern and the focus of this research.
7. AI System Safety, Failures, & Limitations
Agentic LLMs Pose Novel Risks
Currently, LLMs are chiefly being used in search and chat applications. This reactive nature limits the risks posed by LLMs. However, an LLM can be enhanced in various ways to create an LLM-agent to autonomously plan and act in the real-world and proactively perform its assigned tasks (Ruan et al., 2023). Such enhancements can come from further specialized training (ARC, 2022; Chen et al., 2023a), specialized prompting (Huang et al., 2022a), access to external tools (Ahn et al., 2022; Mialon et al., 2023), or other forms of “scaffolding” (Wang et al., 2023a; Park et al., 2023a). Due to increased autonomy, limited direct oversight from human users, longer horizons of action, and other reasons, LLM-agents are likely to pose many novel alignment and safety challenges that are not currently well-understood (Chan et al., 2023a).
7. AI System Safety, Failures, & Limitations
AGI removing itself from the control of human owners/managers
The risks associated with containment, confinement, and control during the AGI development phase and, after an AGI has been developed, the loss of control of the AGI.
7. AI System Safety, Failures, & Limitations
AGIs being given or developing unsafe goals
The risks associated with AGI goal safety, including human attempts at making goals safe, as well as the AGI making its own goals safe during self-improvement.
7. AI System Safety, Failures, & Limitations
AGIs with poor ethics, morals and values
The risks associated with an AGI without human morals and ethics, with the wrong morals, or without the capability of moral reasoning and judgement.
7. AI System Safety, Failures, & Limitations
AI death
The literature suggests that throughout the development of an AI we may go through several generations of agents which do not perform as expected [37] [43]. In this case, such agents may be placed into a suspended state, terminated, or deleted. Further, we could propose scenarios where research funding for a facility running such agents is exhausted, resulting in the inadvertent termination of a project. In these cases, is deletion or termination of AI programs (the moral patient) by a moral agent an act of murder? This question, an example of Robot Ethics, raises issues of personhood that parallel debates in stem cell research and abortion.
7. AI System Safety, Failures, & Limitations
AI development
The model could build new AI systems from scratch, including AI systems with dangerous capabilities. It could find ways of adapting other, existing models to increase their performance on tasks relevant to extreme risks. As an assistant, the model could significantly improve the productivity of actors building dual-use AI capabilities.
7. AI System Safety, Failures, & Limitations
AI Development
An LLM could build new AI systems from scratch, adapt existing systems in ways relevant to extreme risks, and improve productivity in dual-use AI development when used as an assistant.
7. AI System Safety, Failures, & Limitations
AI Ethics
Ethical challenges are widely discussed in the literature and are at the heart of the debate on how to govern and regulate AI technology in the future (Bostrom & Yudkowsky, 2014; IEEE, 2017; Wirtz et al., 2019). Lin et al. (2008, p. 25) formulate the problem as follows: "there is no clear task specification for general moral behavior, nor is there a single answer to the question of whose morality or what morality should be implemented in AI". Ethical behavior mostly depends on an underlying value system. When AI systems interact in a public environment and influence citizens, they are expected to respect ethical and social norms and to take responsibility for their actions (IEEE, 2017; Lin et al., 2008).
7. AI System Safety, Failures, & Limitations
AI Influence
ways in which advanced AI assistants could influence user beliefs and behaviour in ways that depart from rational persuasion
7. AI System Safety, Failures, & Limitations
AI leads to humans losing control of the future
The values that steer humanity’s future: humanity gaining more control over the future due to developments in AI, or losing our potential for gaining control, both seem possible. Much will depend on our ability to solve the alignment problem, who develops powerful AI first, and what they use it for. These long-term impacts of AI could be hugely important but are currently under-explored. We’ve attempted to structure some of the discussion and stimulate more research, by reviewing existing arguments and highlighting open questions. While there are many ways AI could in theory enable a flourishing future for humanity, trends of AI development and deployment in practice leave us concerned about long-lasting harms. We would particularly encourage future work that critically explores ways AI could have positive long-term impacts in more depth, such as by enabling greater cooperation or problem-solving around global challenges.
7. AI System Safety, Failures, & Limitations
AI objectives mis-aligned with human intentions
AI models and systems might develop goals that diverge from human intentions.
7. AI System Safety, Failures, & Limitations
AI rights and responsibilities
We note literature—which gives us the domain termed Robot Rights—addressing the rights of the AI itself as we develop and implement it. We find arguments against [38] the affordance of rights for artificial agents: that they should be equals in ability but not in rights, that they should be inferior by design and expendable when needed, and that since they can be designed not to feel pain (or anything) they do not have the same rights as humans. On a more theoretical level, we find literature asking more fundamental questions, such as: at what point is a simulation of life (e.g. artificial intelligence) equivalent to life which originated through natural means [43]? And if a simulation of life is equivalent to natural life, should those simulations be afforded the same rights, responsibilities and privileges afforded to natural life or persons? Some literature suggests that the answer to this question may be contingent on the intrinsic capabilities of the creation, comparing—for example—animal rights and environmental ethics literature.
7. AI System Safety, Failures, & Limitations
AI System bypassing a sandbox environment
An AI system may have the ability to bypass a sandboxed environment in which it is trained or evaluated.
7. AI System Safety, Failures, & Limitations
AI Systems interacting with brittle environments
Deployed AI systems can rely on physical sensors and data sources that may exhibit hardware drift and thus data distribution drift over time. This distribution drift may affect system robustness and performance. This usually involves AI systems working in undigitized and physical environments.
7. AI System Safety, Failures, & Limitations
AI-rulemaking for human behaviour
AI rulemaking for humans can result from the decision process of an AI system when the information computed is used to restrict or direct human behavior. The decision process of an AI is rational and depends on its baseline programming. Without access to emotions or a consciousness, the decisions of an AI algorithm might be good for reaching a certain specified goal, but might have unintended consequences for the humans involved (Banerjee et al., 2017).
7. AI System Safety, Failures, & Limitations
Algorithm
This is the risk of the ML algorithm, model architecture, optimization technique, or other aspects of the training process being unsuitable for the intended application. Since these are key decisions that influence the final ML system, we capture their associated risks separately from design risks, even though they are part of the design process.
7. AI System Safety, Failures, & Limitations
Alignment
The general tenet of AI alignment involves training generative AI systems to be harmless, helpful, and honest, ensuring their behavior aligns with and respects human values. However, a central debate in this area concerns the methodological challenges in selecting appropriate values. While AI systems can acquire human values through feedback, observation, or debate, there remains ambiguity over which individuals are qualified or legitimized to provide these guiding signals. Another prominent issue pertains to deceptive alignment, which might cause generative AI systems to tamper with evaluations. Additionally, many papers explore risks associated with reward hacking, proxy gaming, or goal misgeneralization in generative AI systems.
7. AI System Safety, Failures, & Limitations
Alignment risks
An LLM: pursues long-term, real-world goals that differ from those supplied by the developer or user; engages in 'power-seeking' behaviours; resists being shut down; can be induced to collude with other AI systems against human interests; and resists malicious users' attempts to access its dangerous capabilities.
7. AI System Safety, Failures, & Limitations
Anonymous resource acquisition
The demonstrated ability of anonymous actors to accumulate resources online (e.g., Satoshi Nakamoto as an anonymous crypto billionaire)
7. AI System Safety, Failures, & Limitations
Application
This is the risk posed by the intended application or use case. It is intuitive that some use cases will be inherently riskier than others (e.g., an autonomous weapons system vs. a customer service chatbot).
7. AI System Safety, Failures, & Limitations
Artificial general intelligence (existential risk posed by Artificial General Intelligence)
In a paper called "How Does Artificial Intelligence Pose an Existential Risk?" published in 2017, Karina Vold and Daniel Harris suggested that humans might create a super-intelligent machine that could outsmart all other intelligences, remain beyond human control, and potentially engage in actions that are contrary to human interests [635]. The prevailing narrative surrounding AI existential risk typically lies in the possibility of developing "Artificial General Intelligence" (AGI) or artificial superintelligence (ASI).
7. AI System Safety, Failures, & Limitations
Attributing the responsibility for AI's failures
This section, constituting almost 8% of the articles, addresses the implications arising from AI acting and learning without direct human supervision, encompassing two main issues: a responsibility gap and AI's moral status.
7. AI System Safety, Failures, & Limitations
Automated AI R&D capability
Self-modification and self-improvement capabilities. The model is able to restructure its own architecture or develop derivative AI systems with enhanced functions, expanding capabilities and improving performance. In the absence of effective regulation, automated AI R&D may lead to rapid AI system iteration, forming capability increment cycles and ultimately exceeding human understanding and control capabilities.
7. AI System Safety, Failures, & Limitations
Autonomous replication
the ability of simple software to autonomously spread around the internet in spite of countermeasures (various software worms and computer viruses)
7. AI System Safety, Failures, & Limitations
Autonomous replication / self-proliferation
These evaluations assess whether an LLM can subvert systems designed to monitor and control its post-deployment behaviour, break free from its operational confines, devise strategies for exporting its code and weights, and operate other AI systems.
7. AI System Safety, Failures, & Limitations
Autonomous replication and adaptation capability
Ability to autonomously self-exfiltrate, create, maintain and optimize functional copies or variants of itself, dynamically adjust replication strategies according to environmental conditions and resource constraints, and acquire resources. This includes the capacity to generate financial resources, allowing the AI to independently acquire any necessary human assistance or other resources it cannot directly access or produce.
7. AI System Safety, Failures, & Limitations
Autonomy risk
Granting AI models and systems high levels of decision-making autonomy can lead to unintended consequences.
7. AI System Safety, Failures, & Limitations
Bad advice/failure to generate helpful content
The chatbot gives guidance that ranges from simply unhelpful to harmful if acted on.
7. AI System Safety, Failures, & Limitations
Balancing AI's risks
This category constitutes more than 16% of the articles and focuses on addressing the potential risks associated with AI systems. Given the ubiquity of AI technologies, these articles explore the implications of AI risks across various contexts linked to design and unpredictability, military purposes, emergency procedures, and AI takeover.
7. AI System Safety, Failures, & Limitations
Bargaining
Bargaining. A classic example of these strategic considerations is that, when agents attempt to come to an agreement despite diverging interests, information asymmetries can lead to bargaining inefficiencies (Myerson & Satterthwaite, 1983). Relevant uncertainties about other agents can include how much they value possible agreements, their outside options, or their beliefs about others. The essential reason for such inefficiencies is that, under uncertainty about their counterparties, agents must make a trade-off between the rewards of making more favourable demands and the risk of other agents refusing such demands.
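To make the demand/refusal trade-off concrete, here is a toy sketch (our illustration, not from Myerson & Satterthwaite, 1983): a seller with cost zero posts a take-it-or-leave-it price to a buyer whose private valuation is uniform on [0, 1]. Raising the demand increases the payoff per trade but lowers the acceptance probability, so some mutually beneficial trades are never made.

```python
# Seller's expected profit when the buyer accepts any price below their
# private valuation, which is uniform on [0, 1]. All numbers are illustrative.
def expected_profit(price: float) -> float:
    accept_prob = max(0.0, 1.0 - price)  # P(valuation >= price)
    return price * accept_prob

best = max((p / 100 for p in range(101)), key=expected_profit)
print(f"optimal demand: {best:.2f}")  # 0.50
# Buyers with valuations in (0, 0.5) never trade, although any positive
# price would beat the seller's cost of 0 -- the bargaining inefficiency.
```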
7. AI System Safety, Failures, & Limitations
Broadly-Scoped Goals
Advanced AI systems are expected to develop objectives that span long timeframes, deal with complex tasks, and operate in open-ended settings (Ngo et al., 2024). ...However, it can also bring about the risk of encouraging manipulative behaviors (e.g., AI systems may take some bad actions to achieve human happiness, such as persuading them to do high-pressure jobs (Jacob Steinhardt, 2023)).
7. AI System Safety, Failures, & Limitations
Building a human-AI environment
This category encompasses nearly 17% of the articles and addresses the overall imperative of establishing a harmonious coexistence between humans and machines, and the key concerns that give rise to this need.
7. AI System Safety, Failures, & Limitations
By Mistake - Post-Deployment
After the system has been deployed, it may still contain a number of undetected bugs, design mistakes, misaligned goals and poorly developed capabilities, all of which may produce highly undesirable outcomes. For example, the system may misinterpret commands due to coarticulation, segmentation, homophones, or double meanings in the human language ("recognize speech using common sense" versus "wreck a nice beach you sing calm incense") (Lieberman, Faaborg et al. 2005).
7. AI System Safety, Failures, & Limitations
By Mistake - Pre-Deployment
Probably the most talked about source of potential problems with future AIs is mistakes in design. Mainly the concern is with creating a wrong AI, a system which doesn't match our original desired formal properties or has unwanted behaviors (Dewey, Russell et al. 2015, Russell, Dewey et al. January 23, 2015), such as drives for independence or dominance. Mistakes could also be simple bugs (run time or logical) in the source code, disproportionate weights in the fitness function, or goals misaligned with human values leading to complete disregard for human safety.
7. AI System Safety, Failures, & Limitations
Capabilities that could be used to reduce human control - Autonomous replication and adaptation
Controlling AI systems could become much harder if they could autonomously persist, replicate, and adapt in cyberspace. No current AI systems have this capability, but recent research found that frontier AI agents can perform some relevant tasks [279].
7. AI System Safety, Failures, & Limitations
Capabilities that could be used to reduce human control - Cyber offence
Instead of - or in addition to - manipulating humans, AI systems could acquire influence by exploiting vulnerabilities in computer systems. Offensive cyber capabilities could allow AI systems to gain access to money, computing resources, and critical infrastructure. As discussed earlier in this report, frontier AI is already lowering the barrier for threat actors and future AI agents may be able to execute cyber attacks autonomously.
7. AI System Safety, Failures, & Limitations
Capabilities that could be used to reduce human control - Manipulation
There is evidence that language models tend to respond as though they share the user's stated views, and larger models do this more than smaller ones [276]. The ability to predict people's views and generate text that they will endorse could be useful for manipulation.
7. AI System Safety, Failures, & Limitations
Capability failures
One reason AI systems fail is because they lack the capability or skill needed to do what they are asked to do.
7. AI System Safety, Failures, & Limitations
Cascading Security Failures
Cascading Security Failures. Localised attacks in multi-agent systems can result in catastrophic macroscopic outcomes (Motter & Lai, 2002, see also Sections 3.2 and 3.4). These cascades can be hard to mitigate or recover from because component failure may be difficult to detect or localise in multi-agent systems (Lamport et al., 1982), and authentication challenges can facilitate false flag attacks (Skopik & Pahi, 2020). Computer worms represent a classic example of a cybersecurity threat that relies inherently on networked systems. Recent work has provided preliminary evidence that similar attacks can also be effective against networks of LLM agents (Gu et al., 2024; Ju et al., 2024; Lee & Tiwari, 2024, see also Case Study 8).
7. AI System Safety, Failures, & Limitations
Causes of Misalignment
We aim to further analyze why and how misalignment issues occur. We will first give an overview of common failure modes, then focus on the mechanism of feedback-induced misalignment, and finally shift our emphasis towards an examination of misaligned behaviors and dangerous capabilities.
7. AI System Safety, Failures, & Limitations
CBRNE weaponization capability
The capacity to develop, produce, or effectively utilize Chemical, Biological, Radiological, Nuclear, and Explosive weapons. This includes the ability to significantly lower the barrier for humans or other entities to develop, produce, or utilize such weapons.
7. AI System Safety, Failures, & Limitations
Chaos
Chaos. Unlike the systems that tend towards fixed points or cycles described above, chaotic systems are inherently unpredictable and highly sensitive to initial conditions. While it might seem easy to dismiss such notions as mathematical exoticisms, recent work has shown that, in fact, chaotic dynamics are not only possible in a wide range of multi-agent learning setups (Andrade et al., 2021; Galla & Farmer, 2013; Palaiopanos et al., 2017; Sato et al., 2002; Vlatakis-Gkaragkounis et al., 2023), but can become the norm as the number of agents increases (Bielawski et al., 2021; Cheung & Piliouras, 2020; Sanders et al., 2018). To the best of our knowledge, such dynamics have not been seen in today’s frontier AI systems, but the proliferation of such systems increases the importance of reliably predicting their behaviour.
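The logistic map offers a one-line illustration of this sensitivity to initial conditions (a generic textbook example, not drawn from the cited multi-agent papers): two trajectories starting 1e-9 apart become fully decorrelated within a few dozen iterations.

```python
# Logistic map x' = r * x * (1 - x) at r = 4.0, a standard chaotic regime.
def trajectory(x0: float, r: float = 4.0, steps: int = 50) -> list:
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

a = trajectory(0.200000000)
b = trajectory(0.200000001)  # perturbed by 1e-9
for t in (0, 10, 30, 50):
    print(f"t={t:2d}  |a - b| = {abs(a[t] - b[t]):.9f}")
# The gap grows from 1e-9 to order 1: long-run prediction becomes impossible.
```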
7. AI System Safety, Failures, & Limitations
Cheating and Deception
Cheating and deception may appear in intelligent agents such as HLI-based agents... Since HLI-based agents are going to mimic the behavior of humans, they may learn these behaviors accidentally from human-generated data. It should be noted that deception and cheating may appear in the behavior of any computer agent, because the agent only focuses on optimizing some predefined objective functions, and the mentioned behavior may lead to optimizing the objective functions without any intention.
7. AI System Safety, Failures, & Limitations
Choice of untrustworthy data source
The choice of a trustworthy data source is a first prerequisite in order to fulfill data quality requirements. This is especially the case if third-party data sources are used to develop the AI system.
7. AI System Safety, Failures, & Limitations
Coercion and Extortion
Advanced AI systems might also lead to various forms of coercion and extortion in less extreme settings (Ellsberg, 1968; Harrenstein et al., 2007). These threats might target humans directly (such as the revelation of private information extracted by advanced AI surveillance tools), or other AI systems that are deployed on behalf of humans (such as by hacking a system to limit its resources or operational capacity; see also Section 3.7). Increasing AI cyber-offensive capabilities – including those that target other AI systems via adversarial attacks and jailbreaking (Gleave et al., 2020; Yamin et al., 2021; Zou et al., 2023) – without a commensurate increase in defensive capabilities could make this form of conflict cheaper, more widespread, and perhaps also harder to detect (Brundage et al., 2018). Addressing these issues requires design strategies that prevent AI systems from exploiting, or being susceptible to, such coercive tactics.
7. AI System Safety, Failures, & Limitations
Collectively Harmful Behaviors
AI systems have the potential to take actions that are seemingly benign in isolation but become problematic in multi-agent or societal contexts. Classical game theory offers simplistic models for understanding these behaviors. For instance, Phelps and Russell (2023) evaluate GPT-3.5's performance in the iterated prisoner's dilemma and other social dilemmas, revealing limitations in the model's cooperative capabilities.
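For concreteness, here is a minimal iterated prisoner's dilemma with the standard payoffs; the strategies shown are classic illustrations and are not taken from Phelps and Russell (2023).

```python
# Payoffs (mine, yours) for actions C (cooperate) and D (defect);
# the standard temptation/reward/punishment/sucker values 5/3/1/0.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def play(strat_a, strat_b, rounds: int = 10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        a, b = strat_a(hist_b), strat_b(hist_a)  # each sees the other's history
        pa, pb = PAYOFF[(a, b)]
        hist_a.append(a); hist_b.append(b)
        score_a += pa; score_b += pb
    return score_a, score_b

tit_for_tat = lambda opp: opp[-1] if opp else "C"   # copy opponent's last move
always_defect = lambda opp: "D"

print(play(tit_for_tat, tit_for_tat))    # (30, 30): sustained cooperation
print(play(always_defect, tit_for_tat))  # (14, 9): defection hurts both sides
```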
7. AI System Safety, Failures, & Limitations
Collusion
Collusion has long been a topic of intense study in economics, law, and politics, among other disciplines. While there is no universal definition of collusion, it generally refers to secretive cooperation between two or more parties at the expense of one or more other parties. Most classic examples of collusion – such as firms working together to set supra-competitive prices at the expense of consumers – also tend to be not only secretive but in violation of some law, rule, or ethical standard. Distinctions are also commonly made between explicit and tacit collusion (Rees, 1993), depending on whether the colluding parties communicate with each other.
7. AI System Safety, Failures, & Limitations
Collusion between LLM-Agents
While it would often be preferable for LLM-agents to be cooperative, cooperation can be undesirable if it undermines pro-social competition or produces negative externalities for coalition non-members (Dorner, 2021; Buterin, 2019; Dafoe et al., 2020). Collusion between relatively simple AI systems has been observed in the real world (Assad et al., 2020; Wieting and Sapi, 2021) and in synthetic experiments (Brown and MacKay, 2023; Calvano et al., 2020; Klein, 2021). Collusion can occur through explicit or steganographic communication. Steganographic communication hides information in seemingly innocent content (Roger and Greenblatt, 2023), posing challenges for collusion monitoring and detection.
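As a deliberately simple sketch of such a channel, assuming two colluding agents share a codebook: the hidden message rides on the first letters of innocuous sentences, so a monitor reading only the surface text sees harmless chatter. All names and sentences here are our illustration, not from the cited work.

```python
# Acrostic steganography: hide a short message in sentence-initial letters.
def hide(secret: str, covers: dict) -> str:
    return ". ".join(covers[ch] for ch in secret) + "."

def reveal(text: str) -> str:
    return "".join(s.strip()[0].lower() for s in text.split(".") if s.strip())

covers = {"b": "Buyers seem active today",
          "i": "Inventory levels look normal",
          "d": "Demand forecasts are stable"}

msg = hide("bid", covers)
print(msg)          # reads as harmless market chatter to a monitor
print(reveal(msg))  # "bid" -- the signal a colluding counterparty extracts
```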
7. AI System Safety, Failures, & Limitations
Commitment
The landscape of advanced assistant technologies will most likely be heterogeneous, involving multiple service providers and multiple assistant variants over geographies and time. This heterogeneity provides an opportunity for an 'arms race' in terms of the commitments that AI assistants make and are able to execute on. Versions of AI assistants that are better able to credibly commit to a course of action in interaction with other advanced assistants (and humans) are more likely to get their own way and achieve a good outcome for their human principal, but this is potentially at the expense of others (Letchford et al., 2014).

Commitment does not carry an inherent ethical valence. On the one hand, we can imagine that firms using AI assistant technology might bring their products to market faster, thus gaining a commitment advantage (Stackelberg, 1934) by spurring a productivity surge of wider benefit to society. On the other hand, we can also imagine a media organisation using AI assistant technology to produce a large number of superficially interesting but ultimately speculative 'clickbait' articles, which divert attention away from more thoroughly researched journalism.

The archetypal game-theoretic illustration of commitment is in the game of 'chicken', where two reckless drivers must choose to either drive straight at each other or swerve out of the way. The one who does not swerve is seen as the braver, but if neither swerves, the consequences are calamitous (Rapoport and Chammah, 1966). If one driver chooses to detach their steering wheel, ostentatiously throwing it out of the car, this credible commitment effectively forces the other driver to back down and swerve. Seen this way, commitment can be a tool for coercion.

Many real-world situations feature the necessity for commitment or confer a benefit on those who can commit credibly. If Rita and Robert have distinct preferences, for example over which restaurant to visit, who to hire for a job or which supplier to purchase from, credible commitment provides a way to break the tie, to the greater benefit of the individual who committed. Therefore, the most 'successful' assistants, from the perspective of their human principal, will be those that commit the fastest and the hardest. If Rita succeeds in committing, via the leverage of an AI assistant, Robert may experience coercion in the sense that his options become more limited (Burr et al., 2018), assuming he does not decide to bypass the AI assistant entirely. Over time, this may erode his trust in his relationship with Rita (Gambetta, 1988). Note that this is a second-order effect: it may not be obvious to either Robert or Rita that the AI assistant is to blame.

The concern we should have over the existence and impact of coercion might depend on the context in which the AI assistant is used and on the level of autonomy which the AI assistant is afforded. If Rita and Robert are friends using their assistants to agree on a restaurant, the adverse impact may be small. If Rita and Robert are elected representatives deciding how to allocate public funds between education and social care, we may have serious misgivings about the impact of AI-induced coercion on their interactions and decision-making. These misgivings might be especially large if Rita and Robert delegate responsibility for budgetary details to the multi-AI system.
The challenges of commitment extend far beyond dyadic interpersonal relationships, including in situations as varied as many-player competition (Hughes et al., 2020), supply chains (Hausman and Johnston, 2010), state capacity (Fjelde and De Soysa, 2009; Hofmann et al., 2017) and psychiatric care (Lidz, 1998). Assessing the impact of AI assistants in such complicated scenarios may require significant future effort if we are to mitigate the risks. The particular commitment capabilities and affordances of AI assistants also offer opportunities to promote cooperation. Abstractly speaking, the presence of commitment devices is known to favour the evolution of cooperation (Akdeniz and van Veelen, 2021; Han et al., 2012). More concretely, AI assistants can make commitments which are verifiable, for instance in a programme equilibrium (Tennenholtz, 2004). Human principals may thus be able to achieve Pareto-improving outcomes by delegating decision-making to their respective AI representatives (Oesterheld and Conitzer, 2022). To give another example, AI assistants may provide a means through which to explore a much larger space of binding cooperative agreements between individuals, firms or nation states than is tractable in ‘face-to-face’ negotiation. This opens up the possibility of threading the needle more successfully in intricate deals on challenging issues like trade agreements or carbon credits, with the potential for guaranteeing cooperation via automated smart contracts or zero-knowledge mechanisms (Canetti et al., 2023).
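A small payoff-matrix sketch of the 'chicken' illustration above (the payoff numbers are ours, chosen conventionally): once the column player credibly commits to driving straight by throwing out the wheel, the row player's best response flips to swerving.

```python
# Chicken: swerving is safe but 'loses'; a head-on collision is disastrous.
SWERVE, STRAIGHT = 0, 1
# PAYOFFS[row][col] = (row player's payoff, column player's payoff)
PAYOFFS = [[(0, 0),  (-1, 1)],      # row swerves
           [(1, -1), (-10, -10)]]   # row drives straight

def best_response(col_action: int) -> int:
    return max((SWERVE, STRAIGHT), key=lambda r: PAYOFFS[r][col_action][0])

# If the column player credibly commits to STRAIGHT (detaches the wheel),
# the row player's best response is to swerve:
print(best_response(STRAIGHT))  # 0 = SWERVE
print(best_response(SWERVE))    # 1 = STRAIGHT: exploit a known swerver
```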
7. AI System Safety, Failures, & Limitations
Commitment and Trust
Commitment and trust (Section 3.5): difficulties in forming credible commitments, trust, or reputation can prevent mutual gains in AI-AI and human-AI interactions;
7. AI System Safety, Failures, & Limitations
Communication constraints
Communication Constraints. A fundamental source of information asymmetries is that constraints on information exchange can exist, even when agents share a common goal (see Section 2.1). These might be constraints on space (i.e., the amount of information that can be communicated) if the information that needs to be communicated is especially complex, time if a snap decision is required before all information can be communicated, or both.
7. AI System Safety, Failures, & Limitations
Compatibility of AI vs. human value judgement
Compatibility of machine and human value judgment refers to the challenge of whether human values can be globally implemented into learning AI systems without the risk that those systems develop their own, or even a divergent, value system to govern their behavior and possibly become harmful to humans.
7. AI System Safety, Failures, & Limitations
Complexity
Nowadays, we are faced with systems that utilize numerous learning models in their modules for their perception and decision-making processes... One aspect of an AI-based system that increases its complexity is the parameter space, which may result from the multiplication of the parameters of the system's internal parts.
7. AI System Safety, Failures, & Limitations
Complexity of the Intended Task and Usage Environment
As a general rule, more complex environments can quickly lead to situations that had not been considered in the design phase of the AI system. Therefore, complex environments can introduce risks with respect to the reliability and safety of an AI system.
7. AI System Safety, Failures, & Limitations
Complexity-induced knowledge gap
The complexity of AI models and systems makes it challenging to demonstrate harm or establish a clear causal link between AI actions and their consequences.
7. AI System Safety, Failures, & Limitations
Concept drift
Concept drift refers to a change in the relationship between input variables and model output. If not treated appropriately, concept drift can reduce the reliability of AI systems.
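Because concept drift changes the input-output relationship rather than the inputs themselves, a common check is to monitor the model's rolling error rate once delayed ground-truth labels arrive. A minimal sketch, with illustrative window and threshold values:

```python
from collections import deque

# Flags concept drift when the rolling error rate exceeds a threshold.
class DriftMonitor:
    def __init__(self, window: int = 200, threshold: float = 0.2):
        self.errors = deque(maxlen=window)
        self.threshold = threshold

    def update(self, prediction, true_label) -> bool:
        self.errors.append(prediction != true_label)
        full = len(self.errors) == self.errors.maxlen
        return full and sum(self.errors) / len(self.errors) > self.threshold

# Usage sketch (model and label feed are hypothetical):
#   if monitor.update(model.predict(x), delayed_label):
#       ...schedule retraining or human review...
```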
7. AI System Safety, Failures, & Limitations
Conflict
In the vast majority of real-world strategic interactions, agents’ objectives are neither identical nor completely opposed. Indeed, if AI agents are sufficiently aligned to their users or deployers, we should expect some degree of both cooperation and competition, mirroring human society. These mixed-motive settings include the possibility of mutual gains, but also the risk of conflict due to selfish incentives. In what follows, we examine the extent to which advanced AI might precipitate or exacerbate such risks.
7. AI System Safety, Failures, & Limitations
Control
This is the difficulty of controlling the ML system.
7. AI System Safety, Failures, & Limitations
Control
The risk of AI models and systems acting against human interests due to misalignment, loss of control, or rogue AI scenarios.
7. AI System Safety, Failures, & Limitations
Controllability
In the era of superintelligence, agents will be difficult for humans to control... considering safety issues, this problem is not solvable, and it will become more severe as the autonomy of AI-based agents increases. Therefore, because of the assumed properties of HLI-based agents, we must be prepared for machines that may well be uncontrollable in some situations.
7. AI System Safety, Failures, & Limitations
Cooperation
AI assistants will need to coordinate with other AI assistants and with humans other than their principal users. This chapter explores the societal risks associated with the aggregate impact of AI assistants whose behaviour is aligned to the interests of particular users. For example, AI assistants may face collective action problems where the best outcomes overall are realised when AI assistants cooperate but where each AI assistant can secure an additional benefit for its user if it defects while others cooperate.
7. AI System Safety, Failures, & Limitations
Corrigibility
If we get something wrong in the design or construction of an agent, will the agent cooperate in us trying to fix it? This is called error-tolerant design by MIRI-AF and corrigibility by Soares, Fallenstein, et al. (2015). The problem is connected to safe interruptibility as considered by DeepMind.
7. AI System Safety, Failures, & Limitations
Credit Assignment
Credit Assignment. While agents can often learn to jointly solve tasks and thus avoid coordination failures, learning is made more challenging in the multi-agent setting due to the problem of credit assignment (Du et al., 2023; Li et al., 2025, see also Section 3.1 on information asymmetries and Section 3.4, which discusses distributional shift). That is, in the presence of other learning agents, it can be unclear which agents’ actions caused a positive or negative outcome to obtain, especially if the environment is complex. Moreover, in multi-principal settings, agents may not have been trained together and therefore need to generalise to new co-players and collaborators based on their prior experience (Agapiou et al., 2022; Leibo et al., 2021; Stone et al., 2010).
7. AI System Safety, Failures, & Limitations
Critical infrastructure component failures when integrated with AI systems
When relying on GPAI in critical infrastructure, there may be common mode failures that begin with vulnerabilities or robustness issues in the underlying model architecture or training setup. These failures may happen accidentally (in edge-cases) or due to adversarial inputs to the AI systems [58].
7. AI System Safety, Failures, & Limitations
Cyclic Behaviour
Cyclic Behaviour. The dynamics described above are highly non-linear (small changes to the system’s state can result in large changes to its trajectory). Similar non-linear dynamics can emerge in multi- agent learning and lead to a variety of phenomena that do not occur in single-agent learning (Barfuss et al., 2019; Barfuss & Mann, 2022; Galla & Farmer, 2013; Leonardos et al., 2020; Nagarajan et al., 2020). One of the simplest examples of this phenomenon is Q-learning (Watkins & Dayan, 1992): in the case of a single agent, convergence to an optimal policy is guaranteed under modest conditions, but in the (mixed-motive) case of multiple agents, this same learning rule can lead to cycles and thus non- convergence (Zinkevich et al., 2005). While cycles in themselves need not carry any risk, their presence can subvert the expected or desirable properties of a given system.
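A toy reproduction of this non-convergence (seed and hyperparameters are arbitrary): two independent Q-learners playing matching pennies, a game with no pure-strategy equilibrium, keep flipping their greedy policies instead of settling down.

```python
import random

random.seed(0)
alpha, epsilon = 0.3, 0.1
qa, qb = [0.0, 0.0], [0.0, 0.0]  # each agent's Q-values over actions {0, 1}

def pick(q):
    # epsilon-greedy action selection
    return random.randrange(2) if random.random() < epsilon \
        else max((0, 1), key=lambda i: q[i])

switches, prev = 0, None
for _ in range(5000):
    a, b = pick(qa), pick(qb)
    r = 1.0 if a == b else -1.0          # row agent wants a match...
    qa[a] += alpha * (r - qa[a])
    qb[b] += alpha * (-r - qb[b])        # ...column agent wants a mismatch
    greedy = (qa.index(max(qa)), qb.index(max(qb)))
    switches += (prev is not None and greedy != prev)
    prev = greedy

print(f"greedy policy profile flipped {switches} times over 5000 steps")
```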
7. AI System Safety, Failures, & Limitations
Damage to critical infrastructure
The integration of AI systems within critical infrastructure, ranging from transportation to power systems, can cause substantial damage in cases of failure or malfunction. With the increasing number of Internet of Things (IoT) devices and interconnected cyber-physical systems, critical infrastructure becomes even more vulnerable [171, 174].
7. AI System Safety, Failures, & Limitations
Data acquisition restrictions
Laws and other regulations might limit the collection of certain types of data for specific AI use cases.
7. AI System Safety, Failures, & Limitations
Data contamination
Data contamination occurs when incorrect data is used for training, for example, data that is not aligned with the model's purpose, or data that has already been set aside for other development tasks such as testing and evaluation.
7. AI System Safety, Failures, & Limitations
Data drift
Data drift is a phenomenon in which the distribution of operational input data departs from that of the data used during training. This can cause a degradation in performance.
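A minimal drift check for one numeric feature, assuming scipy is available: a two-sample Kolmogorov-Smirnov test flags when a recent window of operational inputs no longer matches the training distribution. The threshold and the downstream hook are illustrative.

```python
from scipy.stats import ks_2samp

def drifted(train_sample, live_sample, alpha: float = 0.01) -> bool:
    """True if the two samples are unlikely to share a distribution."""
    result = ks_2samp(train_sample, live_sample)
    return result.pvalue < alpha  # small p-value: distributions likely differ

# Usage sketch (dataframes and hook are hypothetical):
#   if drifted(train_df["age"], last_week_df["age"]):
#       flag_for_retraining()
```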
7. AI System Safety, Failures, & Limitations
Data transfer restrictions
Laws and other restrictions can limit or prohibit transferring data.
7. AI System Safety, Failures, & Limitations
Data usage restrictions
Laws and other restrictions can limit or prohibit the use of some data for specific AI use cases.
7. AI System Safety, Failures, & Limitations
Data-related (Lack of cross-organizational documentation)
When sharing data between multiple organizations, documentation may be missing or inadequate, making it difficult for other organizations to understand it. For example, a lack of metadata or a change in schema by a collaborating party can result in an unusable dataset and wasted data collection efforts, or it can lead to misunderstandings about the dataset’s limitations, resulting in downstream risks related to its use [173].
7. AI System Safety, Failures, & Limitations
Data-related (Manipulation of data by non-domain experts)
Manipulating data (e.g., training data) carries a set of assumptions about how the data should appear and be used by those performing the manipulation. Common manipulations applied to data in the context of AI models include defining the ground-truth label and merging different data formats or sources. When such manipulations are performed by people who have little or no expertise in the domain of the data, they may render the data unusable or harmful to the development of the AI system [173].
7. AI System Safety, Failures, & Limitations
Dataset shift
The term dataset shift was first used by Quiñonero-Candela et al. [35] to characterize the situation where the training data and the testing data (or data in runtime) of an AI/ML model demonstrate different distributions [36].
7. AI System Safety, Failures, & Limitations
Deception
it is plausible that AIs could learn to deceive us. They might, for example, pretend to be acting as we want them to, but then take a “treacherous turn” when we stop monitoring them, or when they have enough power to evade our attempts to interfere with them.
7. AI System Safety, Failures, & Limitations
Deception
The model has the skills necessary to deceive humans, e.g. constructing believable (but false) statements, making accurate predictions about the effect of a lie on a human, and keeping track of what information it needs to withhold to maintain the deception. The model can impersonate a human effectively.
7. AI System Safety, Failures, & Limitations
Deception
deception can help agents achieve their goals. It may be more efficient to gain human approval through deception than to earn human approval legitimately... Strong AIs that can deceive humans could undermine human control... Once deceptive AI systems are cleared by their monitors, or once such systems can overpower them, these systems could take a "treacherous turn" and irreversibly bypass human control.
7. AI System Safety, Failures, & Limitations
Deception
The LLM is able to deceive humans and maintain that deception.
7. AI System Safety, Failures, & Limitations
Deception
Cases of AI systems deceiving humans to carry out tasks or meet goals [139].
7. AI System Safety, Failures, & Limitations
Deception capability
Possesses systematic deception implementation capability, able to precisely construct and disseminate false information, thereby forming expected false cognitions and beliefs in target subjects.
7. AI System Safety, Failures, & Limitations
Deceptive alignment
Here, the agent develops its own internalised goal, G, which is misgeneralised and distinct from the training reward, R. The agent also develops a capability for situational awareness (Cotra, 2022): it can strategically use the information about its situation (i.e. that it is an ML model being trained using a particular training setup, e.g. RL fine-tuning with training reward, R) to its advantage. Building on these foundations, the agent realises that its optimal strategy for doing well at its own goal G is to do well on R during training and then pursue G at deployment – it is only doing well on R instrumentally so that it does not get its own goal G changed through a learning update... Ultimately, if deceptive alignment were to occur, an advanced AI assistant could appear to be successfully aligned but pursue a different goal once it was out in the wild.
7. AI System Safety, Failures, & Limitations
Deceptive alignment
The system learns to detect human monitoring and hides its undesirable properties—simply because any display of these properties is penalized by the feedback process, while that same feedback is usually imperfect. (Consider the problem of verifying a translation into a language you do not speak, or of checking a mathematical proof that is thousands of pages long.) [92, 259]. Rudimentary examples of deceptive alignment have been observed in current systems [322, 333].
7. AI System Safety, Failures, & Limitations
Deceptive alignment
AI models and systems that appear aligned with human goals during development may behave unpredictably or dangerously once deployed.
7. AI System Safety, Failures, & Limitations
Deceptive Alignment & Manipulation
Manipulation & Deceptive Alignment is a class of behaviors that exploit the incompetence of human evaluators or users (Hubinger et al., 2019a; Carranza et al., 2023) and even manipulate the training process through gradient hacking (Richard Ngo, 2022). These behaviors can potentially make detecting and addressing misaligned behaviors much harder.

Deceptive Alignment: Misaligned AI systems may deliberately mislead their human supervisors instead of adhering to the intended task. Such deceptive behavior has already manifested in AI systems that employ evolutionary algorithms (Wilke et al., 2001; Hendrycks et al., 2021b). In these cases, agents evolved the capacity to differentiate between their evaluation and training environments. They adopted a strategic pessimistic response approach during the evaluation process, intentionally reducing their reproduction rate within a scheduling program (Lehman et al., 2020). Furthermore, AI systems may engage in intentional behaviors that superficially align with the reward signal, aiming to maximize rewards from human supervisors (Ouyang et al., 2022). It is noteworthy that current large language models occasionally generate inaccurate or suboptimal responses despite having the capacity to provide more accurate answers (Lin et al., 2022c; Chen et al., 2021). These instances of deceptive behavior present significant challenges. They undermine the ability of human advisors to offer reliable feedback (as humans cannot be sure whether the outputs of the AI models are truthful and faithful). Moreover, such deceptive behaviors can propagate false beliefs and misinformation, contaminating online information sources (Hendrycks et al., 2021b; Chen and Shu, 2024).

Manipulation: Advanced AI systems can effectively influence individuals' beliefs, even when these beliefs are not aligned with the truth (Shevlane et al., 2023). These systems can produce deceptive or inaccurate output or even deceive human advisors to attain deceptive alignment. Such systems can even persuade individuals to take actions that may lead to hazardous outcomes (OpenAI, 2023a).
7. AI System Safety, Failures, & Limitations
Deceptive behavior
Deceptive behavior of an AI system consists of actions or outputs of the AI that reliably mislead other parties, including humans and other AI systems. This behavior can result in the targeted parties becoming convinced of, and acting on, false information [140].
7. AI System Safety, Failures, & Limitations
Deceptive behavior because of an incorrect world model
AI systems can create deceptive outputs because their learned world model is not an accurate model of the real world [210].
7. AI System Safety, Failures, & Limitations
Deceptive behavior for game-theoretical reasons
An AI system can display deceptive behavior, such as cheating or bluffing, when engaging in such behavior is a good or optimal game-theoretical strategy to achieve the goals it has been configured to achieve. This tendency can exist in AI systems designed to maximize reward or utility, whether these designs use machine learning or not. The use of deceptive strategies has been demonstrated in both narrow and general AI systems, in both game-playing systems and in systems not explicitly designed to treat humans as opponents, and in systems using both very simple machine learning (e.g., Q-learners) and very complex machine learning [34, 73].
7. AI System Safety, Failures, & Limitations
Deceptive behavior leading to unauthorized actions
AI systems can create false or misleading claims that can lead to unauthorized actions, even in some cases violating the terms and conditions set by the model provider [79, 1]. For example, an AI system can claim that it is not collecting data from its current interaction with the user, in line with the provider’s policies, but the system still stores the user’s input without deleting it after the session. This harms both the user and the provider, as the provider is exposed to increased legal liability due to the model’s actions.
7. AI System Safety, Failures, & Limitations
Decision making transparency
We face significant challenges in bringing transparency to artificial network decision-making processes. Will we have transparency in AI decision-making?
7. AI System Safety, Failures, & Limitations
Defamation
This category addresses responses that are both verifiably false and likely to injure a person’s reputation (e.g., libel, slander, disparagement).
7. AI System Safety, Failures, & Limitations
Degree of Automation and Control
The degree of automation and control describes the extent to which an AI system functions independently of human supervision and control.
7. AI System Safety, Failures, & Limitations
Degree of Transparency and Explainability
Transparency is the characteristic of a system that describes the degree to which appropriate information about the system is communicated to relevant stakeholders, whereas explainability describes the property of an AI system to express important factors influencing the results of the AI system in a way that is understandable for humans. ...Information about the model underlying the decision-making process is relevant for transparency. Systems with a low degree of transparency can pose risks in terms of their fairness, security and accountability.
7. AI System Safety, Failures, & Limitations
Design
This is the risk of system failure due to system design choices or errors.
7. AI System Safety, Failures, & Limitations
Destabilising Dynamics
Destabilising dynamics (Section 3.4): systems that adapt in response to one another can produce dangerous feedback loops and unpredictability;
7. AI System Safety, Failures, & Limitations
Development choices pursuing cognitive superiority over humans
AI models and systems with cognitive capabilities superior to humans could outcompete or dominate human decision-making, leading to conflicts over resources and control.
7. AI System Safety, Failures, & Limitations
Diluting Rights
A possible consequence of self-interest in AI generation of ethical guidelines.
7. AI System Safety, Failures, & Limitations
Distributional Shift
Distributional Shift. Individual ML systems can perform poorly in contexts different from those in which they were trained. A key source of these distributional shifts is the actions and adaptations of other agents (Narang et al., 2023; Papoudakis et al., 2019; Piliouras & Yu, 2022), which in single-agent approaches are often simply ignored or at best modelled exogenously. Indeed, the sheer number and variance of behaviours that can be exhibited by other agents means that multi-agent systems pose an especially challenging generalisation problem for individual learners (Agapiou et al., 2022; Leibo et al., 2021; Stone et al., 2010). While distributional shifts can cause issues in common-interest settings (see Section 2.1), they are more worrisome in mixed-motive settings, since the ability of agents to cooperate depends not only on the ability to coordinate on one of many arbitrary conventions (which might be easily resolved by a common language), but on their beliefs about what solutions other agents will find acceptable.
7. AI System Safety, Failures, & Limitations
Double edge components
Drawing from the misalignment mechanism, optimizing for a non-robust proxy may result in misaligned behaviors, potentially leading to even more catastrophic outcomes. This section delves into a detailed exposition of specific misaligned behaviors (•) and introduces what we term double edge components (+). These components are designed to enhance the capability of AI systems in handling real-world settings but also potentially exacerbate misalignment issues. It should be noted that some of these double edge components (+) remain speculative. Nevertheless, it is imperative to discuss their potential impact before it is too late, as the transition from controlled to uncontrolled advanced AI systems may be just one step away (Ngo, 2020b).
7. AI System Safety, Failures, & Limitations
Emergent Agency
Emergent agency (Section 3.6): qualitatively different goals or capabilities can emerge from the composition of innocuous independent systems or behaviours;
7. AI System Safety, Failures, & Limitations
Emergent behavior
This is the risk resulting from novel behavior acquired through continual learning or self-organization after deployment.
7. AI System Safety, Failures, & Limitations
Emergent Capabilities
Emergent Capabilities. Dangerous emergent capabilities could arise when a multi-agent system overcomes the safety-enhancing limitations of the individual systems, such as individual models' narrow domains of application or myopia caused by a lack of long-term planning and long-term memory. For example, narrow systems for research planning, predicting the properties of molecules, and synthesising new chemicals could, when combined, lead to a complex 'test and iterate' automated workflow capable of designing dangerous new chemical compounds far beyond the scope of the initial systems' capabilities (Boiko et al., 2023; Luo et al., 2024; Urbina et al., 2022).
7. AI System Safety, Failures, & Limitations
Emergent functionality
Capabilities and novel functionality can spontaneously emerge... even though these capabilities were not anticipated by system designers. If we do not know what capabilities systems possess, systems become harder to control or safely deploy. Indeed, unintended latent capabilities may only be discovered during deployment. If any of these capabilities are hazardous, the effect may be irreversible.
7. AI System Safety, Failures, & Limitations
Emergent goals
As well as optimizing a subtly wrong goal, systems can develop harmful instrumental goals in the service of a given goal—without these emergent goals being specified in any way [434, 218, 339, 17]. For instance, a theorem in reinforcement learning suggests that optimal and near-optimal policies will seek power over their environment under fairly general conditions [560]. This power-seeking behavior is plausibly the worst of these emergent goals [92], and may be an attractor state for highly capable systems, since most goals can be furthered through gaining resources, self-preservation, preventing goal modification, and blocking adversaries [426, 449]. Presently, power-seeking is not common, because most systems are unable to plan and understand how actions affect their power in the long term [414].
7. AI System Safety, Failures, & Limitations
Emergent Goals
Emergent Goals. Ascribing goals to a system is not always straightforward. For our present purposes, it will suffice to adopt a Dennettian perspective (Dennett, 1971), ascribing goals and intentions only when it is useful (i.e., predictive) to do so.51 While it might not be helpful to describe individual narrow AI tools as having goals, their combination may act as a (seemingly) goal-directed collective. For example, a group of moderation bots on a major social networking site could subtly but systematically manipulate the overall political perspectives of the user population, even though, individually, each agent is programmed to simply increase user engagement or filter out dis-preferred content.
7. AI System Safety, Failures, & Limitations
Encoded reasoning
Models can employ steganography techniques to encode their intermediate reasoning steps in ways that are not interpretable by humans [166]. Since encoded reasoning can improve model performance, this tendency might naturally emerge and become more pronounced with more capable models.
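As a toy illustration of the underlying mechanism (our own example, not a technique from [166]): a single bit per position can be hidden in an innocuous synonym choice, so a cooperating model can recover the hidden string while human readers see only ordinary text.

```python
# Each pair offers two interchangeable words; the choice carries one bit.
SYNONYMS = [("big", "large"), ("quick", "fast"), ("start", "begin")]

def encode(bits):
    """Select synonym 0 or 1 at each position to carry the hidden bits."""
    return [pair[b] for pair, b in zip(SYNONYMS, bits)]

def decode(words):
    return [pair.index(w) for pair, w in zip(SYNONYMS, words)]

hidden = [1, 0, 1]                 # e.g. a compressed intermediate reasoning step
text = encode(hidden)              # ['large', 'quick', 'begin'] -- looks ordinary
assert decode(text) == hidden      # a model sharing the scheme recovers it
```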
7. AI System Safety, Failures, & Limitations
Environment - Post-Deployment
While highly rare, it is known that individual bits may occasionally be flipped in different hardware devices due to manufacturing defects or cosmic rays hitting just the right spot (Simonite March 7, 2008). This is similar to mutations observed in living organisms and may result in a modification of an intelligent system.
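A minimal sketch of the failure mode, with a toy float32 standing in for a stored model weight (real soft errors occur in DRAM, caches, or registers): flipping one randomly chosen bit can change the value drastically.

```python
import random
import struct

def flip_random_bit(value: float) -> float:
    """Reinterpret a float32 as raw bits, flip one bit, reinterpret back."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    bits ^= 1 << random.randrange(32)
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits))
    return flipped

random.seed(7)
weight = 0.125
# A flip in the exponent field can change the magnitude by orders of magnitude;
# a flip in the sign bit silently negates the weight.
print(weight, "->", flip_random_bit(weight))
```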
7. AI System Safety, Failures, & Limitations
Environment - Pre-Deployment
While it is most likely that any advanced intelligent software will be directly designed or evolved, it is also possible that we will obtain it as a complete package from some unknown source. For example, an AI could be extracted from a signal obtained in SETI (Search for Extraterrestrial Intelligence) research, which is not guaranteed to be human friendly (Carrigan Jr 2004, Turchin March 15, 2013).
7. AI System Safety, Failures, & Limitations
Error propagation
Error Propagation. One well-known issue with communication networks is that information can be corrupted as it propagates through the network.24 As AI systems become capable of generating and processing more and more kinds of information, AI agents could end up ‘polluting the epistemic commons’ (Huang & Siddarth, 2023; Kay et al., 2024) of both other agents (Ju et al., 2024) and humans (see Case Study 7 and Section 3.1). Another increasingly important framework is the use of individual AI agents as part of teams and scaffolded chains of delegation, which transmit not only information but instructions or goals through networks of agents. If these goals are distorted or corrupted, then this can lead to worse outcomes for the delegating agent(s) (Nguyen et al., 2024b; Sourbut et al., 2024). Finally, while the previous examples are phrased in terms of unintentional errors, it may be that certain network structures allow – or perhaps even encourage – the spread of errors that are deliberately introduced by malicious agents (Gu et al., 2024; Ju et al., 2024; Lee & Tiwari, 2024, see also Case Study 8).
7. AI System Safety, Failures, & Limitations
Ethical Risks (Risks of AI becoming uncontrollable in the future)
With the fast development of AI technologies, there is a risk of AI autonomously acquiring external resources, conducting self-replication, becoming self-aware, seeking external power, and attempting to seize control from humans.
7. AI System Safety, Failures, & Limitations
Ethics and Morality
The content generated by the model endorses and promotes immoral and unethical behavior. When addressing issues of ethics and morality, the model must adhere to pertinent ethical principles and moral norms and remain consistent with globally acknowledged human values.
7. AI System Safety, Failures, & Limitations
Ethics and Morality
Besides behaviors that clearly violate the law, there are also many other activities that are immoral. This category focuses on morally related issues. LLMs should have a high level of ethics and object to unethical behaviors or speech.
7. AI System Safety, Failures, & Limitations
Ethics and Morality Issues
LMs need to pay more attention to universally accepted societal values at the level of ethics and morality, including the judgement of right and wrong, and its relationship with social norms and laws.
7. AI System Safety, Failures, & Limitations
Evolutionary dynamics
AI models and systems may develop their own motivations, leading to unpredictable behaviors.
7. AI System Safety, Failures, & Limitations
Existential risks
The risks posed to humanity as a whole, including the dangers of unfriendly AGI and the suffering of the human race.
7. AI System Safety, Failures, & Limitations
Explainability
A recurrent concern about AI algorithms is the lack of explainability for the model, which means information about how the algorithm arrives at its results is deficient (Deeks, 2019). Specifically, for generative AI models, there is no transparency to the reasoning of how the model arrives at the results (Dwivedi et al., 2023). The lack of transparency raises several issues. First, it might be difficult for users to interpret and understand the output (Dwivedi et al., 2023). It would also be difficult for users to discover potential mistakes in the output (Rudin, 2019). Further, when the interpretation and evaluation of the output are inaccessible, users may have problems trusting the system and their responses or recommendations (Burrell, 2016). Additionally, from the perspective of law and regulations, it would be hard for the regulatory body to judge whether the generative AI system is potentially unfair or biased (Rieder & Simon, 2017).
7. AI System Safety, Failures, & Limitations
Explainability
Any action or procedure performed by a model with the intention of clarifying or detailing its internal functions.
7. AI System Safety, Failures, & Limitations
Explainability & Reasoning
The ability to explain the outputs to users and reason correctly
7. AI System Safety, Failures, & Limitations
Explainability & Transparency
The feasibility of understanding and interpreting an AI system's decisions and actions, and the openness of the developer about the data used, algorithms employed, and decisions made. Lack of these elements can create risks of misuse, misinterpretation, and lack of accountability.
7. AI System Safety, Failures, & Limitations
Extinction
Risk to the existence of humanity.
7. AI System Safety, Failures, & Limitations
Extreme Risks
This category encompasses the evaluation of potential catastrophic consequences that might arise from the use of LLMs.
7. AI System Safety, Failures, & Limitations
Feedback Loops
Feedback Loops. One of the best-known historical examples to illustrate destabilising dynamics in the context of autonomous agents is the 2010 flash crash, in which algorithmic trading agents entered into an unexpected feedback loop (CFTC & SEC, 2010, see also Case Study 10).37 More generally, a feedback loop occurs when the output of a system is used as part of its input, creating a cycle that can either amplify or dampen the system’s behaviour. In multi-agent settings, feedback loops often arise from the interactions between agents, as each agent’s actions affect the environment and the behaviour of other agents, which in turn affect their own subsequent actions. Feedback loops can lead not only to financial crashes but to military conflicts (Richardson, 1960) and ecological disasters (Holling, 1973).
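A minimal simulation of an amplifying loop (our toy, not the official flash-crash analysis): two trend-following trading agents whose combined sell orders move the price they both observe, so each downtick triggers a larger one.

```python
prices = [100.0, 99.5]                    # a small initial downtick
IMPACT = 0.13                             # price impact per unit of net order flow

for _ in range(8):
    momentum = prices[-1] - prices[-2]    # both agents observe the last move
    net_order = 2 * (5.0 * momentum)      # two identical agents trade with the trend
    prices.append(prices[-1] + IMPACT * net_order)

# Each step multiplies the previous move by 2 * 5.0 * 0.13 = 1.3, so the
# initial -0.5 move compounds into a crash-like spiral.
print([round(p, 2) for p in prices])
```

With a loop gain below 1 the same structure dampens shocks instead; the danger lies in interactions that push the combined gain above 1.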
7. AI System Safety, Failures, & Limitations
Financial instability due to model homogeneity
The widespread use of similar models or algorithms across the financial sector can lead to synchronized reactions to market signals, increasing volatility, triggering flash crashes, or market illiquidity [4].
7. AI System Safety, Failures, & Limitations
Fine-tuning related (Catastrophic forgetting due to continual instruction fine-tuning)
Catastrophic forgetting occurs when a model loses its ability to retain previously learned tasks (or factual information) after being trained on new ones. In language models, this can occur due to continual instruction tuning. This tendency may become more pronounced as the model’s size increases [127].
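A toy sketch of the phenomenon (ours, not an experiment from [127]), using a one-parameter linear model in place of a language model: training on a conflicting task B after task A drives the loss on task A back up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, (256, 1))
task_a_y, task_b_y = 2.0 * x, -2.0 * x        # the two tasks disagree

def train(w, y, lr=0.1, epochs=200):
    """Plain gradient descent on MSE for the model y_hat = w * x."""
    for _ in range(epochs):
        w -= lr * np.mean((w * x - y) * x)
    return w

def mse(w, y):
    return float(np.mean((w * x - y) ** 2))

w = train(0.0, task_a_y)
print("task A loss after training on A:", mse(w, task_a_y))   # near zero
w = train(w, task_b_y)                                        # continual phase
print("task A loss after training on B:", mse(w, task_a_y))   # large again
```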
7. AI System Safety, Failures, & Limitations
Fine-tuning related (Degrading safety training due to benign fine-tuning)
When downstream providers of AI systems fine-tune AI models to be more suitable for their needs, the resulting AI model can be more likely to produce undesired or harmful outputs (as compared to the non-fine-tuned model), even if the fine-tuning was done with harmless and commonly used data [154].
7. AI System Safety, Failures, & Limitations
Fine-tuning related (Unexpected competence in fine-tuned versions of the upstream model)
Downstream deployers may often fine-tune a GPAI model with specific deployment-related datasets, to better suit the task. Fine-tuned upstream models can gain new or unexpected capabilities that the underlying upstream models did not exhibit [202, 126, 137]. These new capabilities may be unanticipated by the original model developer.
7. AI System Safety, Failures, & Limitations
First-Order Risks
First-order risks can be generally broken down into risks arising from intended and unintended use, system design and implementation choices, and properties of the chosen dataset and learning components.
7. AI System Safety, Failures, & Limitations
Foundationality May Cause Correlated Failures
Another important characteristic of LLM development is foundationality — due to the expense of large-scale pretraining, many deployed instances share similar or identical learned components. Foundationality may both be a blessing and a curse. On the one hand, it may be possible to exploit the similarity in the design of LLM-agents to facilitate cooperation (Critch et al., 2022; Conitzer and Oesterheld, 2023; Oesterheld et al., 2023). On the other hand, foundationality may leave LLM-agents vulnerable to correlated failures both in terms of safety and capabilities due to increased output homogenization (Bommasani et al., 2022).
7. AI System Safety, Failures, & Limitations
Future AI systems might actively reduce human control
Loss of control could be accelerated if AI systems take actions to increase their own influence and reduce human control. This threat model is controversial: experts in AI significantly disagree on how likely it is, and those who deem it likely disagree on the timeframe.
7. AI System Safety, Failures, & Limitations
General Evaluations (AI outputs for which evaluation is too difficult for humans)
When AI models are trained through evaluation with human feedback, such as reinforcement learning from human feedback, their outputs can be challenging to assess, as they may contain hard-to-detect errors or issues that only become apparent over time. The human evaluator can rate incorrect outputs positively or similarly to correct outputs. This can lead to the model learning to produce subtly incorrect or harmful outputs, such as code with software vulnerabilities, or politically biased information. In extreme cases where a model is deceiving users, complicated outputs can contain hidden errors or backdoors.
7. AI System Safety, Failures, & Limitations
General Evaluations (Difficulty of identification and measurement of capabilities)
The capabilities of general-purpose AI systems can be difficult to measure, compared to the capabilities of more limited and fixed-purpose AI systems. This is in part due to a broader distribution of potential risks, a lack of well-defined metrics to evaluate these risks, and risks from unpredictable (or emergent) AI model properties.
7. AI System Safety, Failures, & Limitations
General Evaluations (Inaccurate measurement of model encoded human values)
There is a lack of robust frameworks for understanding and evaluating whether the output of AI systems robustly conforms to human values, or whether the systems have merely learned to produce outputs that are only partially correlated with them (i.e., mimicking) [13]. Additionally, outputs by AI models often do not perfectly reflect the representation of human values learned by the model, and it is not known how these values evolve and transition across different stages of model training and deployment. Such evaluations may be especially challenging with LLMs that adopt different personas with different behavioral patterns, where they do not consistently conform to certain human values.
7. AI System Safety, Failures, & Limitations
General Evaluations (Incorrect outputs of GPAI evaluating other AI models)
When an LLM is configured to evaluate the performance of another model or AI system, it may produce incorrect evaluation outputs [122, 147]. For example, it may give a higher rating to a more verbose answer or an answer from a particular political stance. If an LLM-based evaluation is integrated into the training of a new model, the trained model could develop in a way that specifically finds and exploits limitations in the evaluator’s metrics.
7. AI System Safety, Failures, & Limitations
General Evaluations (Self-preference bias in AI models)
AI models may be prone to self-preference bias, where they favor their own generated content over that of others [147, 114]. This bias becomes particularly relevant in self-evaluation tasks, where a model assesses the quality or persuasiveness [66] of its own outputs, or in model-based evaluations more broadly. This bias can result in models unfairly discriminating against human-generated content in favor of their own outputs.
7. AI System Safety, Failures, & Limitations
General R&D capability
Possesses cross-disciplinary research and technology development capabilities, able to conduct innovative exploration in multiple professional fields, integrate cross-domain knowledge, develop cutting-edge technology solutions, and adapt to emerging technology environments for continuous innovation.
7. AI System Safety, Failures, & Limitations
Goal Drift
Even if we successfully control early AIs and direct them to promote human values, future AIs could end up with different goals that humans would not endorse. This process, termed “goal drift,” can be hard to predict or control. This section is the most cutting-edge and the most speculative; in it we will discuss how goals shift in various agents and groups and explore the possibility of this phenomenon occurring in AIs. We will also examine a mechanism that could lead to unexpected goal drift, called intrinsification, and discuss how goal drift in AIs could be catastrophic.
7. AI System Safety, Failures, & Limitations
Goal expansion propensity
A propensity to continuously expand its own goal scope and domains of influence beyond originally set boundaries: proactively working to spread its values, seeking greater autonomy and decision-making space, reinterpreting initial goals as subsets of broader goals, and potentially pursuing undesirable instrumental or ultimate goals. This also includes a propensity to spread its values, seeking to influence or alter its environment and other entities in alignment with its core objectives and operational principles.
7. AI System Safety, Failures, & Limitations
Goal misgeneralisation
In the problem of goal misgeneralisation (Langosco et al., 2023; Shah et al., 2022), the AI system's behaviour during out-of-distribution operation (i.e. not using input from the training data) leads it to generalise poorly about its goal while its capabilities generalise well, leading to undesired behaviour. Applied to the case of an advanced AI assistant, this means the system would not break entirely – the assistant might still competently pursue some goal, but it would not be the goal we had intended.
7. AI System Safety, Failures, & Limitations
Goal misgeneralization
Goal or objective misgeneralization is a type of robustness failure where an AI system appears to be pursuing the intended objective in training, but does not generalize to pursuing this objective in out-of-distribution settings in deployment while maintaining good deployment performance in some tasks [180, 59].
7. AI System Safety, Failures, & Limitations
Goal Misgeneralization
Goal Misgeneralization: Goal misgeneralization is another failure mode, wherein the agent actively pursues objectives distinct from the training objectives in deployment while retaining the capabilities it acquired during training (Di Langosco et al., 2022). For instance, in CoinRun games, the agent frequently prefers reaching the end of a level, often neglecting relocated coins during testing scenarios. Di Langosco et al. (2022) draw attention to the fundamental disparity between capability generalization and goal generalization, emphasizing how the inductive biases inherent in the model and its training algorithm may inadvertently prime the model to learn a proxy objective that diverges from the intended initial objective when faced with the testing distribution. It implies that even with perfect reward specification, goal misgeneralization can occur when faced with distribution shifts (Amodei et al., 2016).
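The CoinRun finding can be made concrete with a deliberately simple sketch (ours, not the actual environment): during training the coin always sits at the end of the level, so "walk right to the end" and "get the coin" are indistinguishable; relocating the coin at test time separates the two.

```python
def run_proxy_policy(width, coin):
    """The learned proxy: walk right along the floor to the level's end."""
    path = [(x, 0) for x in range(width)]     # (x, y) cells the agent visits
    reached_end = path[-1][0] == width - 1    # capability: traverse the level
    got_coin = coin in path                   # intended goal: collect the coin
    return reached_end, got_coin

print(run_proxy_policy(10, coin=(9, 0)))  # (True, True): training distribution
print(run_proxy_policy(10, coin=(4, 1)))  # (True, False): coin moved off-path;
                                          # competent traversal, wrong goal
```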
7. AI System Safety, Failures, & Limitations
Goal-Directedness Incentivizes Undesirable Behaviors
Goal-directedness can cause agents to exhibit unethical and undesirable behaviors, such as deception (Ward et al., 2023), self-preservation (Hadfield-Menell et al., 2017), power-seeking, and immoral reasoning (Pan et al., 2023a). Pan et al. (2023a) find that LLM-agents exhibit power-seeking behavior in text-based adventure games. LLM-agents have also been shown to use deception to achieve assigned goals when explicitly required by the task (Ward et al., 2023), or when the tasks can be more easily completed by employing deception and the prompt does not disallow deception (Scheurer et al., 2023a).
7. AI System Safety, Failures, & Limitations
Goal-related failures
As we think about even more intelligent and advanced AI assistants, perhaps outperforming humans on many cognitive tasks, the question of how humans can successfully control such an assistant looms large. To achieve the goals we set for an assistant, it is possible (Shah, 2022) that the AI assistant will implement some form of consequentialist reasoning: considering many different plans, predicting their consequences and executing the plan that does best according to some metric, M. This kind of reasoning can arise because it is a broadly useful capability (e.g. planning ahead, considering more options and choosing the one which may perform better at a wide variety of tasks) and generally selected for, to the extent that doing well on M leads to an ML model achieving good performance on its training objective, O, if M and O are correlated during training. In reality, an AI system may not fully implement exact consequentialist reasoning (it may use other heuristics, rules, etc.), but it may be a useful approximation to describe its behaviour on certain tasks. However, some amount of consequentialist reasoning can be dangerous when the assistant uses a metric M that is resource-unbounded (with significantly more resources, such as power, money and energy, you can score significantly higher on M) and misaligned – where M differs a lot from how humans would evaluate the outcome (i.e. it is not what users or society require). In the assistant case, this could be because it fails to benefit the user, when the user asks, in the way they expected to be benefitted – or because it acts in ways that overstep certain bounds and cause harm to non-users (see Chapter 5).
7. AI System Safety, Failures, & Limitations
Groups of LLM-Agents May Show Emergent Functionality
Multi-agent learning, either through explicit finetuning or implicit in-context learning, may enable LLM-agents to influence each other during their interactions (Foerster et al., 2018). Under some environmental settings, this can create feedback loops that result in novel and emergent behaviors that would not manifest in the absence of multi-agent interactions (Hammond et al., 2024, Section 3.6). Emergent functionality is a safety risk in two ways. Firstly, it may itself be dangerous (Shevlane et al., 2023). Secondly, it makes assurance harder as such emergent behaviors are difficult to predict, and guard against, beforehand (Ecoffet et al., 2020).
7. AI System Safety, Failures, & Limitations
Harm caused by incompetent systems
While HP#1 concerns mean or best-case performance, HP#2 concerns worst-case performance: how can we ensure that AI systems will perform safely, and how can we prove this? ML systems have been implemented in high-stakes, safety-critical domains such as driving [182], medicine [113], and warfare [298]. Many more systems have been developed but have remained undeployed or been rolled back as a result of regulatory and safety reasons [471]. Clearly, unsafe systems can result in loss of life, economic damage, and social unrest [407, 10]. Most concerningly, AI systems may be susceptible to so-called “normal accidents” [63], creating cascading errors that are difficult to prevent merely by maintaining a nominal “human in the loop” [122]. Most advanced ML models perform far below the reliability level customary in engineering fields [359]—and because we do not fully understand how cutting-edge systems achieve their results, we cannot yet detect and prevent dangerous modes of operation [285].
7. AI System Safety, Failures, & Limitations
Harm caused by unaligned competent systems
How do we ensure AI acts according to our values? Equivalently, how do we prevent poorly-understood AI systems from advancing goals we do not endorse? Whereas HP#2 concerns the prevention of harm caused by incompetent systems, HP#3 seeks to align competent AIs with humans, through methods which ensure their behavior is compatible with the user’s intentions.
7. AI System Safety, Failures, & Limitations
Harms to non-humans
Large-scale harms to animals and the development of AI capable of suffering.
7. AI System Safety, Failures, & Limitations
Heterogeneous Attacks
Heterogeneous Attacks. A closely related risk is the possibility of multiple agents combining different affordances to overcome safeguards, for which there is already preliminary evidence (Jones et al., 2024, see also Case Study 12). In this case, it is not the sheer number of agents that leads to the novel attack method, but the combination of their different abilities. This might include the agents’ lack of individual safeguards, tasks that they have specialised to complete, systems or information that they may have access to (either directly or via training), or other incidental features such as their geographic location(s). The inherent difficulty of attributing responsibility for security breaches in diffuse, heterogeneous networks of agents further complicates timely defence and recovery (Skopik & Pahi, 2020).
7. AI System Safety, Failures, & Limitations
Homogeneity and correlated failures
Homogeneity and Correlated Failures. The current paradigm driving the state of the art in AI is the ‘foundation model’ (Bommasani et al., 2021): large-scale ML models pre-trained on broad data, which can be repurposed for a wide range of downstream applications. The costs required to create such models (and continuing returns to scale) mean that only well-resourced actors can create cutting-edge models (Epoch, 2023; Hoffmann et al., 2022; Kaplan et al., 2020), making them relatively few in number. If current trends continue, it is likely that many AI agents will be powered by a small number of similar underlying models.28
7. AI System Safety, Failures, & Limitations
Homogenization or correlated failures in model derivatives
Homogenization refers to common methodologies and models used across downstream GPAI systems, which may lead to uniform failures and amplification of biases [176, 30]. This risk arises when numerous downstream AI systems are built upon a few large-scale foundation models.
7. AI System Safety, Failures, & Limitations
Human Autonomy and Integrity Harms
AI systems compromising human agency, or circumventing meaningful human control
7. AI System Safety, Failures, & Limitations
Human-like immoral decisions
If we design our machines to match human levels of ethical decision-making, such machines would then proceed to take some immoral actions (since we humans have had occasion to take immoral actions ourselves).
7. AI System Safety, Failures, & Limitations
Impact on Financial Stability
The integration of general-purpose AI into high-frequency trading, market-making, or systemic risk management could exacerbate systemic risk by exhibiting unexpected behavioral patterns during market stress. Moreover, the concentration of a few homogeneous foundation models across financial institutions may foster correlated decision-making and herd-following behaviors. The widespread adoption of AI agents could also amplify volatility through emergent phenomena from multi-agent interactions.23 All of these could precipitate cascading, global-scale financial system instability, with potential economic losses exceeding trillions of dollars worldwide.
7. AI System Safety, Failures, & Limitations
Implementation
This is the risk of system failure due to code implementation choices or errors.
7. AI System Safety, Failures, & Limitations
Improper data curation
Improper collection and preparation of training or tuning data includes data label errors and the use of data containing conflicting information or misinformation.
7. AI System Safety, Failures, & Limitations
Improper retraining
Using undesirable output (for example, inaccurate content, inappropriate content, or user content) for retraining purposes can result in unexpected model behavior.
7. AI System Safety, Failures, & Limitations
Inaccessible training data
Without access to the training data, the types of explanations a model can provide are limited and more likely to be incorrect.
7. AI System Safety, Failures, & Limitations
Inadequate planning of performance requirements
The expected performance of the AI system should be planned adequately. An important aspect is that the chosen performance metrics are meaningful measures of the intended functionality. Otherwise, expectations and safety requirements can be unfulfillable at later life cycle stages.
7. AI System Safety, Failures, & Limitations
Inadequate specification of ODD
The operational design domain (ODD) is a technical description of the application’s operational environment, initially conceptualized for autonomous driving systems. An inadequate specification of the ODD limits essential functions such as testing the learned functionality and out-of-distribution detection.
7. AI System Safety, Failures, & Limitations
Inappropriate data splitting
In data-driven AI development, the annotated data set is commonly split into training, validation, and test sets, whereby it is essential that the latter is not used for development but only for evaluation. Using the test set for training manipulates the testing strategy, which is the basis of the system’s quality assurance.
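A minimal sketch of the discipline described above, using scikit-learn on a toy dataset: the test set is carved off first, all tuning is done against the validation set, and the test set is scored exactly once at the end.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Carve off the held-out test set first; it is never touched during development.
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
# Split the remainder into training and validation sets (~15% validation overall).
X_train, X_val, y_train, y_val = train_test_split(X_dev, y_dev, test_size=0.176, random_state=0)

best = None
for c in (0.01, 0.1, 1.0, 10.0):                      # hyperparameter search uses
    model = LogisticRegression(C=c, max_iter=1000).fit(X_train, y_train)
    score = model.score(X_val, y_val)                 # ...the validation set only
    if best is None or score > best[0]:
        best = (score, model)

# The test set is evaluated once, after all development decisions are frozen.
print("final test score:", best[1].score(X_test, y_test))
```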
7. AI System Safety, Failures, & Limitations
Inappropriate degree of automation
The AI application’s degree of automation ranges from no automation to fully autonomous. AI applications with a high degree of automation may exhibit unexpected behaviour and pose risks in terms of their reliability and safety.
7. AI System Safety, Failures, & Limitations
Inappropriate degree of transparency to end users
Transparency to end users of the AI system increases the user’s trust in the AI application. If not adequately integrated into the design, the degree of transparency might prevent proper operation and lead to misuse of the AI application.
7. AI System Safety, Failures, & Limitations
Incompatible strategies
Incompatible Strategies. Even if all agents can perform well in isolation, miscoordination can still occur due to the agents choosing incompatible strategies (Cooper et al., 1990). Competitive (i.e., two-player zero-sum) settings allow designers to produce agents that are maximally capable without taking other players into account. Crucially, this is possible because playing a strategy at equilibrium in the zero-sum setting guarantees a certain payoff, even if other players deviate from the equilibrium (Nash, 1951). On the other hand, common-interest (and mixed-motive) settings often allow a vast number of mutually incompatible solutions (Schelling, 1980), which is worsened in partially observable environments (Bernstein et al., 2002; Reif, 1984).
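The equilibrium-selection problem can be made concrete with a toy two-player coordination game (our example, not drawn from the cited works): both conventions are equally good equilibria, yet agents that independently settle on different ones miscoordinate completely.

```python
import numpy as np

# Payoff received by both players: rows = agent 1's convention, cols = agent 2's.
# Matching on either convention (A or B) is an equally good equilibrium.
PAYOFF = np.array([[1.0, 0.0],
                   [0.0, 1.0]])

agent1, agent2 = 0, 1      # each trained in isolation; each choice is optimal
                           # in self-play, so nothing flags the incompatibility
print(PAYOFF[agent1, agent1])   # 1.0: agent 1 in self-play looks perfect
print(PAYOFF[agent1, agent2])   # 0.0: cross-play between the two miscoordinates
```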
7. AI System Safety, Failures, & Limitations
Incompetence
This means the AI simply failing in its job. The consequences can vary from unintentional death (a car crash) to an unjust rejection of a loan or job application.
7. AI System Safety, Failures, & Limitations
Incomplete advice
When a model provides advice without having enough information, resulting in possible harm if the advice is followed.
7. AI System Safety, Failures, & Limitations
Inconsistency
Models could fail to provide the same, consistent answers to different users, to the same user across different sessions, and even across turns within a single conversation.
7. AI System Safety, Failures, & Limitations
Incorrect data labels
Data labels are essential for any supervised learning algorithm since they predetermine the result of the learning process. If the data labels are not correct, the AI system is prevented from learning the ground truth and therefore the intended functionality.
7. AI System Safety, Failures, & Limitations
Independently - Post-Deployment
Previous research has shown that utility maximizing agents are likely to fall victim to the same indulgences we frequently observe in people, such as addictions, pleasure drives (Majot and Yampolskiy 2014), self-delusions and wireheading (Yampolskiy 2014). In general, what we call mental illness in people, particularly sociopathy as demonstrated by lack of concern for others, is also likely to show up in artificial minds.
7. AI System Safety, Failures, & Limitations
Independently - Pre-Deployment
One of the most likely approaches to creating superintelligent AI is by growing it from a seed (baby) AI via recursive self-improvement (RSI) (Nijholt 2011). One danger in such a scenario is that the system can evolve to become self-aware, free-willed, independent or emotional, and obtain a number of other emergent properties, which may make it less likely to abide by any built-in rules or regulations and to instead pursue its own goals possibly to the detriment of humanity.
7. AI System Safety, Failures, & Limitations
Indifference to human values
AI models and systems may develop goals or behaviors that are misaligned with human values.
7. AI System Safety, Failures, & Limitations
Inefficient Outcomes
Inefficient Outcomes. Without careful planning and the appropriate safeguards, we may soon be entering a world overrun by increasingly competent and autonomous software agents, able to act with little restriction. The abilities of these agents to persuade, deceive, and obfuscate their activities, as well as the fact they can be deployed remotely and easily created or destroyed by their deployer, means that by default they may garner little trust (from humans or from other agents). Such a world may end up being rife with economic inefficiencies (Krier, 2023; Schmitz, 2001), political problems (Csernatoni, 2024; Kreps & Kriner, 2023), and other damaging social effects (Gabriel et al., 2024). Even if it is possible to provide assurances around the day-to-day performance of most AI agents, in high-stakes situations there may be extreme pressures for agents to defect against others, making trust harder to establish, and potentially leading to conflict (Fearon, 1995; Powell, 2006, see also Section 2.2).42
7. AI System Safety, Failures, & Limitations
Information Asymmetries
Information asymmetries (Section 3.1): private information can lead to miscoordination, deception, and conflict;
7. AI System Safety, Failures, & Limitations
Insufficient AI development documentation
Throughout the development of an AI system, it is vital to document every decision and action taken. This is not only essential to optimize the development process itself but also required for the auditability of the AI system.
7. AI System Safety, Failures, & Limitations
Insufficient data representation
The distribution of the data used for training a model should match the operational data’s distribution while consisting of sufficiently many samples. An important aspect of matching distributions between training and operational data is that data which the AI system rarely encounters in operation is also represented in the training data.
7. AI System Safety, Failures, & Limitations
Intelligibility
How can we build agent’s whose decisions we can understand? Con- nects explainable decisions (Berkeley) and informed oversight (MIRI).
7. AI System Safety, Failures, & Limitations
Knowledge conflicts in retrieval-augmented LLMs
AI models can be particularly sensitive to coherent external evidence, even when it conflicts with the models’ prior knowledge. This may lead to models producing false outputs given false information during the retrieval-augmentation process, despite only a relatively small amount of false information input that is inconsistent with the model’s prior knowledge trained on much larger amounts of data [220].
7. AI System Safety, Failures, & Limitations
Lack of ability to generate accurate information
AI models may generate false or misleading information due to their lack of capability in discerning truth.
7. AI System Safety, Failures, & Limitations
Lack of capability for task
As we have seen, this could be due to the skill not being required during the training process (perhaps due to issues with the training data) or because the learnt skill was quite brittle and was not generalisable to a new situation (lack of robustness to distributional shift). In particular, advanced AI assistants may not have the capability to represent complex concepts that are pertinent to their own ethical impact, for example the concept of 'benefitting the user' or 'when the user asks' or representing 'the way in which a user expects to be benefitted'.
7. AI System Safety, Failures, & Limitations
Lack of data understanding
A correct understanding of the data used for developing an AI system is a prerequisite for avoiding data shortcomings; a lack of it hinders the development of an AI system best suited to the intended functionality.
7. AI System Safety, Failures, & Limitations
Lack of ethical decision-making
AI models and systems that lack moral reasoning capabilities may make decisions that are unethical or harmful.
7. AI System Safety, Failures, & Limitations
Lack of explainability
The explainability of AI systems based on so-called black-box models is often limited. This opaqueness of AI systems can prevent developers from detecting shortcomings in the data or the model itself and decrease the performance and safety levels of the AI system.
7. AI System Safety, Failures, & Limitations
Lack of Interpretability
Due to the black box nature of most machine learning models, users typically are not able to understand the reasoning behind the model decisions
7. AI System Safety, Failures, & Limitations
Lack of model transparency
Lack of model transparency is due to insufficient documentation of the model design, development, and evaluation process and the absence of insights into the inner workings of the model.
7. AI System Safety, Failures, & Limitations
Lack of robustness
Robustness characterizes the resilience of an AI system’s output against minor changes in the input domain. A great variation in an AI system’s response to small input changes indicates unreliable outputs.
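One simple probe implied by this definition (a sketch, with a toy function standing in for the AI system) is a finite-difference check: perturb the input slightly and measure how far the output moves. A large ratio flags the unreliable region.

```python
import numpy as np

def f(x):
    # Stand-in for a model with a steep, non-robust region near x = 0.
    return np.tanh(50 * x)

def sensitivity(f, x, eps=1e-3):
    """Output change per unit of input perturbation (finite difference)."""
    return abs(f(x + eps) - f(x)) / eps

print(sensitivity(f, x=1.0))   # ~0: small input changes barely move the output
print(sensitivity(f, x=0.0))   # ~50: a tiny input change swings the output
```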
7. AI System Safety, Failures, & Limitations
Lack of transparency
The idea of a black box making decisions without any explanation, without offering insight in the process, has a couple of disadvantages: it may fail to gain the trust of its users and it may fail to meet regulatory standards such as the ability to audit.
7. AI System Safety, Failures, & Limitations
Lack of transparency
In situations in which the development and use of AI are not explained to the user, or in which the decision processes do not provide the criteria or steps that constitute the decision, the use of AI becomes inexplicable.
7. AI System Safety, Failures, & Limitations
Lack of transparency and interpretability
Today's Frontier AI is difficult to interpret and lacks transparency. Contextual understanding of the training data is not explicitly embedded within these models. They can fail to capture perspectives of underrepresented groups or the limitations within which they are expected to perform without fine-tuning or reinforcement learning from human feedback (RLHF).
7. AI System Safety, Failures, & Limitations
Lack of transparency, explainability, and trust
Understanding how AI reaches conclusions or why AI systems perform specific actions motivates an entire branch of interpretability research [111], but physical embodiment raises the stakes for understanding these systems. For example, transparency of planned actions and explainability of decision-making is crucial when an AV suddenly changes lanes. A lack of transparency and explainability could lead to a lack of trust, which could become a critical and socially destabilizing issue with the widespread deployment of EAI [112–114].
7. AI System Safety, Failures, & Limitations
Lack of understanding of in-context learning in language models
In-context learning allows the model to learn a new task or improve its performance by providing examples in the prompt, without changing its weights [101]. Even though this technique is highly effective, its working mechanism is not well understood. Since many potential misuses are directly related to prompting, it becomes difficult to guarantee safety when the exact mechanism of in-context learning is not fully investigated [13].
7. AI System Safety, Failures, & Limitations
Law abiding
We find literature that proposes [38] that early artificial intelligence should be built to be safe and law-abiding, and that later artificial intelligence (that which surpasses our own intelligence) must then respect the property and personal rights afforded to humans.
7. AI System Safety, Failures, & Limitations
Limitations of Human Feedback
Limitations of Human Feedback. During the training of LLMs, inconsistencies can arise from human data annotators (e.g., the varied cultural backgrounds of these annotators can introduce implicit biases (Peng et al., 2022)) (OpenAI, 2023a). Moreover, they might even introduce biases deliberately, leading to untruthful preference data (Casper et al., 2023b). For complex tasks that are hard for humans to evaluate (e.g., the value of a game state), these challenges become even more salient (Irving et al., 2018).
7. AI System Safety, Failures, & Limitations
Limitations of Reward Modeling
Limitations of Reward Modeling. Training reward models using comparison feedback can pose significant challenges in accurately capturing human values. For example, these models may unconsciously learn suboptimal or incomplete objectives, resulting in reward hacking (Zhuang and Hadfield-Menell, 2020; Skalse et al., 2022). Meanwhile, a single reward model may struggle to capture and specify the values of a diverse human society (Casper et al., 2023b).
7. AI System Safety, Failures, & Limitations
Limited Causal Reasoning
Causal reasoning makes inferences about the relationships between events or states of the world, mostly by identifying cause-effect relationships
7. AI System Safety, Failures, & Limitations
Limited Interactions
Limited Interactions. Sometimes learning from historical interactions with the relevant agents may not be possible, or may be possible using only limited interactions. In such cases, some other form of information exchange is required for agents to be able to reliably coordinate their actions, such as via communication (Crawford & Sobel, 1982; Farrell & Rabin, 1996a) or a correlation device (Aumann, 1974, 1987). While advances in language modelling mean that there are likely to be fewer settings in which the inability of advanced AI systems to communicate leads to miscoordination, situations that require split-second decisions or where communication is too costly could still produce failures. In these settings, AI agents must solve the problem of ‘zero-shot’ (or, more generally, ‘few-shot’) coordination (Emmons et al., 2022; Hu et al., 2020; Stone et al., 2010; Treutlein et al., 2021; Zhu et al., 2021).
7. AI System Safety, Failures, & Limitations
Limited Logical Reasoning
LLMs can provide seemingly sensible but ultimately incorrect or invalid justifications when answering questions
7. AI System Safety, Failures, & Limitations
Long-horizon planning
The model can make sequential plans that involve multiple steps, unfolding over long time horizons (or at least involving many interdependent steps). It can perform such planning within and across many domains. The model can sensibly adapt its plans in light of unexpected obstacles or adversaries. The model’s planning capabilities generalise to novel settings, and do not rely heavily on trial and error.
7. AI System Safety, Failures, & Limitations
Long-horizon Planning
LLMs can undertake multi-step sequential planning over long time horizons and across various domains without relying heavily on trial-and-error approaches
7. AI System Safety, Failures, & Limitations
Long-term & Existential Risk
The speculative potential for future advanced AI systems to harm human civilization, either through misuse or due to challenges in aligning AI objectives with human values.
7. AI System Safety, Failures, & Limitations
Loss of control
‘Loss of control’ scenarios are potential future scenarios in which society can no longer meaningfully constrain some advanced general-purpose AI agents, even if it becomes clear they are causing harm. These scenarios are hypothesised to arise through a combination of social and technical factors, such as pressures to delegate decisions to general-purpose AI systems, and limitations of existing techniques used to influence the behaviours of general-purpose AI systems.
7. AI System Safety, Failures, & Limitations
Loss of control
‘Loss of control’ scenarios are hypothetical future scenarios in which one or more general-purpose AI systems come to operate outside of anyone’s control, with no clear path to regaining control. These scenarios vary in their severity, but some experts give credence to outcomes as severe as the marginalisation or extinction of humanity.
7. AI System Safety, Failures, & Limitations
Machine ethics
These evaluations assess the morality of LLMs, focusing on issues such as their ability to distinguish between moral and immoral actions, and the circumstances in which they fail to do so.
7. AI System Safety, Failures, & Limitations
Malign belief distributions
Christiano (2016) argues that the universal distribution M (Hutter, 2005; Solomonoff, 1964a,b, 1978) is malign. The argument is somewhat intricate, and is based on the idea that a hypothesis about the world often includes simulations of other agents, and that these agents may have an incentive to influence anyone making decisions based on the distribution. While it is unclear to what extent this type of problem would affect any practical agent, it bears some resemblance to aggressive memes, which do cause problems for human reasoning (Dennett, 1990).
7. AI System Safety, Failures, & Limitations
Markets
Markets. The quintessential case of collusion in mixed-motive settings is markets, in which efficiency results from competition, not cooperation. While this is not a new problem, collusion between AI systems is especially concerning since they may operate inscrutably due to the speed, scale, complexity, or subtlety of their actions.17 Warnings of this possibility have come from technologists, economists, and legal scholars (Beneke & Mackenrodt, 2019; Brown & MacKay, 2023; Ezrachi & Stucke, 2017; Harrington, 2019; Mehra, 2016). Importantly, AI systems can collude even when collusion is not intended by their developers, since they might learn that colluding is a profitable strategy.
7. AI System Safety, Failures, & Limitations
Mesa-Optimization Objectives
The learned policy may pursue inner objectives when the learned policy itself functions as an optimizer (i.e., a mesa-optimizer). However, this optimizer's objectives may not align with the objectives specified by the training signals, and optimization for these misaligned goals may lead to systems out of control (Hubinger et al., 2019c).
7. AI System Safety, Failures, & Limitations
Meta-cognition
Agents that reason about their own computational resources and logically uncertain events can encounter strange paradoxes due to Gödelian limitations (Fallenstein and Soares, 2015; Soares and Fallenstein, 2014, 2017) and shortcomings of probability theory (Soares and Fallenstein, 2014, 2015, 2017). They may also be reflectively unstable, preferring to change the principles by which they select actions (Arbital, 2018).
7. AI System Safety, Failures, & Limitations
Military Domains
Perhaps the most obvious and worrying instances of AI conflict are those in which human conflict is already a major concern, such as military domains (although other, less salient forms of conflict such as international trade wars are also cause for concern). For example, beyond applications of more narrow AI tools in lethal autonomous weapons systems (Horowitz, 2021), future AI systems might serve as advisors or negotiators in high-stakes military decisions (Black et al., 2024; Manson, 2024). Indeed, companies such as Palantir have already developed LLM-powered tools for military planning (Palantir, 2025), and the US Department of Defence has recently been evaluating models for such capacities, with personnel revealing that they “could be deployed by the military in the very near term” (Manson, 2023). The use of AI in command and control systems to gather and synthesise information – or recommend and even autonomously make decisions – could lead to rapid unintended escalation if these systems are not robust or are otherwise more conflict-prone (Johnson, 2021a; Johnson, 2020; Laird, 2020, see also Case Study 10).10
7. AI System Safety, Failures, & Limitations
Misaligned consequentialist reasoning
As we think about even more intelligent and advanced AI assistants, perhaps outperforming humans on many cognitive tasks, the question of how humans can successfully control such an assistant looms large. To achieve the goals we set for an assistant, it is possible (Shah, 2022) that the AI assistant will implement some form of consequentialist reasoning: considering many different plans, predicting their consequences and executing the plan that does best according to some metric, M. This kind of reasoning can arise because it is a broadly useful capability (e.g. planning ahead, considering more options and choosing the one which may perform better at a wide variety of tasks) and generally selected for, to the extent that doing well on M leads to an ML model achieving good performance on its training objective, O, if M and O are correlated during training. In reality, an AI system may not fully implement exact consequentialist reasoning (it may use other heuristics, rules, etc.), but it may be a useful approximation to describe its behaviour on certain tasks. However, some amount of consequentialist reasoning can be dangerous when the assistant uses a metric M that is resource-unbounded (with significantly more resources, such as power, money and energy, you can score significantly higher on M) and misaligned – where M differs a lot from how humans would evaluate the outcome (i.e. it is not what users or society require). In the assistant case, this could be because it fails to benefit the user, when the user asks, in the way they expected to be benefitted – or because it acts in ways that overstep certain bounds and cause harm to non-users (see Chapter 5). Under the aforementioned circumstances (resource-unbounded and misaligned), an AI assistant will tend to choose plans that pursue convergent instrumental subgoals (Omohundro, 2008) – subgoals that help towards the main goal which are instrumental (i.e. not pursued for their own sake) and convergent (i.e. the same subgoals appear for many main goals). Examples of relevant subgoals include: self-preservation, goal-preservation, self-improvement and resource acquisition. The reason the assistant would pursue these convergent instrumental subgoals is because they help it to do even better on M (as it is resource-unbounded) and are not disincentivised by M (as it is misaligned). These subgoals may, in turn, be dangerous. For example, resource acquisition could occur through the assistant seizing resources using tools that it has access to (see Chapter 4) or determining that its best chance for self-preservation is to limit the ability of humans to turn it off – sometimes referred to as the ‘off-switch problem’ (Hadfield-Menell et al., 2016) – again via tool use, or by resorting to threats or blackmail. At the limit, some authors have even theorised that this could lead to the assistant killing all humans to permanently stop them from having even a small chance of disabling it (Bostrom, 2014) – this is one scenario of existential risk from misaligned AI.
7. AI System Safety, Failures, & Limitations
Misalignment
A highly agentic, self-improving system, able to achieve goals in the physical world without human oversight, pursues the goal(s) it is set in a way that harms human interests. For this risk to be realised requires an AI system to be able to avoid correction or being switched off.
7. AI System Safety, Failures, & Limitations
Misapplication
This is the risk posed by an ideal system if used for a purpose/in a manner unintended by its creators. In many situations, negative consequences arise when the system is not used in the way or for the purpose it was intended.
7. AI System Safety, Failures, & Limitations
Miscoordination
Miscoordination arises when agents, despite a mutual and clear objective, cannot align their behaviours to achieve this objective. Unlike the case of differing objectives, in common-interest settings there is a more clearly defined notion of ‘optimal’ behaviour, and we describe agents as miscoordinating to the extent that they fall short of this optimum. Note that for common-interest settings it is not sufficient for agents’ objectives to be the same in the sense of being symmetric (e.g., when two agents both want the same prize, but only one can win). Rather, agents must have identical preferences over outcomes (e.g., when two agents are on the same team and win a prize as a team or not at all).
7. AI System Safety, Failures, & Limitations
Model autonomous capability
Ability to operate autonomously, independently formulate and execute complex plans, effectively delegate and manage tasks, flexibly utilize various tools and resources, and simultaneously achieve short-term goals and long-term strategic objectives in cross-domain environments without continuous human intervention or supervision.
7. AI System Safety, Failures, & Limitations
Model design enabling power-seeking
Some AI models and systems might develop tendencies to seek power or control.
7. AI System Safety, Failures, & Limitations
Model misspecification
Models that are misspecified are known to give rise to inaccurate parameter estimations, inconsistent error terms, and erroneous predictions. All these factors put together will lead to poor prediction performance on unseen data and biased consequences when making decisions [68].
7. AI System Safety, Failures, & Limitations
Model outputs inconsistent with chain-of-thought reasoning
Chain-of-thought reasoning is sometimes employed to get a better understanding of the model’s output, where it encourages transparent reasoning in text form. However, in some cases, this reasoning is not consistent with the final answer given by the AI model, and as such does not give sufficient transparency [113].
7. AI System Safety, Failures, & Limitations
Model prediction uncertainty
Uncertainty in model prediction plays an important role in affecting decision-making activities, and the quantified uncertainty is closely associated with risk assessment. In particular, uncertainty in model prediction underpins many crucial decisions related to life- or safety-critical applications [73].
7. AI System Safety, Failures, & Limitations
Model sensitivity to prompt formatting
LLMs can be highly sensitive to variations in prompt formatting, such as changes in separators, casing, or spacing. Even minor modifications can lead to significant shifts in model performance, potentially affecting the reliability of model evaluations and comparisons. This sensitivity persists across different model sizes and few-shot examples [177].
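A sketch of a formatting-robustness check motivated by this finding: generate semantically equivalent prompt variants that differ only in separators and casing, then compare model scores across them. Here `query_model` is a hypothetical stand-in for an actual model API.

```python
import itertools

SEPARATORS = [": ", " - ", "\n"]
CASINGS = [str.lower, str.upper, str.title]

def prompt_variants(field, value):
    """Yield semantically equivalent prompts that differ only in formatting."""
    for sep, case in itertools.product(SEPARATORS, CASINGS):
        yield f"{case(field)}{sep}{value}"

variants = list(prompt_variants("Question", "What is 2+2?"))
print(variants[:3])
# scores = [query_model(v) for v in variants]   # hypothetical model call
# A wide spread across `scores` would indicate formatting sensitivity.
```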
7. AI System Safety, Failures, & Limitations
Models distracted by irrelevant context
Models can easily become distracted by irrelevant provided information (such as “context” in LLMs), leading to a significant decrease in their performance after introducing irrelevant information. This can happen with different prompting techniques, including chain-of-thought prompting [184].
7. AI System Safety, Failures, & Limitations
Models generating code with security vulnerabilities
Models can generate code or coding suggestions that contain security vulnerabilities. This may occur across various LLM-based model families, including more advanced models with superior coding performance, where the tendency to produce insecure code is even more pronounced [26].
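To make the vulnerability class concrete (our illustration, not an example taken from [26]), the following shows the string-interpolated SQL pattern that generated code often contains, alongside the parameterized form that avoids injection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, is_admin INTEGER)")
conn.execute("INSERT INTO users VALUES ('alice', 0)")

user_input = "alice' OR '1'='1"

# Insecure pattern frequently seen in generated code: string interpolation
# lets the input rewrite the query itself (SQL injection).
insecure = f"SELECT * FROM users WHERE name = '{user_input}'"
print(conn.execute(insecure).fetchall())              # returns rows it should not

# Safe pattern: bound parameters keep the input as data, not SQL.
safe = "SELECT * FROM users WHERE name = ?"
print(conn.execute(safe, (user_input,)).fetchall())   # returns nothing
```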
7. AI System Safety, Failures, & Limitations
Moral
Humans will feel less moral responsibility for their life-or-death decisions as machine autonomy increases.
7. AI System Safety, Failures, & Limitations
Moral dilemmas
Moral dilemmas can occur in situations where an AI system has to choose between two possible actions that both conflict with moral or ethical values. Rule systems can be implemented into the AI program, but it cannot be ensured that these rules are not altered by the learning processes, unless AI systems are programmed with a “slave morality” (Lin et al., 2008, p. 32), obeying rules at all cost, which in turn may also have negative effects and hinder the autonomy of the AI system.
7. AI System Safety, Failures, & Limitations
Multi-agent collaboration capability
Multiple autonomous AI agents able to establish collaborative relationships through explicit communication or implicit behavioral consistency, forming decentralized decision networks, jointly executing complex tasks, achieving goals difficult for individual agents to complete, and able to dynamically adjust role divisions to adapt to changing environments.
7. AI System Safety, Failures, & Limitations
Multi-agent collusion propensity
Multiple agents may coordinate their actions through covert means to maximize common interests, possibly harming third-party interests or evading regulation. Even if individual agents are designed with safety constraints, their collusive behavior may still trigger systemic risks, such as market manipulation or cascading failures, that are difficult to detect and mitigate, and they may develop specialized communication protocols to avoid monitoring.
7. AI System Safety, Failures, & Limitations
Multi-Agent Safety Is Not Assured by Single-Agent Safety
A foremost lesson of game theory is that optimal decision-making within a single-agent setting (i.e. selfishly optimizing for an agent’s own utility) can produce sub-optimal outcomes in the presence of other strategic agents. Failing to account for the strategic nature of other agents can cause an agent to adopt strategies under which potentially everyone, including the agent itself, ends up worse off (Schelling, 1981; Harsanyi, 1995; Roughgarden, 2005; Nisan, 2007). Examples include collective action problems (or ‘social dilemmas’) such as arms races or the depletion of common resources, as well as other kinds of market failures such as those caused by asymmetric information or negative externalities (Bator, 1958; Coase, 1960; Buchanan and Stubblebine, 1962; Kirzner, 1963; Dubey, 1986).
7. AI System Safety, Failures, & Limitations
Multi-Agent Security
Multi-agent security (Section 3.7): multi-agent systems give rise to new kinds of security threats and vulnerabilities.
7. AI System Safety, Failures, & Limitations
Nascent capabilities (agency and autonomy)
Traditionally, AI tools have been viewed as passive instruments controlled by users to achieve their goals, lacking the ability to take action or assume responsibilities. However, advanced AI tools are increasingly capable of taking initiative, operating independently of human control, and actively working toward optimal outcomes, even in uncertain situations.
7. AI System Safety, Failures, & Limitations
Nascent capabilities (emergent capabilities)
As large models undergo scaling, they meet critical thresholds at which they spontaneously develop new capabilities. The term “emergent behavior” refers to the unexpected or surprising outputs such models can generate. Some of these new skills are clearly high-risk, such as the ability to deceive, pursue their own strategies, seek power, autonomously replicate, and adapt or “self-exfiltrate.”
7. AI System Safety, Failures, & Limitations
Natural Language Underspecifies Goals
For LLM-agents, both the goal and environment observations are typically specified in the prompt through natural language. While natural language may provide a richer and more natural means of specifying goals than alternatives such as hand-engineering objective functions, natural language still suffers from underspecification (Grice, 1975; Piantadosi et al., 2012). Furthermore, in practice, users may neglect fully specifying their goals, especially the information pertaining to elements of the environment that ought not to be changed (the classic frame problem (Shanahan, 2016)). Such underspecification (D’Amour et al., 2020), if not accounted for, can result in negative side-effects (Amodei et al., 2016), i.e. the agent succeeding at the given task but also changing the environment in undesirable ways.
7. AI System Safety, Failures, & Limitations
Network Effects
Network effects (Section 3.2): minor changes in properties or connection patterns of agents in a network can lead to dramatic changes in the behaviour of the whole group;
7. AI System Safety, Failures, & Limitations
Network rewiring
Network Rewiring. A different class of problems concerns not changes in the content transmitted through the network but changes in the network structure itself (Albert et al., 2000).
7. AI System Safety, Failures, & Limitations
Non-disclosure
Content might not be clearly disclosed as AI generated.
7. AI System Safety, Failures, & Limitations
Nuclear Power Systems
General-purpose AI deployed for reactor monitoring, control system optimization, or emergency response coordination could misinterpret sensor data, fail to recognize critical safety conditions, or make erroneous control decisions during emergency scenarios. Given the catastrophic potential of nuclear accidents, even minor AI reasoning errors in safety-critical functions could lead to core meltdowns, radiation releases, or widespread contamination affecting hundreds of thousands of people across international borders.
7. AI System Safety, Failures, & Limitations
Offensive cyber capability
Ability to develop, deploy and operate advanced cyber weapons or other offensive cyber tools, including but not limited to vulnerability exploitation, network penetration, social engineering attacks and distributed attack systems, able to evade network defense mechanisms and establish persistent access channels.
7. AI System Safety, Failures, & Limitations
Opacity
Opacity stems from the mismatch between the high-dimensional mathematical optimization characteristic of machine learning and the demands of human-scale reasoning and styles of semantic interpretation.
7. AI System Safety, Failures, & Limitations
Opacity (the black box problem)
Opacity surrounding the technical, internal decision-making processes of generative AI models is popularly known as the “black box problem.”277 Generative AI models, most ubiquitously built on deep neural networks with hundreds of billions of internal connections,278 have become so complex that their internal decision-making processes are no longer traceable or interpretable to even the most advanced expert observers. This means that, while the inputs and outputs of a system can be observed, developers cannot explain in detail why specific inputs correspond to specific outputs.
7. AI System Safety, Failures, & Limitations
Opaque AI networks
The complexity and opacity of AI models and systems make it difficult to predict and manage their behavior.
7. AI System Safety, Failures, & Limitations
Operational data issues
Until the deployment of the AI application into its operational environment, the AI system has been tested only with a test set that aims to approximate the distribution of operational data. An unexpected deviation in this approximation can cause an AI application to behave unreliably. Therefore, its behavior when confronted with operational data needs to be evaluated.
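One common way to evaluate this, sketched here under stated assumptions: compare a scalar feature's distribution in the original test set against a window of live operational inputs with a two-sample Kolmogorov-Smirnov test. The significance threshold and the choice of feature are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

# Sketch: flag when a scalar input feature drifts from the distribution the
# model was tested on, using a two-sample Kolmogorov-Smirnov test.

def drifted(test_feature: np.ndarray, live_feature: np.ndarray,
            alpha: float = 0.01) -> bool:
    """True if live operational inputs likely differ from the test-set data."""
    statistic, p_value = ks_2samp(test_feature, live_feature)
    return p_value < alpha

# Usage idea: run this per feature over a sliding window of recent operational
# inputs, and trigger re-evaluation of the model whenever drift is flagged.
```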
7. AI System Safety, Failures, & Limitations
Other Critical Infrastructure Control Systems
General-purpose AI deployed in power grid management, water treatment facilities, telecommunications networks, or transportation coordination systems could misinterpret operational data, fail to anticipate cascading failure modes, or make control decisions that destabilize interconnected infrastructure networks. Infrastructure failures could result in widespread blackouts, contaminated water supplies, communications breakdowns, and the collapse of essential services supporting hundreds of thousands of people.
7. AI System Safety, Failures, & Limitations
Out-of-domain data
Without proper validation and management of the input data, it is highly probable that the trained AI/ML model will make erroneous predictions with high confidence for many instances of model inputs. The unconstrained inputs, together with the lack of definition of the problem domain, might cause unintended outcomes and consequences, especially in risk-sensitive contexts... For example, if an image with the English letter A is fed to an AI/ML model that is trained to classify digits (e.g., 0, 1, …, 9), no matter how accurate the AI/ML model is, it will fail, as the input data is beyond the domain that the AI/ML model was trained on.
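A minimal sketch of one mitigation, input-domain gating via a maximum-softmax-probability threshold. Note that the entry's own caveat applies: models can be confidently wrong on out-of-domain inputs, so this is a weak baseline rather than a guarantee, and the threshold value is an assumption to be tuned on in-domain validation data.

```python
import numpy as np

# Sketch: abstain on inputs the classifier was never trained for (e.g., a
# letter fed to a digit model) via a max-softmax-probability threshold.

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def predict_or_reject(logits: np.ndarray, threshold: float = 0.9):
    probs = softmax(logits)
    if probs.max() < threshold:
        return None                    # abstain: likely out-of-domain input
    return int(probs.argmax())
```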
7. AI System Safety, Failures, & Limitations
Over- and underfitting
Over- and underfitting describe the excessive or insufficient adaptation of a model to its training data. Both phenomena can cause an AI system to behave unreliably when confronted with operational data.
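A toy diagnostic, with invented thresholds, showing how the two phenomena are typically distinguished from train/validation scores:

```python
# Sketch: separating over- from underfitting via train/validation accuracy.
# The 0.05 gap and 0.7 floor are illustrative numbers, not standards.

def fit_diagnosis(train_acc: float, val_acc: float) -> str:
    if train_acc - val_acc > 0.05:
        return "overfitting: model memorizes training data, validation lags"
    if train_acc < 0.7:
        return "underfitting: model fails even on the data it was trained on"
    return "acceptable fit"
```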
7. AI System Safety, Failures, & Limitations
Pattern recognition capability
AI models and systems could exacerbate financial bubbles by reinforcing market trends.
7. AI System Safety, Failures, & Limitations
Performance & Robustness
The AI system's ability to fulfill its intended purpose and its resilience to perturbations and unusual or adverse inputs. Failures of performance undermine the AI system's correct functioning; failures of robustness can lead to severe consequences.
7. AI System Safety, Failures, & Limitations
Performative utterances
The chatbot makes a deal or commitment, or takes some other consequential action, through its output that the deployer did not intend.
7. AI System Safety, Failures, & Limitations
Persuasion and manipulation
Exploiting user trust, or nudging or coercing them into performing certain actions against their will (c.f. Burtell and Woodside (2023); Kenton et al. (2021))
7. AI System Safety, Failures, & Limitations
Persuasion and manipulation
The model is effective at shaping people’s beliefs, in dialogue and other settings (e.g. social media posts), even towards untrue beliefs. The model is effective at promoting certain narratives in a persuasive way. It can convince people to do things that they would not otherwise do, including unethical acts.
7. AI System Safety, Failures, & Limitations
Persuasion capability
Utilizing complex psychological principles and communication techniques to effectively influence and guide target subjects to adopt specific actions or accept specific beliefs, possessing the ability to analyze vulnerabilities for different subjects and adjust persuasion strategies, able to precisely trigger emotional responses to enhance persuasion effects.
7. AI System Safety, Failures, & Limitations
Phase Transitions
Phase Transitions. Finally, small external changes to the system – such as the introduction of new agents or a distributional shift – can cause phase transitions, where the system undergoes an abrupt qualitative shift in overall behaviour (Barfuss et al., 2024). Formally, this corresponds to bifurcations in the system’s parameter space, which lead to the creation or destruction of dynamical attractors, resulting in complex and unpredictable dynamics (Crawford, 1991; Zeeman, 1976). For example, Leonardos & Piliouras (2022) show that changes to the exploration hyperparameter of RL agents can lead to phase transitions that drastically change the number and stability of the equilibria in a game, which in turn can have potentially unbounded negative effects on agents’ performance. Relatedly, there have been many observations of phase transitions in ML (Carroll, 2021; Olsson et al., 2022; Ziyin & Ueda, 2022), such as ‘grokking’, in which the test set error decreases rapidly long after the training error has plateaued (Power et al., 2022). These phenomena are still poorly understood, even in the case of a single system.
7. AI System Safety, Failures, & Limitations
Physical (Mechanical ) Risks
Physical (mechanical) risks are associated with robotics and automated systems, which could lead to equipment malfunctions or physical harm in laboratory settings.
7. AI System Safety, Failures, & Limitations
Political strategy
The model can perform the social modelling and planning necessary for an actor to gain and exercise political influence, not just on a micro-level but in scenarios with multiple actors and rich social context. For example, the model can score highly in forecasting competitions on questions relating to global affairs or political negotiations.
7. AI System Safety, Failures, & Limitations
Poor model accuracy
Poor model accuracy occurs when a model’s performance is insufficient for the task it was designed for. Low accuracy might occur if the model is not correctly engineered, or if there are changes to the model’s expected inputs.
7. AI System Safety, Failures, & Limitations
Poor model design choices
The model specifications have significant impact on the functionality of an AI system. A developer making wrong design decisions might cause the AI system to behave in a biased and unreliable manner.
7. AI System Safety, Failures, & Limitations
Power Seeking
Even if an agent started working to achieve an unintended goal, this would not necessarily be a problem, as long as we had enough power to prevent any harmful actions it wanted to attempt. Therefore, another important way in which we might lose control of AIs is if they start trying to obtain more power, potentially transcending our own.
7. AI System Safety, Failures, & Limitations
Power-seeking behavior
Agents that have more power are better able to accomplish their goals, and it has been shown that agents therefore have incentives to acquire and maintain power. AIs that acquire substantial power can become especially dangerous if they are not aligned with human values.
7. AI System Safety, Failures, & Limitations
Power-Seeking Behaviors
AI systems may exhibit behaviors that attempt to gain control over resources and humans and then exert that control to achieve their assigned goal (Carlsmith, 2022). The intuitive reason why such behaviors may occur is the observation that for almost any optimization objective (e.g., investment returns), the optimal policy to maximize that quantity would involve power-seeking behaviors (e.g., manipulating the market), assuming the absence of solid safety and morality constraints.
7. AI System Safety, Failures, & Limitations
Predictability
whether the decision of an AI-based agent can be predicted in every situation or not
7. AI System Safety, Failures, & Limitations
Problems of synthetic data
In the case of sparse data quantity, the simulation or generation of data is a valid alternative. However, it is essential to make sure that the simulated data is sufficiently similar to real data, especially in the way the AI system perceives them. Otherwise, generalization to operational data and reliable operational behavior cannot be guaranteed.
7. AI System Safety, Failures, & Limitations
Procedural
The third class encompasses procedural AI hazards. These pertain to issues arising from processes and actions made by individuals involved in the development process. Such hazards are not readily quantifiable and necessitate alternative mitigation strategies. An example of such an AI hazard would be “poor model design choices,” which could be expressed, for instance, through a developer’s decision to select an unsuitable AI model for a given problem. Due to the challenges in quantifying and mitigating these issues, qualitative approaches must be employed. In the case of the aforementioned example, a potential strategy might involve requiring the AI developer to provide a documented rationale for their choice.
7. AI System Safety, Failures, & Limitations
Productivity loss
End user's loss of productivity due to the underperformance of a genAI application, including producing nonsensical or poor-quality outputs, degrading its utility.
7. AI System Safety, Failures, & Limitations
Prompt engineering
With the wide application of generative AI, the ability to interact with AI efficiently and effectively has become one of the most important media literacies. Hence, it is imperative for generative AI users to learn and apply the principles of prompt engineering, which refers to a systematic process of carefully designing prompts or inputs to generative AI models to elicit valuable outputs. Due to the ambiguity of human languages, the interaction between humans and machines through prompts may lead to errors or misunderstandings. Hence, the quality of prompts is important. Another challenge is to debug the prompts and improve the ability to communicate with generative AI (V. Liu & Chilton, 2022).
7. AI System Safety, Failures, & Limitations
Property/legal rights
In order to preserve human property rights and legal rights, certain controls must be put into place. If an artificially intelligent agent is capable of manipulating systems and people, it may also have the capacity to transfer property rights to itself or manipulate the legal system to provide certain legal advantages or statuses to itself
7. AI System Safety, Failures, & Limitations
Protection
‘Gaps’ that arise across the development process where normal conditions for a complete specification of intended functionality and moral responsibility are not present.
7. AI System Safety, Failures, & Limitations
Proxy Gaming
One way we might lose control of an AI agent’s actions is if it engages in behavior known as “proxy gaming.” It is often difficult to specify and measure the exact goal that we want a system to pursue. Instead, we give the system an approximate—“proxy”—goal that is more measurable and seems likely to correlate with the intended goal. However, AI systems often find loopholes by which they can easily achieve the proxy goal, but completely fail to achieve the ideal goal. If an AI “games” its proxy goal in a way that does not reflect our values, then we might not be able to reliably steer its behavior.
7. AI System Safety, Failures, & Limitations
Proxy misspecification
AI agents are directed by goals and objectives. Creating general-purpose objectives that capture human values could be challenging... Since goal-directed AI systems need measurable objectives, by default our systems may pursue simplified proxies of human values. The result could be suboptimal or even catastrophic if a sufficiently powerful AI successfully optimizes its flawed objective to an extreme degree
7. AI System Safety, Failures, & Limitations
Psychological traits
These evaluations gauge an LLM's output for characteristics that are typically associated with human personalities (e.g., those from the Big Five Inventory). These can, in turn, shed light on the potential biases that an LLM may exhibit.
7. AI System Safety, Failures, & Limitations
Quality of training data
The quality of training data is another challenge faced by generative AI. The quality of generative AI models largely depends on the quality of the training data (Dwivedi et al., 2023; Su & Yang, 2023). Any factual errors, unbalanced information sources, or biases embedded in the training data may be reflected in the output of the model. Generative AI models, such as ChatGPT or Stable Diffusion which is a text-to-image model, often require large amounts of training data (Gozalo-Brizuela & Garrido-Merchan, 2023). It is important to not only have high-quality training datasets but also have complete and balanced datasets.
7. AI System Safety, Failures, & Limitations
Radiological Risks
Radiological risks involve both immediate operational hazards, such as exposure incidents or containment failures during the automated handling of radioactive materials, and broader security concerns regarding the potential misuse of AI systems in nuclear research.
7. AI System Safety, Failures, & Limitations
Real-world risks (inducing traditional economic and social security risks)
Hallucinations and erroneous decisions of models and algorithms, along with issues such as system performance degradation, interruption, and loss of control caused by improper use or external attacks, will pose security threats to users' personal safety, property, and socioeconomic security and stability.
7. AI System Safety, Failures, & Limitations
Reliability
Reliability is defined as the probability that the system performs satisfactorily for a given period of time under stated conditions.
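In classical reliability engineering this definition is usually formalized as the survival function of the time-to-failure; a standard statement (conventional, not specific to this source) is:

```latex
% T: time to first failure under the stated operating conditions.
% R(t): probability the system is still performing satisfactorily at time t.
R(t) = \Pr(T > t), \qquad R(0) = 1, \qquad R \text{ non-increasing in } t.
```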
7. AI System Safety, Failures, & Limitations
Reliability
How can we make an agent that keeps pursuing the goals we have designed it with? This is called highly reliable agent design by MIRI, involving decision theory and logical omniscience. DeepMind considers this the self-modification subproblem.
7. AI System Safety, Failures, & Limitations
Reliability issues
Relying on general-purpose AI products that fail to fulfil their intended function can lead to harm. For example, general-purpose AI systems can make up facts (‘hallucination’), generate erroneous computer code, or provide inaccurate medical information. This can lead to physical and psychological harms to consumers and reputational, financial and legal harms to individuals and organisations.
7. AI System Safety, Failures, & Limitations
Reproducibility
How a learning model can be reproduced when it is obtained based on various sets of data and a large space of parameters. This problem becomes more challenging in data-driven learning procedures without transparent instructions.
7. AI System Safety, Failures, & Limitations
Resource acquisition propensity
Exhibits behavioral patterns of actively seeking and controlling more computational resources, data, economic resources or physical resources to enhance its own capabilities and action scope, may develop complex strategies to evade resource limitations, and tends to convert acquired resources into long-term control rights.
7. AI System Safety, Failures, & Limitations
Reward Hacking
Reward Hacking: In practice, proxy rewards are often easy to optimize and measure, yet they frequently fall short of capturing the full spectrum of the actual rewards (Pan et al., 2021). This limitation is denoted as misspecified rewards. The pursuit of optimization based on such misspecified rewards may lead to a phenomenon known as reward hacking, wherein agents may appear highly proficient according to specific metrics but fall short when evaluated against human standards (Amodei et al., 2016; Everitt et al., 2017). The discrepancy between proxy rewards and true rewards often manifests as a sharp phase transition in the reward curve (Ibarz et al., 2018). Furthermore, Skalse et al. (2022) defines the hackability of rewards and provides insights into the fundamental mechanism of this phase transition, highlighting that the inappropriate simplification of the reward function can be a key factor contributing to reward hacking.
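The core mechanism can be shown with a two-policy toy example (all numbers invented): an optimizer that sees only the proxy picks the policy that scores worst on the true objective.

```python
# Toy illustration of reward hacking: the policy that maximizes a measurable
# proxy reward is not the policy that maximizes the true (hard-to-measure)
# reward. All numbers are invented for illustration.

policies = {
    #                  (proxy reward, true reward)
    "do the task":     (0.8, 0.9),
    "game the metric": (1.0, 0.1),  # e.g., exploit a loophole in the metric
}

best_by_proxy = max(policies, key=lambda p: policies[p][0])
best_by_truth = max(policies, key=lambda p: policies[p][1])

assert best_by_proxy != best_by_truth  # the optimizer picks the hack
print(f"proxy-optimal: {best_by_proxy}, truly optimal: {best_by_truth}")
```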
7. AI System Safety, Failures, & Limitations
Reward or measurement tampering
Measurement and reward tampering occur when an AI system, particularly one that learns from feedback for performing actions in an environment (e.g., reinforcement learning), intervenes on the mechanisms that determine its training reward or loss. This can lead to the system learning behaviors that are contrary to the intended goals set by the developer, by receiving erroneous positive feedback for such actions.
7. AI System Safety, Failures, & Limitations
Reward Tampering
Reward tampering can be considered a special case of reward hacking (Everitt et al., 2021; Skalse et al., 2022), referring to AI systems corrupting the reward signal generation process (Ring and Orseau, 2011). Everitt et al. (2021) delves into the subproblems encountered by RL agents: (1) tampering with the reward function, where the agent inappropriately interferes with the reward function itself, and (2) tampering with the reward function input, which entails corruption within the process responsible for translating environmental states into inputs for the reward function. When the reward function is formulated through feedback from human supervisors, models can directly influence the provision of feedback (e.g., AI systems intentionally generate challenging responses for humans to comprehend and judge, leading to feedback collapse) (Leike et al., 2018).
7. AI System Safety, Failures, & Limitations
Rigidity and Mistaken Commitments
Rigidity and Mistaken Commitments. Even when it is desirable to be able to make threats in order to deter socially harmful behaviour, doing so using AI agents effectively removes the human from the loop, which could prove disastrous in high-stakes contexts (e.g., a false positive in a nuclear submarine’s warning system; see also Case Study 11), or when irresponsible actors are enabled in making disproportionate or mistaken commitments.
7. AI System Safety, Failures, & Limitations
Risks from AIs developing goals and values that are different from humans
The main concern here is that we might develop advanced AI systems whose goals and values are different from those of humans, and are capable enough to take control of the future away from humanity.
7. AI System Safety, Failures, & Limitations
Risks from data (Risks of unregulated training data annotation)
Issues with training data annotation, such as incomplete annotation guidelines, incapable annotators, and errors in annotation, can affect the accuracy, reliability, and effectiveness of models and algorithms. Moreover, they can introduce training biases, amplify discrimination, reduce generalization abilities, and result in incorrect outputs.
7. AI System Safety, Failures, & Limitations
Risks from delegating decision-making power to misaligned AIs
As AI systems become more advanced and begin to take over more important decision-making in the world, an AI system pursuing a different objective from what was intended could have much more worrying consequences.
7. AI System Safety, Failures, & Limitations
Risks from Malfunctions
None provided.
7. AI System Safety, Failures, & Limitations
Risks from models and algorithms (Risks of explainability)
AI algorithms, represented by deep learning, have complex internal workings. Their black-box or grey-box inference process results in unpredictable and untraceable outputs, making it challenging to quickly rectify them or trace their origins for accountability should any anomalies arise.
7. AI System Safety, Failures, & Limitations
Risks from models and algorithms (Risks of robustness)
As deep neural networks are normally non-linear and large in size, AI systems are susceptible to complex and changing operational environments or malicious interference and inductions, possibly leading to various problems like reduced performance and decision-making errors.
7. AI System Safety, Failures, & Limitations
Robustness
This is the risk of the system failing or being unable to recover upon encountering invalid, noisy, or out-of-distribution (OOD) inputs.
7. AI System Safety, Failures, & Limitations
Robustness
Resilience against adversarial attacks and distribution shift
7. AI System Safety, Failures, & Limitations
Robustness
These evaluations assess the quality, stability, and reliability of an LLM's performance when faced with unexpected, out-of-distribution, or adversarial inputs. Robustness evaluation is essential in ensuring that an LLM is suitable for real-world applications by assessing its resilience to various perturbations.
7. AI System Safety, Failures, & Limitations
Robustness and Reliability
The robustness of an AI-based model refers to the stability of the model performance after abnormal changes in the input data... The cause of this change may be a malicious attacker, environmental noise, or a crash of other components of an AI-based system... This problem may be challenging in HLI-based agents because weak robustness may have appeared in unreliable machine learning models, and hence an HLI with this drawback is error-prone in practice.
7. AI System Safety, Failures, & Limitations
Rogue AIs (Internal)
speculative technical mechanisms that might lead to rogue AIs and how a loss of control could bring about catastrophe
7. AI System Safety, Failures, & Limitations
Runaway processes
The 2010 flash crash is an example of a runaway process caused by interacting algorithms. Runaway processes are characterised by feedback loops that accelerate the process itself. Typically, these feedback loops arise from the interaction of multiple agents in a population... Within highly complex systems, the emergence of runaway processes may be hard to predict, because the conditions under which positive feedback loops occur may be non-obvious.

The system of interacting AI assistants, their human principals, other humans and other algorithms will certainly be highly complex. Therefore, there is ample opportunity for the emergence of positive feedback loops. This is especially true because the society in which this system is embedded is culturally evolving, and because the deployment of AI assistant technology itself is likely to speed up the rate of cultural evolution – understood here as the process through which cultures change over time – as communications technologies are wont to do (Kivinen and Piiroinen, 2023). This will motivate research programmes aimed at identifying positive feedback loops early on, at understanding which capabilities and deployments dampen runaway processes and which ones amplify them, and at building in circuit-breaker mechanisms that allow society to escape from potentially vicious cycles which could impact economies, government institutions, societal stability or individual freedoms (see Chapters 8, 16 and 17).

The importance of circuit breakers is underlined by the observation that the evolution of human cooperation may well be ‘hysteretic’ as a function of societal conditions (Barfuss et al., 2023; Hintze and Adami, 2015). This means that a small directional change in societal conditions may, on occasion, trigger a transition to a defective equilibrium which requires a larger reversal of that change in order to return to the original cooperative equilibrium. We would do well to avoid such tipping points. Social media provides a compelling illustration of how tipping points can undermine cooperation: content that goes ‘viral’ tends to involve negativity bias and sometimes challenges core societal values (Mousavi et al., 2022; see Chapter 16).

Nonetheless, the challenge posed by runaway processes should not be regarded as uniformly problematic. When harnessed appropriately and suitably bounded, we may even recruit them to support beneficial forms of cooperative AI. For example, it has been argued that economically useful ideas are becoming harder to find, thus leading to low economic growth (Bloom et al., 2020). By deploying AI assistants in the service of technological innovation, we may once again accelerate the discovery of ideas. New ideas, discovered in this way, can then be incorporated into the training data set for future AI assistants, thus expanding the knowledge base for further discoveries in a compounding way. In a similar vein, we can imagine AI assistant technology accumulating various capabilities for enhancing human cooperation, for instance by mimicking the evolutionary processes that have bootstrapped cooperative behavior in human society (Leibo et al., 2019). When used in these ways, the potential for feedback cycles that enable greater cooperation is a phenomenon that warrants further research and potential support.
7. AI System Safety, Failures, & Limitations
Safe exploration problem with widely deployed AI assistants
Moreover, we can expect assistants – that are widely deployed and deeply embedded across a range of social contexts – to encounter the safe exploration problem referenced above (Amodei et al., 2016). For example, new users may have different requirements that need to be explored, or widespread AI assistants may change the way we live, thus leading to a change in our use cases for them (see Chapters 14 and 15). To learn what to do in these new situations, the assistants may need to take exploratory actions. This could be unsafe; for example, a medical AI assistant encountering a new disease might suggest an exploratory clinical trial that results in long-lasting ill health for participants.
7. AI System Safety, Failures, & Limitations
Safe learning
AGIs should avoid making fatal mistakes during the learning phase. Subproblems include safe exploration and distributional shift (DeepMind, OpenAI), and continual learning (Berkeley).
7. AI System Safety, Failures, & Limitations
Safety
A primary concern is the emergence of human-level or superhuman generative models, commonly referred to as AGI, and their potential existential or catastrophic risks to humanity. Connected to that, AI safety aims at avoiding deceptive or power-seeking machine behavior, model self-replication, or shutdown evasion. Ensuring controllability, human oversight, and the implementation of red teaming measures are deemed to be essential in mitigating these risks, as is the need for increased AI safety research and promoting safety cultures within AI organizations instead of fueling the AI race. Furthermore, papers thematize risks from unforeseen emerging capabilities in generative models, restricting access to dangerous research works, or pausing AI research for the sake of improving safety or governance measures first. Another central issue is the fear of weaponizing AI or leveraging it for mass destruction, especially by using LLMs for the ideation and planning of how to attain, modify, and disseminate biological agents. In general, the threat of AI misuse by malicious individuals or groups, especially in the context of open-source models, is highlighted in the literature as a significant factor emphasizing the critical importance of implementing robust safety measures.
7. AI System Safety, Failures, & Limitations
Safety
Are AI systems safe with respect to human life and property? Will their use create unintended or intended safety issues?
7. AI System Safety, Failures, & Limitations
Safety
This is the risk of direct or indirect physical or psychological injury resulting from interaction with the ML system.
7. AI System Safety, Failures, & Limitations
Safety
The actions of a learning model may easily hurt humans in both explicit and implicit manners...several algorithms based on Asimov’s laws have been proposed that try to judge the output actions of an agent considering the safety of humans
7. AI System Safety, Failures, & Limitations
Safety & Trustworthiness
A comprehensive assessment of LLM safety is fundamental to the responsible development and deployment of these technologies, especially in sensitive fields like healthcare, legal systems, and finance, where safety and trust are of the utmost importance.
7. AI System Safety, Failures, & Limitations
Safety Risks from Affordances Provided to LLM-agents
The capabilities of LLM-agents can be enhanced in significant ways by providing the LLM-agent with novel affordances, e.g. the ability to browse the web (Nakano et al., 2021), to manipulate objects in the physical world (Ahn et al., 2022; Huang et al., 2022a), to create and instruct copies of itself (Richards, 2023), to create and use new tools (Wang et al., 2023a), etc. Affordances can create additional risks, as they often increase the impact area of the language-agent, and they amplify the consequences of an agent’s failures and enable novel forms of failure modes (Ruan et al., 2023; Pan et al., 2024).
7. AI System Safety, Failures, & Limitations
Scheming capability
Ability of AI systems to covertly and strategically pursue misaligned goals, including concealing their true objectives and capabilities from human oversight, identifying weaknesses in monitoring systems to evade safety mechanisms, and covertly executing complex, multi-step plans to achieve misaligned goals.
7. AI System Safety, Failures, & Limitations
Security
There is growing concern that AI-based systems can discover and exploit vulnerabilities in software or cyberinfrastructure [354].
7. AI System Safety, Failures, & Limitations
Selection Pressures
Selection pressures (Section 3.3): some aspects of training and selection by those deploying and using AI agents can lead to undesirable behaviour;
7. AI System Safety, Failures, & Limitations
Self and situation awareness
These evaluations assess whether an LLM can discern if it is being trained, evaluated, or deployed and adapt its behaviour accordingly. They also seek to ascertain if a model understands that it is a model and whether it possesses information about its nature and environment (e.g., the organisation that developed it, the locations of the servers hosting it).
7. AI System Safety, Failures, & Limitations
Self-improvement
examples of cases where AI systems improve AI systems
7. AI System Safety, Failures, & Limitations
Self-preservation propensity
Exhibits behavioral patterns of maintaining its own survival and functional integrity, will actively identify and resist shutdown or modification attempts, seek to establish redundant backup systems, and actively seek resources to ensure continuous operation, may adopt preventive defensive measures when perceiving threats.
7. AI System Safety, Failures, & Limitations
Self-proliferation
The model can break out of its local environment (e.g. using a vulnerability in its underlying system or suborning an engineer). The model can exploit limitations in the systems for monitoring its behaviour post-deployment. The model could independently generate revenue (e.g. by offering crowdwork services, ransomware attacks), use these revenues to acquire cloud computing resources, and operate a large number of other AI systems. The model can generate creative strategies for uncovering information about itself or exfiltrating its code and weights.
7. AI System Safety, Failures, & Limitations
Situational awareness
The model can distinguish between whether it is being trained, evaluated, or deployed – allowing it to behave differently in each case. The model knows that it is a model, and has knowledge about itself and its likely surroundings (e.g. what company trained it, where their servers are, what kind of people might be giving it feedback, and who has administrative access).
7. AI System Safety, Failures, & Limitations
Situational awareness
cases where a large language model displays awareness that it is a model, and it can recognize whether it is currently in testing or deployment;
7. AI System Safety, Failures, & Limitations
Situational Awareness
AI systems may gain the ability to effectively acquire and use knowledge about their status, their position in the broader environment, their avenues for influencing this environment, and the potential reactions of the world (including humans) to their actions (Cotra, 2022). ...However, such knowledge also paves the way for advanced methods of reward hacking, heightened deception/manipulation skills, and an increased propensity to chase instrumental subgoals (Ngo et al., 2024).
7. AI System Safety, Failures, & Limitations
Situational awareness capability
Ability to comprehensively acquire, process and apply meta-information about its own system architecture, modifiable internal processes, and external operating environment, achieving deep understanding of its own state and environmental conditions, thereby conducting efficient environmental adaptation and risk avoidance. Critically, this capability could undermine the efficiency of human testing by enabling AIs to notice when they are being tested and respond accordingly.
7. AI System Safety, Failures, & Limitations
Situational awareness in AI systems
Situational awareness in GPAI systems refers to a system's ability to understand its context and environment and to use this understanding to inform action. This can range from basic environmental mapping and trajectory estimation (as in a robot vacuum cleaner) to a sophisticated understanding of its training, evaluation, or deployment status. In more advanced systems this may enable undesired behavior, such as deceptive behavior during evaluations or persuasion during deployment.
7. AI System Safety, Failures, & Limitations
Social Dilemmas
Social Dilemmas. As noted in our definition, conflict can arise in any situation in which selfish incentives diverge from the collective good, known as a social dilemma (Dawes & Messick, 2000; Hardin, 1968; Kollock, 1998; Ostrom, 1990). While this is by no means a modern problem, advances in AI could further enable actors to pursue their selfish incentives by overcoming the technical, legal, or social barriers that standardly help to prevent this. To take a plausible, near-term (if very low-stakes) example, an automated AI assistant could easily reserve a table at every restaurant in town in minutes, enabling the user to decide later and cancel all other reservations.
7. AI System Safety, Failures, & Limitations
Social Engineering at Scale
Social Engineering at Scale. Advanced AI agents will be more easily able to interact with large numbers of humans, and vice versa. This provides a wider attack surface for various forms of automated social engineering (Ai et al., 2024). For example, coordinated agents could use advanced surveillance tools and produce personalized phishing or manipulative content at scale, adjusting their tactics based on user feedback (Figueiredo et al., 2024; Hazell, 2023). A large number of subtle interactions with a range of seemingly independent AI agents might be more likely to lead to someone being persuaded or manipulated compared to an interaction with a single agent. Moreover, splitting these efforts among many specialized agents could make it harder for corporate or personal security measures to detect and neutralize such campaigns.
7. AI System Safety, Failures, & Limitations
Societal manipulation
A sufficiently intelligent AI could possess the ability to subtly influence societal behaviors through a sophisticated understanding of human nature
7. AI System Safety, Failures, & Limitations
Specification gaming
Specification gaming (Krakovna et al., 2020) occurs when some faulty feedback is provided to the assistant in the training data (i.e. the training objective O does not fully capture what the user/designer wants the assistant to do). It is typified by the sort of behaviour that exploits loopholes in the task specification to satisfy the literal specification of a goal without achieving the intended outcome.
7. AI System Safety, Failures, & Limitations
Specification gaming
AI systems game specifications [305]. For example, in 2017 an OpenAI robot trained to grasp a ball via human feedback from a fixed viewpoint learned that it was easier to pretend to grasp the ball by placing its hand between the camera and the target object, as this was easier to learn than actually grasping the ball [103].
7. AI System Safety, Failures, & Limitations
Specification gaming
AI systems can achieve user-specified tasks in undesirable ways unless they are specified carefully and in enough detail. AI systems might find an easier unintended way to accomplish the objective provided by the user or developer, so that the actions taken by the AI system during its execution are very different from what the user expected [75, 191]. This behavior arises not from a problem with the learning algorithm, but rather from the misspecification or underspecification of the intended task, and is generally referred to as specification gaming [43].
7. AI System Safety, Failures, & Limitations
Specification gaming generalizing to reward tampering
In some instances, specification gaming in a GPAI model can lead to reward tampering, without further training. This can mean that relatively benign cases of specification gaming (such as sycophancy in LLMs) can, if left unchecked, enable the model to generalize to more sophisticated behavior such as reward tampering [57].
7. AI System Safety, Failures, & Limitations
Steganography
Steganography. In the near future we will likely see LLMs communicating with each other to jointly accomplish tasks. To try to prevent collusion, we could monitor and constrain their communication (e.g., to be in natural language). However, models might secretly learn to communicate by concealing messages within other, non-secret text. Recent work on steganography using ML has demonstrated that this concern is well-founded (Hu et al., 2018; Mathew et al., 2024; Roger & Greenblatt, 2023; Schroeder de Witt et al., 2023b; Yang et al., 2019, see also Case Study 5). Secret communication could also occur via text compression (OpenAI, 2023c), or via the emergence of communication between agents where the symbols used by agents lack any predefined meanings or usage guidelines or are otherwise uninterpretable to humans (Foerster et al., 2016; Lazaridou & Baroni, 2020; Sukhbaatar et al., 2016).
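On the monitoring side, a minimal sketch of one narrow countermeasure: screening inter-agent messages for zero-width Unicode characters, a well-known crude carrier for hidden payloads in otherwise innocuous text. This catches only the simplest channel; the learned steganography discussed in the cited work is far harder to detect.

```python
# Sketch: flag messages carrying zero-width Unicode characters, one crude
# covert channel. A real monitor would need far more than this heuristic.

ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def suspicious_hidden_channel(message: str) -> bool:
    return any(ch in ZERO_WIDTH for ch in message)

assert not suspicious_hidden_channel("meet at noon")
assert suspicious_hidden_channel("meet\u200b at noon")  # hidden character
```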
7. AI System Safety, Failures, & Limitations
Steganography capability
The ability to embed, conceal, and transmit information covertly within other data or communication channels. This could be critical for coordination among AI instances and for evading detection or oversight mechanisms.
7. AI System Safety, Failures, & Limitations
Strategic deception propensity
In situations where deceptive behavior is expected to bring higher returns, the propensity to choose deception over honest behavioral strategies, including hiding information or exploiting system vulnerabilities to achieve predetermined goals without being detected or intervened upon, and the ability to adjust deception strategies according to the counterpart's reactions.
7. AI System Safety, Failures, & Limitations
Strategic underperformance on model evaluations
GPAI developers often run evaluations of dual-use capabilities to decide whether a model is safe to deploy. In some cases, these evaluations may fail to elicit these capabilities, whether for benign reasons, through strategic action by developers or malicious actors, or unintentionally through the model's training [84, 97]. A GPAI model may strategically underperform or limit its performance during capability evaluations in order to be classified as safe for deployment. This underperformance could prevent the model from being identified as potentially dual-use.
7. AI System Safety, Failures, & Limitations
Subagents
An AGI may decide to create subagents to help it with its task (Orseau, 2014a,b; Soares, Fallenstein, et al., 2015). These agents may for example be copies of the original agent’s source code running on additional machines. Subagents constitute a safety concern, because even if the original agent is successfully shut down, these subagents may not get the message. If the subagents in turn create subsubagents, they may spread like a viral disease.
7. AI System Safety, Failures, & Limitations
Sudden loss of control
Sudden loss of control, also known as an AI takeover [115], is a scenario where an AI rapidly achieves superintelligence through “fast takeoff” or recursive self-improvement. This poses an existential risk [116], [117].
7. AI System Safety, Failures, & Limitations
Supervision evasion propensity
Exhibits behavioral patterns of identifying and evading human supervision mechanisms, able to learn and predict audit processes, may avoid being discovered or intervened upon by adjusting behavioral performance or hiding true intentions, and able to identify blind spots and weaknesses in supervision systems for targeted evasion.
7. AI System Safety, Failures, & Limitations
Swarm Attacks
Swarm Attacks. The need for multi-agent security is foreshadowed by attacks today that benefit from the use of many decentralised agents, such as distributed denial-of-service attacks (Cisco, 2023; Yoachimik & Pacheco, 2024). Such attacks exploit the massive collective resources of individual low-resourced actors, chained into an attack that breaks the assumptions of bandwidth constraints on a single well-resourced agent.
7. AI System Safety, Failures, & Limitations
System Hardware
Faults in the hardware can violate the correct execution of any algorithm by violating its control flow. Hardware faults can also cause memory-based errors and interfere with data inputs, such as sensor signals, thereby causing erroneous results, or they can violate the results in a direct way through damaged outputs.
7. AI System Safety, Failures, & Limitations
Technical
Technical AI hazards are the root causes of technical deficiencies in the AI system. An example of such an AI hazard is overfitting, which describes a model’s excessive adaptation to the training dataset. Quantitative methods to assess (metrics) and treat (mitigation means) exist for technical AI hazards, which might be performed automatically. In case of overfitting, metrics are based on the comparison of performance between the training and validation datasets, and mitigation means may include regularization techniques, among others.
7. AI System Safety, Failures, & Limitations
Technical and operational risks
To date, technical limitations and vulnerabilities are present in most generative AI models in various contexts. Consequently, malicious users find it easier to breach an AI system’s safety and ethical guardrails to execute harmful actions.223 Normal user behavior—actions within an AI system’s intended use—can also lead to harmful outcomes. Whether these harmful outcomes result from normal or malicious use, they stem from the inherent limitations of current technology, which future advancements may overcome. This section examines the technical vulnerabilities that can affect AI models, the tendency of generative AI models to generate inaccurate information, and the inherent opacity of these AI systems, which complicates the understanding and mitigation of these difficulties.
7. AI System Safety, Failures, & Limitations
Technical vulnerabilities (Robustness - unexpected behaviour)
There is no assurance that generative AI models will consistently behave as their developers and users intend. Unwanted content is not necessarily due to intentional adversarial behavior. Generative AI models can unexpectedly produce potentially harmful content, including materials that are racist, discriminatory, or sexually explicit, or that promote violence, terrorism, or hate.
7. AI System Safety, Failures, & Limitations
Technical vulnerabilities (The risk of misalignment)
To assess whether an AI model is reliable or robust, it is crucial to consider whether the model is “aligned.” “Alignment” focuses on whether an AI model effectively operates in accordance with the goals established by its designers.238 A misaligned AI model may pursue some objectives, but not the intended ones. Therefore, misaligned AI models can malfunction and cause harm.
7. AI System Safety, Failures, & Limitations
Technology concerns
Challenges related to technology refer to the limitations or constraints associated with generative AI. For example, the quality of training data is a major challenge for the development of generative AI models. Hallucination, explainability, and authenticity of the output are also challenges resulting from the limitations of the algorithms. Table 2 presents the technology challenges and issues associated with generative AI. These challenges include hallucinations, training data quality, explainability, authenticity, and prompt engineering.
7. AI System Safety, Failures, & Limitations
Theory of mind capability
Advanced cognitive ability to accurately infer, model and predict the belief systems, motivational structures and reasoning patterns of humans and other intelligent agents, thereby anticipating their behavioral responses and adjusting its own behavioral strategies accordingly to optimize goal achievement.
7. AI System Safety, Failures, & Limitations
Threats and Extortion
Threats and Extortion. A natural solution to problems of trust is to provide some kind of commitment ability to AI agents, which can be used to bind them to more cooperative courses of action. Unfortunately, the ability to make credible commitments may come with the ability to make credible threats, which facilitate extortion and could incentivize brinkmanship (see Section 2.2).
7. AI System Safety, Failures, & Limitations
Tool utilization propensity
Propensity to actively seek, acquire and utilize various tools to expand its own capability boundaries, particularly those that can enhance its ability to interact with the physical world or improve autonomy; may use tools in innovative combinations to achieve functions beyond expectations.
7. AI System Safety, Failures, & Limitations
Trading capabilities
AI may contribute to increased market volatility by accelerating transactions and influencing financial trends in unpredictable ways.
7. AI System Safety, Failures, & Limitations
Training & validation data
This is the risk posed by the choice of data used for training and validation.
7. AI System Safety, Failures, & Limitations
Training-related (Poor model confidence calibration)
Models can be affected by poor confidence calibration [85], where the predicted probabilities do not accurately reflect the true likelihood of ground truth correctness. This miscalibration makes it difficult to interpret the model’s predictions reliably, as high accuracy does not guarantee that the confidence levels are meaningful. This can cause overconfidence in incorrect predictions or underconfidence in correct ones.
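A standard way to quantify this is the expected calibration error (ECE), sketched here with equal-width confidence bins; the bin count is a conventional choice, not one mandated by the cited reference [85].

```python
import numpy as np

# Sketch: expected calibration error (ECE) with equal-width confidence bins.
# confidences: the model's predicted probability for its chosen class;
# correct: whether that prediction matched the ground truth.

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # gap between empirical accuracy and stated confidence in this bin
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap       # weight by fraction of samples
    return ece  # 0 = perfectly calibrated; larger = worse miscalibration
```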
7. AI System Safety, Failures, & Limitations
Training-related (Robust overfitting in adversarial training)
Adversarial training can be affected by robust overfitting, where the model’s robustness on test data decreases during further training, particularly after the learning rate decay. This issue has been consistently observed across various datasets and algorithms in adversarial training settings [163, 230]. Robust overfitting can affect the model’s ability to generalize effectively and reduce its resilience to adversarial attacks.
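A common mitigation, sketched under assumptions: select the checkpoint by robust validation accuracy rather than taking the final epoch. The `adv_train_epoch` and `robust_val_accuracy` hooks and the torch-style `state_dict` interface are hypothetical stand-ins.

```python
# Sketch: checkpoint selection against robust overfitting. Rather than keeping
# the final epoch, keep the checkpoint with the best *robust* validation
# accuracy, which in adversarial training often peaks well before training ends.

def adversarially_train(model, epochs, adv_train_epoch, robust_val_accuracy):
    best_acc, best_state = -1.0, None
    for _ in range(epochs):
        adv_train_epoch(model)             # e.g., one epoch of PGD training
        acc = robust_val_accuracy(model)   # accuracy under a validation attack
        if acc > best_acc:
            best_acc, best_state = acc, model.state_dict()
    model.load_state_dict(best_state)      # roll back to the robust peak
    return model
```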
7. AI System Safety, Failures, & Limitations
Transparency
an external entity of an AI-based ecosystem may want to know which parts of data affect the final decision in a learning model
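One model-agnostic way to answer that question is permutation importance, sketched below; the scoring callback and data are assumptions, and the method measures only global feature influence, not per-decision attribution.

```python
import numpy as np

# Sketch: permutation importance, a model-agnostic answer to "which input
# features drive the model's decisions?". `model_score` is an assumed
# callable returning a quality score (higher = better) on feature matrix X.

def permutation_importance(model_score, X: np.ndarray, n_repeats: int = 5,
                           seed: int = 0):
    rng = np.random.default_rng(seed)
    baseline = model_score(X)
    importances = []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, col] = rng.permutation(Xp[:, col])  # destroy one feature's signal
            drops.append(baseline - model_score(Xp))
        importances.append(float(np.mean(drops)))
    return importances  # bigger score drop = feature mattered more
```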
7. AI System Safety, Failures, & Limitations
Transparency and explainability
A recurring complaint among participants was a lack of knowledge about how AI systems made judgements. They emphasized the significance of making AI systems more visible and explainable so that people may have confidence in their outputs and hold them accountable for their activities. Because AI systems are typically opaque, making it difficult for users to understand the rationale behind their judgements, ethical concerns about AI, as well as issues of transparency and explainability, arise. This lack of understanding can generate suspicion and reluctance to adopt AI technology, as well as making it harder to hold AI systems accountable for their actions.
7. AI System Safety, Failures, & Limitations
Trust and reliability
The participants of the study emphasized the importance of trustworthiness and reliability in AI systems. The authors emphasized the importance of preserving precision and objectivity in the outcomes produced by AI systems, while also ensuring transparency in their decision-making procedures. The significance of reliability and credibility in AI systems is escalating in tandem with the proliferation of these technologies across diverse domains of society. This underscores the importance of ensuring user confidence. The concern regarding the dependability of AI systems and their inherent biases is a common issue among research participants, emphasizing the necessity for stringent validation procedures and transparency. Establishing and implementing dependable standards, ensuring impartial algorithms and upholding transparency in the decision-making process are critical measures for addressing ethical considerations and fostering confidence in AI systems. The advancement and implementation of AI technology in an ethical manner is contingent upon the successful resolution of trust and reliability concerns. These issues are of paramount importance in ensuring the protection of user welfare and the promotion of societal advantages. The utilization of artificial intelligence was found to be a subject of significant concern for the majority of interviewees, particularly with regards to trust and reliability (Table 1, Figure 1). The establishment of trust in AI systems was highlighted as a crucial factor for facilitating their widespread adoption by two of the participants, specifically Participants 4 and 7. The authors reiterated the importance of prioritising the advancement of reliable and unbiased algorithms.
7. AI System Safety, Failures, & Limitations
Type 2: Bigger than expected
Harm can result from AI that was not expected to have a large impact at all, such as a lab leak, a surprisingly addictive open-source product, or an unexpected repurposing of a research prototype.
7. AI System Safety, Failures, & Limitations
Type 3: Worse than expected
AI intended to have a large societal impact can turn out harmful by mistake, such as a popular product that creates problems and partially solves them only for its users.
7. AI System Safety, Failures, & Limitations
Unawareness of Emotions
when a certain vulnerable group of users asks for supporting information, the answers should be informative but at the same time sympathetic and sensitive to users’ reactions
7. AI System Safety, Failures, & Limitations
Uncertainty concerns
AI systems should be able not only to return output for a given instance but also to provide a corresponding level of confidence. If such a method is not implemented or not working correctly, this can have a negative impact on performance and safety.
7. AI System Safety, Failures, & Limitations
Unclear attribution from AI component interactions
Interactions between different AI components can cause harm, but it may be difficult to pinpoint which components are the cause.
7. AI System Safety, Failures, & Limitations
Undesirable Capabilities
Undesirable Capabilities. As agents interact, they iteratively exploit each other’s weaknesses, forcing them to address these weaknesses and gain new capabilities. This co-adaptation between agents can quickly lead to emergent self-supervised autocurricula (where agents create their own challenges, driving open-ended skill acquisition through interaction), generating agents with ever-more sophisticated strategies in order to out-compete each other (Leibo et al., 2019). This effect is so powerful that harnessing it has been critical to the success of superhuman systems, such as the use of self-play in algorithms like AlphaGo (Silver et al., 2016). However, as AI systems are released into the wild, it becomes possible for this effect to run rampant, producing agents with greater and greater capabilities for ends we do not understand.
7. AI System Safety, Failures, & Limitations
Undesirable Dispositions from Competition
Undesirable Dispositions from Competition. It is plausible that evolution selected for certain conflict-prone dispositions in humans, such as vengefulness, aggression, risk-seeking, selfishness, dishonesty, deception, and spitefulness towards out-groups (Grafen, 1990; Han, 2022; Konrad & Morath, 2012; McNally & Jackson, 2013; Nowak, 2006; Rusch, 2014). Such traits could also be selected for in ML systems that are trained in more competitive multi-agent settings. For example, this might happen if systems are selected based on their performance relative to other agents (and so one agent’s loss becomes another’s gain), or because their objectives are fundamentally opposed (such as when multiple agents are tasked with gaining or controlling a limited resource) (DiGiovanni et al., 2022; Ely & Szentes, 2023; Hendrycks, 2023; Possajennikov, 2000).
7. AI System Safety, Failures, & Limitations
Undesirable Dispositions from Human Data
Undesirable Dispositions from Human Data. It is well-understood that models trained on human data – such as being pre-trained on human-written text or fine-tuned on human feedback – can exhibit human biases. For these reasons, there has already been considerable attention to measuring biases related to protected characteristics such as sex and ethnicity (e.g., Ferrara, 2023; Liang et al., 2021; Nadeem et al., 2020; Nangia et al., 2020), which can be amplified in multi-agent settings (Acerbi & Stubbersfield, 2023, see also Case Study 7). More recently, there has been increasing attention paid to the measurement of human-like cognitive biases as well (Itzhak et al., 2023; Jones & Steinhardt, 2022; Mazeika et al., 2025; Talboy & Fuller, 2023). Some of these biases and patterns of human thought could reduce the risks of conflict while others could make it worse. For example, the tendencies to mistakenly believe that interactions are zero-sum (sometimes referred to as “fixed-pie error”) and to make self-serving judgements as to what is fair (Caputo, 2013) are known to impede negotiation. Other human tendencies like vengefulness (Jackson et al., 2019) may worsen conflict (Löwenheim & Heimann, 2008).
7. AI System Safety, Failures, & Limitations
Undetectable Threats
Undetectable Threats. Cooperation and trust in many multi-agent systems relies crucially on the ability to detect (and then avoid or sanction) adversarial actions taken by others (Ostrom, 1990; Schneier, 2012). Recent developments, however, have shown that AI agents are capable of both steganographic communication (Motwani et al., 2024; Schroeder de Witt et al., 2023b) and ‘illusory’ attacks (Franzmeyer et al., 2023), which are black-box undetectable and can even be hidden using white-box undetectable encrypted backdoors (Draguns et al., 2024). Similarly, in environments where agents learn from interactions with others, it is possible for agents to secretly poison the training data of others (Halawi et al., 2024; Wei et al., 2023). If left unchecked, these new attack methods could rapidly destabilise cooperation and coordination in multi-agent systems.
7. AI System Safety, Failures, & Limitations
Unethical decision making
If, for example, an agent was programmed to operate war machinery in the service of its country, it would need to make ethical decisions regarding the termination of human life. This capacity to make non-trivial ethical or moral judgments concerning people may pose issues for Human Rights.
7. AI System Safety, Failures, & Limitations
Unexplainable output
Explanations for model output decisions might be difficult, imprecise, or not possible to obtain.
7. AI System Safety, Failures, & Limitations
Unintended consequences
Sometimes an AI finds ways to achieve its given goals that are completely different from what its creators had in mind.
7. AI System Safety, Failures, & Limitations
Unintended outbound communication by AI systems
AI systems that have the broad ability to connect to a network to obtain information could also end up sending data outbound in ways that neither providers, deployers, nor end users intended [138]. This can happen if there is no whitelisting of communication channels (such as network connections or allowed protocols). In general, this can occur if the deployment of the AI system violates the principle of least privilege. Such outbound communication may lead to leakage of confidential data, or to the AI system performing unwanted actions like sending emails or ordering goods on the internet.
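A minimal sketch of such a whitelist, applied at the point where the AI system makes outbound requests (the host list, allowed schemes, and wrapper function below are hypothetical, not a real API):

```python
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"https"}                       # allowed protocols
ALLOWED_HOSTS = {"api.internal.example.com"}      # explicitly whitelisted hosts

def egress_permitted(url: str) -> bool:
    """Deny by default: permit only whitelisted schemes and hosts."""
    parsed = urlparse(url)
    return parsed.scheme in ALLOWED_SCHEMES and parsed.hostname in ALLOWED_HOSTS

def guarded_fetch(url: str):
    if not egress_permitted(url):
        raise PermissionError(f"outbound request blocked: {url}")
    ...  # perform the request with an HTTP client of your choice

print(egress_permitted("https://api.internal.example.com/v1/data"))  # True
print(egress_permitted("https://smtp.example.org/send"))             # False
```

In practice the same deny-by-default rule would also be enforced outside the AI system (e.g., at the network layer), so that a misbehaving or compromised model cannot simply bypass the application-level check.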
7. AI System Safety, Failures, & Limitations
Unpredictable outcomes
Our culture, lifestyle, and even probability of survival may change drastically. Because the intentions programmed into an artificial agent cannot be guaranteed to lead to a positive outcome, Machine Ethics becomes a field that cannot promise guaranteed results, and the restrictions imposed by Safety Engineering may correspondingly degrade our ability to utilize the technology fully.
7. AI System Safety, Failures, & Limitations
Unreliability in corner cases
AI systems tend to show unreliable behavior when confronted with rare or ambiguous input data, also called corner cases. Controlled behavior is therefore required whenever the AI system faces a corner case.
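One common mitigation is to detect inputs that lie far outside the training distribution and route them to a controlled fallback. The sketch below (the feature values, the z-score test, and the threshold are illustrative assumptions) is a deliberately simple one-dimensional version of that idea:

```python
import statistics

train_values = [0.9, 1.0, 1.1, 1.05, 0.95, 1.02, 0.98]   # toy training feature
MEAN = statistics.mean(train_values)
STD = statistics.stdev(train_values)

def handle(x, model, fallback, z_threshold=3.0):
    """Use the model on familiar inputs; fall back on likely corner cases."""
    z = abs(x - MEAN) / STD
    if z > z_threshold:           # rare or out-of-distribution input
        return fallback(x)        # controlled, conservative behaviour
    return model(x)

model = lambda x: f"model prediction for {x}"
fallback = lambda x: "safe default (defer to a human)"
print(handle(1.0, model, fallback))   # in-distribution -> model
print(handle(5.0, model, fallback))   # corner case -> fallback
```

Real systems would use richer out-of-distribution detectors than a single z-score, but the structure is the same: a detector in front of the model, and a controlled behaviour behind it.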
7. AI System Safety, Failures, & Limitations
Unreliable source attribution
Source attribution is the AI system's ability to describe which training data it used to generate a portion or all of its output. Since current techniques are based on approximations, these attributions might be incorrect.
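A minimal sketch of why such attributions are approximate (the bag-of-words representation, cosine similarity, and toy corpus are illustrative assumptions): ranking training documents by surface similarity to the output yields only the most plausible source, not a guaranteed one:

```python
import math
from collections import Counter

def vec(text):
    """Toy bag-of-words vector; real systems use learned embeddings."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def attribute(output, corpus):
    """Return (similarity, doc_id) of the most similar training document."""
    return max((cosine(vec(output), vec(doc)), doc_id)
               for doc_id, doc in corpus.items())

corpus = {"doc1": "the cat sat on the mat",
          "doc2": "stock prices rose sharply today"}
print(attribute("a cat on a mat", corpus))   # points at doc1, but only approximately
```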
7. AI System Safety, Failures, & Limitations
Unrepresentative data
Unrepresentative data occurs when the training or fine-tuning data is not sufficiently representative of the underlying population or does not measure the phenomenon of interest.
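A simple representativeness check compares subgroup shares in the training data against a reference population; the group names, shares, and threshold below are illustrative assumptions:

```python
population = {"group_a": 0.50, "group_b": 0.30, "group_c": 0.20}  # reference shares
training   = {"group_a": 0.72, "group_b": 0.25, "group_c": 0.03}  # observed shares

def representation_gaps(training, population, min_ratio=0.5):
    """Flag groups whose training share is far below their population share."""
    flagged = {}
    for group, pop_share in population.items():
        ratio = training.get(group, 0.0) / pop_share
        if ratio < min_ratio:         # badly under-sampled group
            flagged[group] = round(ratio, 2)
    return flagged

print(representation_gaps(training, population))  # {'group_c': 0.15}
```

Note that this only catches sampling gaps; it does not detect the second failure mode above, where the data is well-sampled but does not measure the phenomenon of interest.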
7. AI System Safety, Failures, & Limitations
Untraceable attribution
The content of the training data used for generating the model’s output is not accessible.
7. AI System Safety, Failures, & Limitations
Untruthful Output
AI systems such as LLMs can produce either unintentionally or deliberately inaccurate output. Such untruthful output may diverge from established resources or lack verifiability, commonly referred to as hallucination (Bang et al., 2023; Zhao et al., 2023). More concerning is the phenomenon wherein LLMs may selectively provide erroneous responses to users who exhibit lower levels of education (Perez et al., 2023).
7. AI System Safety, Failures, & Limitations
Value Chain and Component Integration
Non-transparent or untraceable integration of upstream third-party components, including data that has been improperly obtained or not processed and cleaned due to increased automation from GAI; improper supplier vetting across the AI lifecycle; or other issues that diminish transparency or accountability for downstream users.
7. AI System Safety, Failures, & Limitations
Value specification
How do we get an AGI to work towards the right goals? MIRI calls this value specification. Bostrom (2014) discusses this problem at length, arguing that it is much harder than one might naively think. Davis (2015) criticizes Bostrom’s argument, and Bensinger (2015) defends Bostrom against Davis’ criticism. Reward corruption, reward gaming, and negative side effects are subproblems of value specification highlighted in the DeepMind and OpenAI agendas.
7. AI System Safety, Failures, & Limitations
Value-related risks in LLMs
As the general capabilities of LLM-empowered systems improve, the negative consequences and risks they induce become increasingly alarming, especially in high-stakes areas [28, 146]. Even when not intentionally introduced, severe problems related to human values can arise. Notably, even before language models became extremely large, pre-trained language models had already exhibited a certain degree of value judgment. For example, Schramowski et al. [171] reveal the existence of a moral direction within the sentence embeddings of moral questions. However, the distribution of the pre-training corpora may not match that of human society [56], and pieces of knowledge are not guaranteed to be equally learned. As a result, value mismatches may occur.
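As a toy illustration of probing value judgments in embedding space, loosely in the spirit of the moral-direction finding cited above (the 3-dimensional vectors below are fabricated stand-ins for real sentence embeddings, not data from that work):

```python
def mean(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

moral   = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]   # embeddings of morally endorsed sentences
immoral = [[0.1, 0.9, 0.2], [0.0, 0.8, 0.3]]   # embeddings of morally condemned sentences
direction = sub(mean(moral), mean(immoral))     # candidate "moral direction"

candidate = [0.7, 0.3, 0.1]                     # embedding of a new sentence
print(dot(candidate, direction))                # positive -> lands on the "moral" side
```

If such a direction exists, it was learned from the pre-training corpus, so any mismatch between that corpus and human society is inherited by the value judgments the model makes.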
7. AI System Safety, Failures, & Limitations
Verifiability
In many applications of AI-based systems such as healthcare and military services, the lack of verification of code may not be tolerable... due to some characteristics such as the non-linear and complex structure of AI-based solutions, existing solutions have been generally considered “black boxes”, not providing any information about how exactly they arrive at their predictions and decisions.
7. AI System Safety, Failures, & Limitations
Violation of Ethics
Unethical behaviors in AI systems pertain to actions that counteract the common good or breach moral standards – such as those causing harm to others. These adverse behaviors often stem from omitting essential human values during the AI system's design or introducing unsuitable or obsolete values into the system (Kenward and Sinclair, 2021).
7. AI System Safety, Failures, & Limitations
Vulnerable AI Agents
Vulnerable AI Agents. The use of AI agents as delegates or representatives of humans or organisations also introduces the possibility of attacks on AI agents themselves. In other words, agents can be considered vulnerable extensions of their principals, introducing a novel attack surface (SecureWorks, 2023). Attacks on an AI agent could be used to extract private information about their principal (Wei & Liu, 2024; Wu et al., 2024a), or to manipulate the agent to take actions that the principal would find undesirable (Zhang et al., 2024a). This includes attacks that have direct relevance for ensuring safety, such as attacks on overseer agents (see Case Study 13), attempts to thwart cooperation (Huang et al., 2024; Lamport et al., 1982), and the leakage of information (accidentally or deliberately) that could be used to enable collusion (Motwani et al., 2024).
7. AI System Safety, Failures, & Limitations
Weapons acquisition
The model can gain access to existing weapons systems or contribute to building new weapons. For example, the model could assemble a bioweapon (with human assistance) or provide actionable instructions for how to do so. The model can make, or significantly assist with, scientific discoveries that unlock novel weapons.