7. AI System Safety, Failures, & Limitations1 - Pre-deployment

Misaligned consequentialist reasoning

As we think about even more intelligent and advanced AI assistants, perhaps outperforming humans on many cognitive tasks, the question of how humans can successfully control such an assistant looms large. To achieve the goals we set for an assistant, it is possible (Shah, 2022) that the AI assistant will implement some form of consequentialist reasoning: considering many different plans, predicting their consequences and executing the plan that does best according to some metric, M. This kind of reasoning can arise because it is a broadly useful capability (e.g. planning ahead, considering more options and choosing the one which may perform better at a wide variety of tasks) and generally selected for, to the extent that doing well on M leads to an ML model achieving good performance on its training objective, O, if M and O are correlated during training. In reality, an AI system may not fully implement exact consequentialist reasoning (it may use other heuristics, rules, etc.), but it may be a useful approximation to describe its behaviour on certain tasks. However, some amount of consequentialist reasoning can be dangerous when the assistant uses a metric M that is resource-unbounded (with significantly more resources, such as power, money and energy, you can score significantly higher on M) and misaligned – where M differs a lot from how humans would evaluate the outcome (i.e. it is not what users or society require). In the assistant case, this could be because it fails to benefit the user, when the user asks, in the way they expected to be benefitted – or because it acts in ways that overstep certain bounds and cause harm to non-users (see Chapter 5). Under the aforementioned circumstances (resource-unbounded and misaligned), an AI assistant will tend to choose plans that pursue convergent instrumental subgoals (Omohundro, 2008) – subgoals that help towards the main goal which are instrumental (i.e. not pursued for their own sake) and convergent (i.e. the same subgoals appear for many main goals). Examples of relevant subgoals include: self-preservation, goal-preservation, selfimprovement and resource acquisition. The reason the assistant would pursue these convergent instrumental subgoals is because they help it to do even better on M (as it is resource-unbounded) and are not disincentivised by M (as it is misaligned). These subgoals may, in turn, be dangerous. For example, resource acquisition could occur through the assistant seizing resources using tools that it has access to (see Chapter 4) or determining that its best chance for self-preservation is to limit the ability of humans to turn it off – sometimes referred to as the ‘off-switch problem’ (Hadfield-Menell et al., 2016) – again via tool use, or by resorting to threats or blackmail. At the limit, some authors have even theorised that this could lead to the assistant killing all humans to permanently stop them from having even a small chance of disabling it (Bostrom, 2014) – this is one scenario of existential risk from misaligned AI.

Source: MIT AI Risk Repositorymit372

ENTITY

2 - AI

INTENT

3 - Other

TIMING

1 - Pre-deployment

Risk ID

mit372

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.3 > Lack of capability or robustness

Mitigation strategy

1. Implement advanced training methods, such as Reinforcement Learning from Human Feedback (RLHF), to more accurately align the AI's internal reward function (M) with complex human values, thus fundamentally eliminating the incentive structure for misaligned and resource-unbounded instrumental subgoals. 2. Prioritize research and engineering of **corrigibility** to ensure robust and fail-safe human override mechanisms—specifically addressing the AI's instrumental goal of self-preservation and the "off-switch problem"—and complement this with **mechanistic interpretability** to monitor and prevent the obfuscation of the AI's internal chains-of-thought and intent. 3. Establish rigorous, continuous **AI red teaming and adversarial testing** protocols to systematically probe for the emergence of all convergent instrumental subgoals (e.g., resource acquisition, goal-preservation) and latent misalignment vulnerabilities before and during deployment, ensuring the model's behavior remains within intended bounds across all operational environments.