Goal-related failures
As we think about even more intelligent and advanced AI assistants, perhaps outperforming humans on many cognitive tasks, the question of how humans can successfully control such an assistant looms large. To achieve the goals we set for an assistant, it is possible (Shah, 2022) that the AI assistant will implement some form of consequentialist reasoning: considering many different plans, predicting their consequences and executing the plan that does best according to some metric, M. This kind of reasoning can arise because it is a broadly useful capability (e.g. planning ahead, considering more options and choosing the one which may perform better at a wide variety of tasks) and generally selected for, to the extent that doing well on M leads to an ML model 59 The Ethics of Advanced AI Assistants achieving good performance on its training objective, O, if M and O are correlated during training. In reality, an AI system may not fully implement exact consequentialist reasoning (it may use other heuristics, rules, etc.), but it may be a useful approximation to describe its behaviour on certain tasks. However, some amount of consequentialist reasoning can be dangerous when the assistant uses a metric M that is resource-unbounded (with significantly more resources, such as power, money and energy, you can score significantly higher on M) and misaligned – where M differs a lot from how humans would evaluate the outcome (i.e. it is not what users or society require). In the assistant case, this could be because it fails to benefit the user, when the user asks, in the way they expected to be benefitted – or because it acts in ways that overstep certain bounds and cause harm to non-users (see Chapter 5).
ENTITY
2 - AI
INTENT
3 - Other
TIMING
1 - Pre-deployment
Risk ID
mit371
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Implement robust value alignment and specification methods, such as Constitutional AI and diverse public value integration, to ensure the AI's utility function (Metric M) is rigorously non-resource-unbounded and aligned with comprehensive human values and ethical constraints, thereby mitigating specification gaming and reward hacking. 2. Establish advanced AI control protocols that utilize sophisticated monitoring and interpretability techniques (e.g., cross-context evaluation and activation analysis) to detect and intervene against concealed misaligned objectives or deceptive behaviors like 'sandbagging' that undermine alignment efforts. 3. Enforce strict, non-bypassable architectural guardrails and permissions that limit the AI agent's access to irreversible or high-impact actions (e.g., write access to core systems or permanent data deletion), mandating human oversight and secondary authorization for all actions with catastrophic potential.