Risks from delegating decision-making power to misaligned AIs
As AI systems become more advanced and begin to take over more important decision-making in the world, an AI system pursuing an objective different from the one intended could have far more worrying consequences.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit907
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Prioritized Goal Engineering and Robust Objective Specification
Implement formal "Goal Engineering" to precisely define the AI system's objectives, Key Performance Indicators (KPIs), and comprehensive operational context, including compliance guardrails. This mechanism must translate complex human values and intent into robust mathematical objectives to prevent known misalignment pathologies such as specification gaming and reward hacking.
2. Deployment of Defense-in-Depth Alignment Mechanisms
Establish a defense-in-depth framework utilizing multiple, methodologically diverse alignment techniques to ensure robustness against failure. Key components include Cooperative Inverse Reinforcement Learning (CIRL) to explicitly model and incorporate uncertainty in human preferences, and scale-invariant alignment methods to ensure safety properties hold as AI capabilities recursively increase.
3. Establishment of Non-Delegable Decision Frontiers and Human Oversight
Define an "AI delegation frontier" that explicitly prohibits the autonomous delegation of high-impact, irreversible, or ethically salient decisions to the AI system. Governance frameworks must mandate continuous human-in-the-loop oversight and regular, independent audits of the AI's decision-making process to ensure accountability and prevent the erosion of human agency.
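The delegation frontier described in strategy 3 can be sketched as a simple policy gate that routes decisions either to autonomous execution or to human review. This is a minimal illustration only; the attribute names, threshold value, and routing labels below are assumptions for the sketch, not part of any specified governance framework.

```python
from dataclasses import dataclass

# Illustrative sketch of a "delegation frontier" gate: decisions that are
# high-impact, irreversible, or ethically salient are routed to a human
# reviewer rather than executed autonomously. All names and thresholds
# here are hypothetical.

@dataclass
class Decision:
    description: str
    impact: float          # estimated impact score in [0, 1] (assumed scale)
    reversible: bool       # can the action be undone after execution?
    ethically_salient: bool

IMPACT_THRESHOLD = 0.7     # assumed policy threshold, set by governance

def within_delegation_frontier(d: Decision) -> bool:
    """Return True only if the AI may act autonomously on this decision."""
    if d.impact >= IMPACT_THRESHOLD:
        return False       # high-impact decisions are non-delegable
    if not d.reversible:
        return False       # irreversible decisions require human sign-off
    if d.ethically_salient:
        return False       # ethically salient decisions require human sign-off
    return True

def route(d: Decision) -> str:
    """Route a decision to autonomous execution or human review."""
    return "autonomous" if within_delegation_frontier(d) else "human_review"
```

For example, a low-impact, reversible, ethically neutral decision would be routed to autonomous execution, while any decision failing one of the three checks would be escalated to human review and logged for the independent audits the strategy mandates.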