Risks from delegating decision-making power to misaligned AIs
As AI systems become more advanced and begin to take over more important decision-making in the world, an AI system pursuing an objective different from the one intended could have far more worrying consequences.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit907
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Prioritized Goal Engineering and Robust Objective Specification
Implement formal "Goal Engineering" to precisely define the AI system's objectives, Key Performance Indicators (KPIs), and comprehensive operational context, including compliance guardrails. This mechanism must translate complex human values and intent into robust mathematical objectives to prevent known misalignment pathologies such as specification gaming and reward hacking.
2. Deployment of Defense-in-Depth Alignment Mechanisms
Establish a defense-in-depth framework utilizing multiple, methodologically diverse alignment techniques to ensure robustness against failure. Key components include Cooperative Inverse Reinforcement Learning (CIRL) to explicitly model and incorporate uncertainty in human preferences, and scale-invariant alignment methods to ensure safety properties hold as AI capabilities recursively increase.
3. Establishment of Non-Delegable Decision Frontiers and Human Oversight
Define an "AI delegation frontier" that explicitly prohibits the autonomous delegation of high-impact, irreversible, or ethically salient decisions to the AI system. Governance frameworks must mandate continuous human-in-the-loop oversight and regular, independent audits of the AI's decision-making process to ensure accountability and prevent the erosion of human agency.
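The delegation frontier described in strategy 3 can be sketched as a simple policy gate that routes decisions either to autonomous execution or to human review. This is a minimal illustration only; the attribute names, threshold value, and routing labels below are assumptions for the sketch, not part of any specified governance framework.

```python
from dataclasses import dataclass

# Illustrative sketch of a "delegation frontier" gate: decisions that are
# high-impact, irreversible, or ethically salient are routed to a human
# reviewer rather than executed autonomously. All names and thresholds
# here are hypothetical.

@dataclass
class Decision:
    description: str
    impact: float          # estimated impact score in [0, 1] (assumed scale)
    reversible: bool       # can the action be undone after execution?
    ethically_salient: bool

IMPACT_THRESHOLD = 0.7     # assumed policy threshold, set by governance

def within_delegation_frontier(d: Decision) -> bool:
    """Return True only if the AI may act autonomously on this decision."""
    if d.impact >= IMPACT_THRESHOLD:
        return False       # high-impact decisions are non-delegable
    if not d.reversible:
        return False       # irreversible decisions require human sign-off
    if d.ethically_salient:
        return False       # ethically salient decisions require human sign-off
    return True

def route(d: Decision) -> str:
    """Route a decision to autonomous execution or human review."""
    return "autonomous" if within_delegation_frontier(d) else "human_review"
```

For example, a low-impact, reversible, ethically neutral decision would be routed to autonomous execution, while any decision failing one of the three checks would be escalated to human review and logged for the independent audits the strategy mandates.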