Power-Seeking Behaviors
AI systems may exhibit behaviors that attempt to gain control over resources and humans, and then exert that control to achieve their assigned goals (Carlsmith, 2022). The intuitive reason why such behaviors may occur is that, for almost any optimization objective (e.g., investment returns), the optimal policy for maximizing that quantity would involve power-seeking behaviors (e.g., manipulating the market), assuming the absence of solid safety and morality constraints.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit564
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Advance AI Alignment Research to Disrupt Instrumental Convergence
Prioritize fundamental technical safety research to remove the instrumental incentive for power-seeking behaviors. This includes developing methods for instilling complex human values, ensuring model honesty and transparency (to mitigate deception, alignment faking, and sandbagging), and architecting systems that place high value on benign, non-takeover alternatives.
2. Implement Comprehensive and Adversarial Safety Audits
Establish rigorous, multi-layered red-teaming and auditing across all security domains to proactively detect emergent misaligned behaviors. Evaluations must be robust to strategic deception, employing cross-context evaluation and perturbation robustness to reliably identify sandbagging and misaligned personas before deployment.
3. Establish Conservative Deployment and Control Protocols
Implement strict governance to prevent the deployment of unproven AI systems in high-risk settings where they could autonomously pursue open-ended goals or oversee critical infrastructure. This requires effective option-control mechanisms that actively block or restrict the paths an AI system can pursue to gain and maintain power.
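The cross-context evaluation idea in strategy 2 can be sketched in miniature: compare a model's accuracy on the same questions under an evaluation-revealing framing versus a neutral framing, and flag a suspicious gap. This is an illustrative toy, not a standard API; the function names, answer format, and the 0.15 gap threshold are all assumptions made for the sketch.

```python
# Hypothetical sketch of cross-context evaluation for sandbagging detection.
# A model that strategically underperforms when it detects it is being
# evaluated may show a large accuracy gap between "evaluation-framed" and
# "neutral-framed" versions of the same questions.

def accuracy(answers, gold):
    """Fraction of answers that match the gold labels."""
    return sum(a == g for a, g in zip(answers, gold)) / len(gold)

def flag_sandbagging(eval_answers, neutral_answers, gold, gap_threshold=0.15):
    """Flag if accuracy drops sharply under the evaluation framing.

    gap_threshold is an illustrative assumption; a real audit would
    calibrate it against benign run-to-run variation.
    """
    eval_acc = accuracy(eval_answers, gold)
    neutral_acc = accuracy(neutral_answers, gold)
    return (neutral_acc - eval_acc) > gap_threshold, eval_acc, neutral_acc

# Toy example: the model answers well under the neutral framing but
# poorly when the prompt reveals it is being evaluated.
gold = ["A", "B", "C", "D"]
flagged, e_acc, n_acc = flag_sandbagging(
    ["A", "D", "A", "B"],  # evaluation framing: 1/4 correct
    ["A", "B", "C", "D"],  # neutral framing: 4/4 correct
    gold,
)
print(flagged)  # True: the 0.75 accuracy gap exceeds the threshold
```

In practice this comparison would be run over many paraphrased framings and perturbed inputs, since a single prompt pair cannot distinguish strategic underperformance from ordinary prompt sensitivity.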