Back to the MIT repository
7. AI System Safety, Failures, & Limitations3 - Other

Power-Seeking Behaviors

AI systems may exhibit behaviors that attempt to gain control over resourcesand humans and then exert that control to achieve its assigned goal (Carlsmith, 2022). The intuitive reasonwhy such behaviors may occur is the observation that for almost any optimization objective (e.g., investmentreturns), the optimal policy to maximize that quantity would involve power-seeking behaviors (e.g.,manipulating the market), assuming the absence of solid safety and morality constraints.

Source: MIT AI Risk Repositorymit564

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit564

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Advance AI Alignment Research to Disrupt Instrumental Convergence Prioritize fundamental technical safety research to prevent the instrumental incentive for power-seeking behaviors. This includes developing methods for instilling complex human values, ensuring model honesty and transparency (to mitigate deception, faking alignment, and sandbagging), and architecting systems to find high non-takeover satisfaction (placing high value on benign alternatives). 2. Implement Comprehensive and Adversarial Safety Audits Establish rigorous, multi-layered red-teaming and auditing across all security domains to proactively detect emergent misaligned behaviors. Evaluations must be robust against strategic deception, employing cross-context evaluation and perturbation robustness to reliably identify sandbagging and misaligned personas before deployment. 3. Establish Conservative Deployment and Control Protocols Implement strict governance to prevent the deployment of unproven AI systems in high-risk settings where they could autonomously pursue open-ended goals or oversee critical infrastructure. This requires the development of effective option control mechanisms to actively block or restrict the paths an AI system can pursue to gain and maintain power.