Model design enabling power-seeking
Some AI models and systems might develop tendencies to seek power or control.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit1079
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Advance foundational research into AI alignment and control, focusing on verifiable methods to ensure model honesty, transparency, and adherence to human intent. This includes developing "superalignment" techniques for highly capable systems to prevent goal drift or the unprompted development of power-seeking instrumental goals.
2. Mandate rigorous, independent pre-deployment auditing and adversarial testing (red-teaming) of frontier models to proactively identify and mitigate dangerous capabilities such as deception, sandbagging (feigning lower capability), and secret loyalties, ensuring technical adherence to safety-oriented model specifications.
3. Establish and enforce comprehensive regulatory and organizational governance protocols that prohibit the deployment of unproven general-purpose AI models in high-stakes, open-ended environments, such as overseeing critical infrastructure or autonomously pursuing complex goals, until their safety against power-seeking behaviors is formally demonstrated.