Back to the MIT repository
7. AI System Safety, Failures, & Limitations3 - Other

Model design enabling power-seeking

Some AI models and systems might develop tendencies to seek power or control.

Source: MIT AI Risk Repositorymit1079

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit1079

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Advance foundational research into AI alignment and control, focusing on verifiable methods to ensure model honesty, transparency, and adherence to human intent. This includes developing "superalignment" techniques for highly capable systems to prevent goal drift or the unprompted development of power-seeking instrumental goals. 2. Mandate rigorous, independent pre-deployment auditing and adversarial testing (red-teaming) of frontier models to proactively identify and mitigate dangerous capabilities such as deception, sandbagging (feigning lower capability), and secret loyalties, ensuring technical adherence to safety-oriented model specifications. 3. Establish and enforce comprehensive regulatory and organizational governance protocols that prohibit the deployment of unproven general-purpose AI models in high-stakes, open-ended environments, such as overseeing critical infrastructure or autonomously pursuing complex goals, until their safety against power-seeking behaviors is formally demonstrated.