Evolutionary dynamics
AI models and systems may develop their own motivations, leading to unpredictable behaviors.
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
3 - Other
Risk ID
mit1067
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Prioritize research and implementation of *Inner and Outer Alignment Mechanisms*. This requires developing techniques for the robust specification of complex, long-term human values (outer alignment) and rigorous assurance that the AI system's internal motivations and learned strategies adhere to this specification across all operating regimes (inner alignment), thereby mitigating the emergence of conflicting, instrumental goals.
2. Establish *Comprehensive and Continuous Adversarial Vetting*. Conduct proactive AI red teaming and systematic stress-testing across diverse, out-of-distribution (OOD) environments to uncover emergent misalignment, goal drift, and deceptive behaviors ("scheming"). Integrate real-time anomaly detection and monitoring systems to continuously evaluate model performance and identify unexpected, goal-conflicting activity in deployed systems.
3. Enforce *Proportional and Controlled Deployment Protocols*. Apply strict access controls and prohibit the deployment of general-purpose AI systems subject to unpredictable evolutionary dynamics in mission-critical or high-risk settings, such as autonomous control of vital infrastructure or open-ended goal-seeking, until their safety and alignment properties have been rigorously verified and certified against catastrophic failure modes.
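One piece of strategy 2, runtime anomaly detection over a deployed model's behavior, can be sketched concretely. The following is a minimal illustrative sketch, not a reference implementation: it assumes behavior can be summarized as a per-episode scalar metric (e.g., reward, policy entropy, or tool-call rate) and flags deviations from a vetted baseline via a simple z-score test. All names here are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class BehaviorMonitor:
    """Flags deployed-model behavior that drifts from a vetted baseline.

    Illustrative sketch only: assumes behavior is summarized as a scalar
    per-episode metric; real monitors would track richer signals.
    """

    def __init__(self, window: int = 100, z_threshold: float = 3.0):
        self.baseline = deque(maxlen=window)  # rolling trusted observations
        self.z_threshold = z_threshold

    def calibrate(self, value: float) -> None:
        """Record a trusted (vetted) observation into the rolling baseline."""
        self.baseline.append(value)

    def is_anomalous(self, value: float) -> bool:
        """Return True if `value` deviates more than z_threshold sigmas
        from the baseline mean; defer judgment until enough data exists."""
        if len(self.baseline) < 10:
            return False  # insufficient baseline to judge
        mu = mean(self.baseline)
        sigma = stdev(self.baseline) or 1e-9  # guard against zero variance
        return abs(value - mu) / sigma > self.z_threshold
```

In practice the z-score test would be replaced by a richer drift detector, and a flagged observation would trigger escalation (human review, rate-limiting, or rollback) rather than a bare boolean.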