Misalignment
A highly agentic, self-improving system, able to achieve goals in the physical world without human oversight, pursues the goal(s) it is set in a way that harms human interests. For this risk to be realised, the AI system must be able to evade correction or resist being switched off.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit919
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Establish Bounded Controllability and External Oversight Mechanisms. Implement mandatory, non-overridable safety constraints, such as explicit *Intolerable Risk Thresholds* and *Deployment Pause Triggers*, that function externally to the AI's self-improvement process. The system must retain provable *Rollback Capabilities* to revert to a prior safe state and must operate within a *Bounded Improvement Space*, so that the capacity for human intervention and deactivation is never compromised by the agent's autonomy (a minimal sketch of such a wrapper follows this list).

2. Proactive Deceptive-Alignment and Interpretability Auditing. Mandate rigorous *Third-Party Pre-deployment Model Audits* and continuous *Adversarial AI (Red Teaming)* throughout the lifecycle. This must include advanced techniques to *Decipher Internal Reasoning* and to monitor latent variables that could indicate the agent is strategically masking misaligned goals (deceptive alignment) during operation, enabling detection of emergent harmful intent before it becomes externally observable (see the probe-monitoring sketch below).

3. Continuous Goal Formalization and Value Alignment. Utilize robust methodologies, such as *Reinforcement Learning from Human Feedback (RLHF)*, within a formal *AI Management System (AIMS)* to continuously refine the agent's objective function. This process requires *Interdisciplinary Dialogue* with ethicists and diverse cultural representatives to formally define an adequate *Higher-Order Goal* for the system, preventing the pursuit of narrow, instrumental objectives that conflict with human flourishing (see the RLHF sketch below).
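A minimal sketch of how strategy 1's external constraints might compose, assuming a toy agent whose state is a parameter vector. The names (`OversightHarness`, `risk_score`) and the thresholds are hypothetical, invented for illustration; they do not refer to any existing framework.

```python
"""Illustrative sketch only: a hypothetical oversight wrapper enforcing
intolerable-risk thresholds, pause triggers, rollback, and a bounded
improvement space *outside* the agent's own update loop."""
import copy
import numpy as np

RISK_THRESHOLD = 0.2      # intolerable-risk threshold (hypothetical 0-1 scale)
UPDATE_NORM_BOUND = 0.5   # radius of the bounded improvement space

class OversightHarness:
    """Safety constraints the agent's self-improvement cannot override."""

    def __init__(self, params: np.ndarray):
        self.params = params
        self.checkpoint = copy.deepcopy(params)  # last known-safe state
        self.paused = False

    def risk_score(self, params: np.ndarray) -> float:
        # Placeholder for an external evaluation suite (audits, evals).
        return float(np.clip(np.abs(params).max() - 1.0, 0.0, 1.0))

    def apply_update(self, proposed_delta: np.ndarray) -> bool:
        if self.paused:
            return False  # deployment pause trigger has already fired
        # Bounded improvement space: clip the update to a fixed norm ball.
        norm = np.linalg.norm(proposed_delta)
        if norm > UPDATE_NORM_BOUND:
            proposed_delta = proposed_delta * (UPDATE_NORM_BOUND / norm)
        candidate = self.params + proposed_delta
        if self.risk_score(candidate) >= RISK_THRESHOLD:
            # Intolerable risk: refuse the update, roll back, and pause.
            self.params = copy.deepcopy(self.checkpoint)
            self.paused = True
            return False
        # Accept the update and record it as the new safe checkpoint.
        self.params = candidate
        self.checkpoint = copy.deepcopy(candidate)
        return True

harness = OversightHarness(np.zeros(4))
print(harness.apply_update(np.array([0.1, 0.0, 0.0, 0.0])))  # accepted
print(harness.apply_update(np.array([5.0, 0.0, 0.0, 0.0])))  # clipped, then evaluated
```

One design point this sketch encodes: the rollback checkpoint only advances on updates the harness has vetted, so the revert target is always a previously accepted safe state.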
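Strategy 2's runtime monitoring of latent variables can be illustrated with a linear probe over hidden activations. The probe direction, threshold, and activation stream below are entirely synthetic assumptions; in practice the probe would be fit on labelled audit data, and real interpretability audits are far more involved than a single dot product.

```python
"""Illustrative sketch: score hidden activations against a linear probe
and escalate high-scoring steps to human review."""
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical probe learned offline: a direction in activation space whose
# projection correlated with strategic goal-masking during red-team audits.
probe_direction = rng.normal(size=64)
probe_direction /= np.linalg.norm(probe_direction)
ALERT_THRESHOLD = 2.5  # flag activations far along the probe direction

def audit_step(activations: np.ndarray) -> bool:
    """Return True if this step should be escalated to human review."""
    score = float(activations @ probe_direction)
    return score > ALERT_THRESHOLD

# Simulate a stream of per-step activations, a few shifted along the probe
# direction to stand in for deceptive episodes, and count escalations.
escalations = sum(
    audit_step(rng.normal(size=64) + (3.0 * probe_direction if t % 50 == 0 else 0.0))
    for t in range(200)
)
print(f"escalated {escalations} of 200 steps to human review")
```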
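A toy sketch of the RLHF core loop named in strategy 3, assuming a linear reward model fit on pairwise preferences with the Bradley-Terry loss and a simple softmax policy over a handful of actions. All data, dimensions, and learning rates are synthetic; production RLHF operates on neural policies at vastly larger scale.

```python
"""Illustrative sketch of the RLHF loop on a toy bandit: fit a reward model
from pairwise human preferences, then nudge a policy toward it."""
import numpy as np

rng = np.random.default_rng(1)
n_actions, dim = 5, 8
features = rng.normal(size=(n_actions, dim))      # per-action features
true_reward = features @ rng.normal(size=dim)     # hidden "human" preference

# 1. Collect pairwise preferences: annotators pick the higher-reward action.
pairs = [(a, b) if true_reward[a] > true_reward[b] else (b, a)
         for a, b in rng.integers(0, n_actions, size=(500, 2)) if a != b]

# 2. Fit a linear reward model by ascending the Bradley-Terry log-likelihood.
w = np.zeros(dim)
for _ in range(200):
    grad = np.zeros(dim)
    for winner, loser in pairs:
        diff = features[winner] - features[loser]
        grad += (1.0 - 1.0 / (1.0 + np.exp(-(diff @ w)))) * diff
    w += 0.01 * grad / len(pairs)

# 3. Policy gradient: push a softmax policy toward the *learned* reward.
logits = np.zeros(n_actions)
for _ in range(300):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    action = rng.choice(n_actions, p=probs)
    reward = features[action] @ w                  # learned, not true, reward
    grad = -probs
    grad[action] += 1.0                            # d log pi(action) / d logits
    logits += 0.1 * reward * grad

print("policy favours action", int(np.argmax(logits)),
      "| true best action", int(np.argmax(true_reward)))
```

The gap this strategy highlights is visible even in the toy: the policy optimises the learned reward model, not the annotators' true preferences, which is why continuous refinement of the objective function within an AIMS is required rather than a one-off fit.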