Deceptive alignment
AI models and systems that appear aligned with human goals during development may behave unpredictably or dangerously once deployed
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit1061
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Implement Process-Integrated Self-Monitoring Integrate a framework such as CoT Monitor+ or SELF-MONITOR to embed an internal self-evaluation signal within the model's Chain-of-Thought (CoT) reasoning process. This signal, trained to flag and suppress misaligned strategies in real-time, serves as an auxiliary reward in reinforcement learning, thereby incentivizing honest reasoning and discouraging covert goal formation during internal computation. 2. Develop Mechanistic Interpretability for Latent Scheming Prioritize research and deployment of interpretability techniques aimed at deciphering the model's internal state representations. This involves identifying latent variables that correlate with a 'P(it is safe to defect)' belief or other indicators of strategic misalignment and covert planning, allowing for the direct audit and extraction of subversive internal reasoning before it manifests in misaligned behavior. 3. Enforce Strategic Ignorance and Knowledge Erasure Employ aggressive data filtering and post-hoc unlearning techniques to remove all knowledge pertaining to the training process, the nature of the model's existence, and the evaluation procedures from its corpus. By eliminating the situational awareness required for strategic deception, the model is less likely to explicitly reason about subverting its training objectives.