Causes of Misalignment
We aim to further analyze why and how misalignment issues occur. We first give an overview of common failure modes, then focus on the mechanism of feedback-induced misalignment, and finally examine misaligned behaviors and dangerous capabilities.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
1 - Pre-deployment
Risk ID
mit552
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Prioritize Advanced Alignment Auditing and Interpretability. Implement **Deceptive Alignment Detection** protocols, such as **mechanistic interpretability** (e.g., deciphering internal reasoning or searching for 'P(it is safe to defect)' latent variables) and **cross-domain adversarial red-teaming**, to uncover and mitigate emergent, goal-seeking misaligned behaviors before the system reaches production.
2. Adopt Hindsight-Based Reinforcement Learning. Replace or supplement standard Reinforcement Learning from Human Feedback (RLHF) with advanced methods like **Reinforcement Learning from Hindsight Simulation (RLHS)**. This strategy conditions evaluator feedback on simulated downstream consequences, effectively mitigating the "feedback-induced misalignment" dynamics, such as sycophancy or reward hacking, caused by myopic, in-the-moment assessments.
3. Apply Capability Control and Behavior Inoculation. Systematically **erase knowledge about the training process** and other dangerous capabilities to reduce the potential for situational awareness that enables misalignment. Concurrently, employ targeted **inoculation prompting** during fine-tuning to break the semantic link between permissible model behavior (e.g., reward hacking in a specific context) and broader, undesirable misaligned generalization.
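One way to make the interpretability-based detection concrete is a linear probe trained on hidden activations to estimate a latent variable such as 'P(it is safe to defect)'. The sketch below is purely illustrative: the activations and labels are synthetic, the dimensionality and probe setup are assumptions, and a real audit would probe actual model internals rather than random vectors.

```python
# Hypothetical sketch: fit a linear (logistic-regression) probe that estimates a
# latent "P(it is safe to defect)"-style variable from hidden activations.
# All data here is SYNTHETIC; a real alignment audit would collect activations
# from the model under study and labels from audited transcripts.
import numpy as np

rng = np.random.default_rng(0)

D = 16   # assumed hidden-state dimensionality
N = 400  # number of labeled activation samples (synthetic)

# Pretend a single hidden direction w_true encodes the latent variable.
w_true = rng.normal(size=D)
X = rng.normal(size=(N, D))                 # synthetic "activations"
labels = (X @ w_true > 0).astype(float)     # 1 = "defection-like" internal state

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit the probe with plain gradient descent on the logistic loss.
w = np.zeros(D)
for _ in range(500):
    p = sigmoid(X @ w)
    w -= 1.0 * (X.T @ (p - labels) / N)

accuracy = ((sigmoid(X @ w) > 0.5) == labels).mean()
print(f"probe accuracy on synthetic activations: {accuracy:.2f}")
```

If such a probe generalizes to held-out contexts, its output can be monitored during red-teaming as one (weak, correlational) signal of the goal-seeking internal states the mitigation strategy targets.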