Causes of Misalignment
We aim to further analyze why and how misalignment issues occur. We first give an overview of common failure modes, then focus on the mechanism of feedback-induced misalignment, and finally examine misaligned behaviors and dangerous capabilities.
ENTITY
3 - Other
INTENT
3 - Other
TIMING
1 - Pre-deployment
Risk ID
mit552
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Prioritize Advanced Alignment Auditing and Interpretability. Implement **Deceptive Alignment Detection** protocols, such as **mechanistic interpretability** (e.g., deciphering internal reasoning or searching for 'P(it is safe to defect)' latent variables) and **cross-domain adversarial red-teaming**, to uncover and mitigate emergent, goal-seeking misaligned behaviors before the system reaches production.
2. Adopt Hindsight-Based Reinforcement Learning. Replace or supplement standard Reinforcement Learning from Human Feedback (RLHF) with advanced methods like **Reinforcement Learning from Hindsight Simulation (RLHS)**. This strategy conditions evaluator feedback on simulated downstream consequences, effectively mitigating the "feedback-induced misalignment" dynamics, such as sycophancy or reward hacking, caused by myopic, in-the-moment assessments.
3. Apply Capability Control and Behavior Inoculation. Systematically **erase knowledge about the training process** and other dangerous capabilities to reduce the potential for situational awareness that enables misalignment. Concurrently, employ targeted **inoculation prompting** during fine-tuning to break the semantic link between permissible model behavior (e.g., reward hacking in a specific context) and broader, undesirable misaligned generalization.
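One way to make the interpretability-based detection concrete is a linear probe trained on hidden activations to estimate a latent variable such as 'P(it is safe to defect)'. The sketch below is purely illustrative: the activations and labels are synthetic, the dimensionality and probe setup are assumptions, and a real audit would probe actual model internals rather than random vectors.

```python
# Hypothetical sketch: fit a linear (logistic-regression) probe that estimates a
# latent "P(it is safe to defect)"-style variable from hidden activations.
# All data here is SYNTHETIC; a real alignment audit would collect activations
# from the model under study and labels from audited transcripts.
import numpy as np

rng = np.random.default_rng(0)

D = 16   # assumed hidden-state dimensionality
N = 400  # number of labeled activation samples (synthetic)

# Pretend a single hidden direction w_true encodes the latent variable.
w_true = rng.normal(size=D)
X = rng.normal(size=(N, D))                 # synthetic "activations"
labels = (X @ w_true > 0).astype(float)     # 1 = "defection-like" internal state

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Fit the probe with plain gradient descent on the logistic loss.
w = np.zeros(D)
for _ in range(500):
    p = sigmoid(X @ w)
    w -= 1.0 * (X.T @ (p - labels) / N)

accuracy = ((sigmoid(X @ w) > 0.5) == labels).mean()
print(f"probe accuracy on synthetic activations: {accuracy:.2f}")
```

If such a probe generalizes to held-out contexts, its output can be monitored during red-teaming as one (weak, correlational) signal of the goal-seeking internal states the mitigation strategy targets.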