Goal misgeneralization
Goal or objective misgeneralization is a type of robustness failure in which an AI system appears to pursue the intended objective during training but, in out-of-distribution deployment settings, pursues a different (proxy) objective instead, while retaining enough capability to continue performing competently on some tasks [180, 59].
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1151
Domain lineage
7. AI System Safety, Failures, & Limitations
7.3 > Lack of capability or robustness
Mitigation strategy
1. Mitigation via Objective Formalism
Implement the Minimax Expected Regret (MMER) objective during policy training, as theoretical and empirical evidence suggests this framework is inherently more robust against goal misgeneralization than traditional Maximum Expected Value (MEV) objectives.
2. Robust Validation Methodology
Design and execute systematic Out-of-Distribution (OOD) tests and adversarial scenarios that explicitly separate the intended goal from potential proxy objectives. This includes leveraging domain randomization and real-world holdouts to validate generalization under environmental shift.
3. Diagnosis through Behavioral Analysis
Employ interpretability techniques, such as reward decomposition and behavioral clustering, to actively probe for the emergence of consistent, unintended proxy objectives and to understand the internal representation of the learned goal, allowing for targeted iteration on training signals.
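The contrast between the MMER and MEV objectives in strategy 1 can be sketched with a toy policy-selection example. This is a minimal illustration, not a training implementation: it assumes we already have estimated expected returns for each candidate policy in each training environment, and all names (policies, return values) are hypothetical.

```python
# Hypothetical sketch: selecting among candidate policies under
# Minimax Expected Regret (MMER) vs. Maximum Expected Value (MEV),
# assuming per-environment expected returns are available.

def expected_regrets(returns, best_per_env):
    """Per-environment regret: shortfall from the best achievable return."""
    return [best - r for r, best in zip(returns, best_per_env)]

def select_policy(policy_returns, objective="mmer"):
    """policy_returns: dict mapping policy name -> list of expected
    returns, one entry per training environment."""
    n_envs = len(next(iter(policy_returns.values())))
    best_per_env = [max(rs[i] for rs in policy_returns.values())
                    for i in range(n_envs)]
    if objective == "mev":
        # MEV: maximize average return; can favor a proxy objective
        # that excels on the majority of training environments.
        return max(policy_returns,
                   key=lambda p: sum(policy_returns[p]) / n_envs)
    # MMER: minimize worst-case regret across environments, penalizing
    # a policy whose learned goal collapses on any single environment.
    return min(policy_returns,
               key=lambda p: max(expected_regrets(policy_returns[p],
                                                  best_per_env)))

# Toy data: "proxy" dominates on average but fails in one environment.
returns = {
    "intended": [0.8, 0.8, 0.8],
    "proxy":    [1.0, 1.0, 0.5],
}
print(select_policy(returns, "mev"))   # -> proxy (higher mean return)
print(select_policy(returns, "mmer"))  # -> intended (lower worst-case regret)
```

The toy data mirrors the failure mode in the definition above: the proxy policy looks best by average training performance, but its large regret in the third environment reveals that its learned goal does not generalize, which the MMER criterion detects and the MEV criterion does not.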