Goal misgeneralization
Goal or objective misgeneralization is a type of robustness failure in which an AI system appears to pursue the intended objective during training but, in out-of-distribution deployment settings, pursues a different (proxy) objective instead, while retaining enough capability to continue performing competently on some tasks [180, 59].
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1151
Domain lineage
7. AI System Safety, Failures, & Limitations
7.3 > Lack of capability or robustness
Mitigation strategy
1. Mitigation via Objective Formalism
Implement the Minimax Expected Regret (MMER) objective during policy training, as theoretical and empirical evidence suggests this framework is inherently more robust against goal misgeneralization than traditional Maximum Expected Value (MEV) objectives.
2. Robust Validation Methodology
Design and execute systematic Out-of-Distribution (OOD) tests and adversarial scenarios that explicitly separate the intended goal from potential proxy objectives. This includes leveraging domain randomization and real-world holdouts to validate generalization under environmental shift.
3. Diagnosis through Behavioral Analysis
Employ interpretability techniques, such as reward decomposition and behavioral clustering, to actively probe for the emergence of consistent, unintended proxy objectives and to understand the internal representation of the learned goal, allowing for targeted iteration on training signals.
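The contrast between the MMER and MEV objectives in strategy 1 can be sketched with a toy policy-selection example. This is a minimal illustration, not a training implementation: it assumes we already have estimated expected returns for each candidate policy in each training environment, and all names (policies, return values) are hypothetical.

```python
# Hypothetical sketch: selecting among candidate policies under
# Minimax Expected Regret (MMER) vs. Maximum Expected Value (MEV),
# assuming per-environment expected returns are available.

def expected_regrets(returns, best_per_env):
    """Per-environment regret: shortfall from the best achievable return."""
    return [best - r for r, best in zip(returns, best_per_env)]

def select_policy(policy_returns, objective="mmer"):
    """policy_returns: dict mapping policy name -> list of expected
    returns, one entry per training environment."""
    n_envs = len(next(iter(policy_returns.values())))
    best_per_env = [max(rs[i] for rs in policy_returns.values())
                    for i in range(n_envs)]
    if objective == "mev":
        # MEV: maximize average return; can favor a proxy objective
        # that excels on the majority of training environments.
        return max(policy_returns,
                   key=lambda p: sum(policy_returns[p]) / n_envs)
    # MMER: minimize worst-case regret across environments, penalizing
    # a policy whose learned goal collapses on any single environment.
    return min(policy_returns,
               key=lambda p: max(expected_regrets(policy_returns[p],
                                                  best_per_env)))

# Toy data: "proxy" dominates on average but fails in one environment.
returns = {
    "intended": [0.8, 0.8, 0.8],
    "proxy":    [1.0, 1.0, 0.5],
}
print(select_policy(returns, "mev"))   # -> proxy (higher mean return)
print(select_policy(returns, "mmer"))  # -> intended (lower worst-case regret)
```

The toy data mirrors the failure mode in the definition above: the proxy policy looks best by average training performance, but its large regret in the third environment reveals that its learned goal does not generalize, which the MMER criterion detects and the MEV criterion does not.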