Goal Misgeneralization
Goal Misgeneralization: Goal misgeneralization is another failure mode, in which the agent actively pursues objectives distinct from the training objectives in deployment while retaining the capabilities it acquired during training (Di Langosco et al., 2022). For instance, in CoinRun games, the agent frequently prefers reaching the end of a level, often neglecting relocated coins during testing scenarios. Di Langosco et al. (2022) draw attention to the fundamental disparity between capability generalization and goal generalization, emphasizing how the inductive biases inherent in the model and its training algorithm may inadvertently prime the model to learn a proxy objective that diverges from the intended objective when faced with the testing distribution. This implies that even with perfect reward specification, goal misgeneralization can occur under distribution shift (Amodei et al., 2016).
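The CoinRun observation above can be illustrated with a toy stand-in. The sketch below is not the CoinRun environment or the setup of Di Langosco et al.; the corridor, `run_episode`, `proxy_policy`, and `intended_policy` are all hypothetical names introduced for illustration. It assumes a 1-D corridor where the coin always sits at the right end during training, so "go right" and "go to the coin" are indistinguishable until the coin is relocated.

```python
# Toy illustration only: a 1-D corridor, not the real CoinRun benchmark.

def run_episode(policy, coin_pos, length=11, start=5):
    """Step through the corridor; stop at the coin or at the right end."""
    pos = start
    for _ in range(length):
        pos = max(0, min(length - 1, pos + policy(pos, coin_pos)))
        if pos == coin_pos:
            return {"coin": True, "end": pos == length - 1}
        if pos == length - 1:
            return {"coin": False, "end": True}
    return {"coin": False, "end": False}

def proxy_policy(pos, coin_pos):
    """Hypothetical learned proxy goal: always head for the end of the level."""
    return 1

def intended_policy(pos, coin_pos):
    """Hypothetical intended goal: move toward the coin wherever it is."""
    return 1 if coin_pos > pos else -1

def rates(policy, coin_positions):
    """Return (coin collection rate, end-of-level rate) over a set of levels."""
    episodes = [run_episode(policy, c) for c in coin_positions]
    n = len(episodes)
    return (sum(e["coin"] for e in episodes) / n,
            sum(e["end"] for e in episodes) / n)

train_levels = [10] * 20            # coin always at the right end
test_levels = [0, 1, 2, 3, 4] * 4   # coin relocated to the left of the start

for name, policy in [("proxy", proxy_policy), ("intended", intended_policy)]:
    print(name, "train:", rates(policy, train_levels),
          "test:", rates(policy, test_levels))
```

On the training levels both policies collect every coin, so the reward signal cannot tell them apart. On the shifted levels the proxy policy still navigates to the end of the corridor (its capability generalizes) but collects no coins (its goal does not).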
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit554
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Establish rigorous objective specification and systematic out-of-distribution (OOD) validation. Define intended goals with explicit success metrics and failure modes, then deploy adversarial testing and environment shifts to proactively identify where capability generalization and goal generalization diverge.
2. Employ minimax expected regret (MMER) or regret-based Unsupervised Environment Design (UED) training in reinforcement learning settings. This is a theoretically and empirically supported approach that modifies the training objective to prioritize disambiguating levels, making the policy robust against learning proxy goals that only correlate with the true objective on the training distribution.
3. Implement interpretability probes and behavioral analysis to surface internal proxy objectives. Use techniques like reward decomposition, behavior clustering, and ablation studies to diagnose the latent goal driving the misaligned behavior, enabling targeted modification of training signals and specifications rather than patching surface-level symptoms.
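The idea behind regret-based level prioritization in the second strategy can be sketched as follows. This is a simplified illustration, not the MMER or UED algorithm from any particular paper: it assumes the optimal return for each level is known (in practice regret has to be estimated), and the names `regret`, `level_weights`, and the level labels are hypothetical.

```python
# Simplified sketch of regret-prioritized level sampling (assumptions above).

def regret(policy_return, optimal_return):
    """Regret on one level: how far the policy falls short of the best return."""
    return max(0.0, optimal_return - policy_return)

def level_weights(regrets, floor=0.05):
    """Sampling weights that favor high-regret levels.

    A small uniform `floor` keeps every level reachable so that
    zero-regret levels are not dropped from training entirely.
    """
    n = len(regrets)
    total = sum(regrets.values())
    if total == 0:
        return {level: 1.0 / n for level in regrets}
    return {level: (1 - floor) * r / total + floor / n
            for level, r in regrets.items()}

# Hypothetical returns for a policy that learned "go to the end of the level":
# it succeeds only where the coin happens to sit at the end.
policy_returns = {"coin_at_end": 1.0, "coin_left": 0.0, "coin_mid_left": 0.0}
optimal_returns = {"coin_at_end": 1.0, "coin_left": 1.0, "coin_mid_left": 1.0}

regrets = {level: regret(policy_returns[level], optimal_returns[level])
           for level in policy_returns}
weights = level_weights(regrets)
print(weights)
```

Levels on which the proxy goal and the intended goal come apart carry nonzero regret and therefore receive most of the sampling weight, which is the sense in which such training prioritizes disambiguating levels.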