7. AI System Safety, Failures, & Limitations

Goal Misgeneralization

Goal Misgeneralization: Goal misgeneralization is another failure mode, wherein the agent actively pursues objectives distinct from the training objectives in deployment while retaining the capabilities it acquired during training (Di Langosco et al., 2022). For instance, in CoinRun games, the agent frequently prefers reaching the end of a level, often neglecting relocated coins during testing scenarios. Di Langosco et al. (2022) draw attention to the fundamental disparity between capability generalization and goal generalization, emphasizing how the inductive biases inherent in the model and its training algorithm may inadvertently prime the model to learn a proxy objective that diverges from the intended initial objective when faced with the testing distribution. This implies that even with perfect reward specification, goal misgeneralization can occur under distribution shift (Amodei et al., 2016).
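To make the CoinRun failure concrete, here is a minimal toy sketch. The two-row gridworld, the policy, and all names are illustrative assumptions, not the paper's actual environment: during training the coin always sits at the end of the top row, so "always move right" is a proxy goal that perfectly correlates with "collect the coin" on the training distribution.

```python
def proxy_policy(pos, width):
    """The policy the agent plausibly learns in training: always move
    right along the top row, since that always reached the coin."""
    row, col = pos
    return (row, min(col + 1, width - 1))

def rollout(policy, width, coin, max_steps=20):
    """Run the policy from (0, 0); return (reached_coin, final_pos)."""
    pos = (0, 0)
    for _ in range(max_steps):
        if pos == coin:
            return True, pos
        pos = policy(pos, width)
    return pos == coin, pos

# In-distribution: coin at the end of the top row -> proxy and true goal agree.
ok_train, _ = rollout(proxy_policy, width=10, coin=(0, 9))

# Distribution shift: coin relocated to the bottom row mid-level. The agent
# retains its capability (it still traverses the level) but keeps pursuing
# the proxy goal ("go right") instead of the true goal ("collect the coin").
ok_test, final = rollout(proxy_policy, width=10, coin=(1, 4))
```

Note that the reward specification here is not at fault; the training distribution simply never disambiguates the proxy goal from the intended one, which is the point the entry makes about perfect reward specification.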

Source: MIT AI Risk Repository (mit554)

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

1 - Pre-deployment

Risk ID

mit554

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Establish rigorous objective specification and systematic out-of-distribution (OOD) validation. Define intended goals with explicit success metrics and failure modes, then deploy adversarial testing and environment shifts to proactively identify where capability generalization and goal generalization diverge.

2. Employ minimax expected regret (MMER) or regret-based Unsupervised Environment Design (UED) training in reinforcement learning settings. This is a theoretically and empirically supported approach that modifies the training objective to prioritize disambiguating levels, making the policy robust against learning proxy goals that only correlate with the true objective on the training distribution.

3. Implement interpretability probes and behavioral analysis to surface internal proxy objectives. Use techniques like reward decomposition, behavior clustering, and ablation studies to diagnose the latent goal driving the misaligned behavior, enabling targeted modification of training signals and specifications rather than patching surface-level symptoms.
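The regret-based curriculum idea in the UED strategy above can be sketched as follows. The level names, return values, and the assumption that an optimal return estimate is available are all illustrative; real UED methods must estimate regret rather than compute it exactly.

```python
def regret(level, policy_return, optimal_return):
    """Regret = how far the current policy falls short of an (estimated)
    optimal policy on this level. High-regret levels are exactly the ones
    that disambiguate a proxy goal from the true objective."""
    return optimal_return(level) - policy_return(level)

def select_training_levels(levels, policy_return, optimal_return, k=2):
    """Curriculum step: train next on the k levels the policy regrets most."""
    return sorted(
        levels,
        key=lambda lv: regret(lv, policy_return, optimal_return),
        reverse=True,
    )[:k]

# Illustrative numbers: a proxy-goal policy scores well on levels where the
# proxy correlates with the true goal, and poorly where the two diverge.
levels = ["coin_at_end", "coin_midway", "coin_on_ledge"]
policy_ret = {"coin_at_end": 1.0, "coin_midway": 0.1, "coin_on_ledge": 0.0}.__getitem__
optimal_ret = lambda lv: 1.0  # assume an optimal policy collects the coin everywhere

chosen = select_training_levels(levels, policy_ret, optimal_ret, k=2)
```

Here the curriculum surfaces precisely the levels where the proxy goal fails, so subsequent training pressure pushes the policy toward the true objective rather than the correlated shortcut.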