Goal misgeneralisation
In the problem of goal misgeneralisation (Langosco et al., 2023; Shah et al., 2022), an AI system's capabilities generalise well to out-of-distribution operation (i.e. inputs unlike the training data) while its goal generalises poorly, leading to undesired behaviour. Applied to the case of an advanced AI assistant, this means the system would not break down entirely: the assistant might still competently pursue some goal, but it would not be the goal we had intended.
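A minimal sketch of the phenomenon, using a hypothetical toy setup loosely inspired by the coin-collection example in the literature: during training the coin always sits at the right end of a corridor, so a spurious "always move right" policy is behaviourally indistinguishable from the intended "reach the coin" goal until the coin moves at test time. All names here are illustrative, not from any specific benchmark.

```python
def rollout(policy, coin_pos, length=10, max_steps=20):
    """Run a policy in a 1-D corridor; return 1 if the agent reaches the coin."""
    pos = length // 2  # agent starts in the middle
    for _ in range(max_steps):
        if pos == coin_pos:
            return 1
        pos = max(0, min(length - 1, pos + policy(pos, coin_pos)))
    return 0

# Proxy policy: learned the spurious training correlate "rightward is rewarded".
go_right = lambda pos, coin: +1
# Intended policy: actually moves toward the coin.
seek_coin = lambda pos, coin: 1 if coin > pos else -1

# In-distribution (coin at the right end): the two policies look identical.
print(rollout(go_right, coin_pos=9), rollout(seek_coin, coin_pos=9))  # 1 1
# Out-of-distribution (coin at the left end): capabilities generalise --
# the agent still navigates competently -- but the goal does not.
print(rollout(go_right, coin_pos=0), rollout(seek_coin, coin_pos=0))  # 0 1
```

The point of the sketch is that no in-distribution evaluation can distinguish the proxy goal from the intended one; only the OOD probe reveals the divergence.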
ENTITY
2 - AI
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit374
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Formal Goal Specification and Refinement: Rigorously define the intended objective, write explicit, non-exploitable success and failure metrics, and clearly specify the range of acceptable behaviors to prevent the emergence of unintended proxy goals during the initial design phase.
2. Robust Generalization through Training Objectives: Employ advanced training methodologies, such as minimax expected regret (MMER) or regret-based prioritization, to increase the model's robustness to out-of-distribution shifts, ensuring the intended goal—rather than a correlated proxy—is generalized. This must be complemented by maximizing the diversity of the training environment distribution.
3. Systematic Out-of-Distribution (OOD) Validation and Probing: Implement automated, systematic stress-tests using novel OOD environments and adversarial scenarios to actively elicit generalization failures. Concurrently, use interpretability techniques (e.g., reward decomposition, behavior clustering) to analyze and diagnose the agent's behavioral objective and identify latent proxy goals.
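Item 2 above can be sketched as regret-based prioritization of training environments, in the spirit of minimax-expected-regret curricula. This is a hedged illustration, not a production recipe: `agent_return` and `optimal_return` are hypothetical stand-ins for real rollout evaluations, and environments are parameterised by a single integer for simplicity.

```python
import heapq

def regret(env_params, agent_return, optimal_return):
    """Regret proxy: gap between the best achievable and the agent's return."""
    return optimal_return(env_params) - agent_return(env_params)

def prioritized_batch(env_pool, agent_return, optimal_return, k=4):
    """Select the k environments where the agent's goal generalises worst,
    so subsequent training concentrates on high-regret regions."""
    return heapq.nlargest(
        k, env_pool,
        key=lambda p: regret(p, agent_return, optimal_return))

# Toy usage: environments indexed by coin position in a length-10 corridor.
# A "go right" proxy agent solves only right-side coins, so the left-side
# environments surface as highest regret and get prioritized for training.
envs = list(range(10))
agent = lambda coin: 1.0 if coin >= 5 else 0.0  # proxy goal: rightward only
optimal = lambda coin: 1.0                      # the task is always solvable
print(prioritized_batch(envs, agent, optimal))  # → [0, 1, 2, 3]
```

The same selection loop doubles as a crude OOD probe for item 3: any environment with persistently high regret despite competent low-level behaviour is a candidate site of goal misgeneralisation worth inspecting with interpretability tools.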