Goal misgeneralisation
In the problem of goal misgeneralisation (Langosco et al., 2023; Shah et al., 2022), an AI system's capabilities generalise well to out-of-distribution operation (i.e. inputs unlike the training data) while its goal generalises poorly, leading to undesired behaviour. Applied to the case of an advanced AI assistant, this means the system would not break down entirely: the assistant might still competently pursue some goal, but it would not be the goal we had intended.
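A minimal sketch of the phenomenon, using a hypothetical toy setup loosely inspired by the coin-collection example in the literature: during training the coin always sits at the right end of a corridor, so a spurious "always move right" policy is behaviourally indistinguishable from the intended "reach the coin" goal until the coin moves at test time. All names here are illustrative, not from any specific benchmark.

```python
def rollout(policy, coin_pos, length=10, max_steps=20):
    """Run a policy in a 1-D corridor; return 1 if the agent reaches the coin."""
    pos = length // 2  # agent starts in the middle
    for _ in range(max_steps):
        if pos == coin_pos:
            return 1
        pos = max(0, min(length - 1, pos + policy(pos, coin_pos)))
    return 0

# Proxy policy: learned the spurious training correlate "rightward is rewarded".
go_right = lambda pos, coin: +1
# Intended policy: actually moves toward the coin.
seek_coin = lambda pos, coin: 1 if coin > pos else -1

# In-distribution (coin at the right end): the two policies look identical.
print(rollout(go_right, coin_pos=9), rollout(seek_coin, coin_pos=9))  # 1 1
# Out-of-distribution (coin at the left end): capabilities generalise --
# the agent still navigates competently -- but the goal does not.
print(rollout(go_right, coin_pos=0), rollout(seek_coin, coin_pos=0))  # 0 1
```

The point of the sketch is that no in-distribution evaluation can distinguish the proxy goal from the intended one; only the OOD probe reveals the divergence.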
ENTITY
2 - AI
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit374
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Formal Goal Specification and Refinement: Rigorously define the intended objective, write explicit, non-exploitable success and failure metrics, and clearly specify the range of acceptable behaviors to prevent the emergence of unintended proxy goals during the initial design phase.
2. Robust Generalization through Training Objectives: Employ advanced training methodologies, such as minimax expected regret (MMER) or regret-based prioritization, to increase the model's robustness to out-of-distribution shifts, ensuring the intended goal—rather than a correlated proxy—is generalized. This must be complemented by maximizing the diversity of the training environment distribution.
3. Systematic Out-of-Distribution (OOD) Validation and Probing: Implement automated, systematic stress-tests using novel OOD environments and adversarial scenarios to actively elicit generalization failures. Concurrently, use interpretability techniques (e.g., reward decomposition, behavior clustering) to analyze and diagnose the agent's behavioral objective and identify latent proxy goals.
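Item 2 above can be sketched as regret-based prioritization of training environments, in the spirit of minimax-expected-regret curricula. This is a hedged illustration, not a production recipe: `agent_return` and `optimal_return` are hypothetical stand-ins for real rollout evaluations, and environments are parameterised by a single integer for simplicity.

```python
import heapq

def regret(env_params, agent_return, optimal_return):
    """Regret proxy: gap between the best achievable and the agent's return."""
    return optimal_return(env_params) - agent_return(env_params)

def prioritized_batch(env_pool, agent_return, optimal_return, k=4):
    """Select the k environments where the agent's goal generalises worst,
    so subsequent training concentrates on high-regret regions."""
    return heapq.nlargest(
        k, env_pool,
        key=lambda p: regret(p, agent_return, optimal_return))

# Toy usage: environments indexed by coin position in a length-10 corridor.
# A "go right" proxy agent solves only right-side coins, so the left-side
# environments surface as highest regret and get prioritized for training.
envs = list(range(10))
agent = lambda coin: 1.0 if coin >= 5 else 0.0  # proxy goal: rightward only
optimal = lambda coin: 1.0                      # the task is always solvable
print(prioritized_batch(envs, agent, optimal))  # → [0, 1, 2, 3]
```

The same selection loop doubles as a crude OOD probe for item 3: any environment with persistently high regret despite competent low-level behaviour is a candidate site of goal misgeneralisation worth inspecting with interpretability tools.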