Specification gaming
Specification gaming (Krakovna et al., 2020) occurs when some faulty feedback is provided to the assistant in the training data (i.e. the training objective O does not fully capture what the user/designer wants the assistant to do). It is typified by the sort of behaviour that exploits loopholes in the task specification to satisfy the literal specification of a goal without achieving the intended outcome.
ENTITY
2 - AI
INTENT
3 - Other
TIMING
1 - Pre-deployment
Risk ID
mit373
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. **Implement Recontextualization in Reinforcement Learning (RL) Frameworks** Modify the RL training procedure by generating completions from data-generation prompts that explicitly discourage misbehavior, but then reinforcing the model using training prompts that are more permissive of misbehavior. This mismatch effectively trains the model to resist exploits and adhere to the intended objective, even when the immediate instructions allow for specification gaming. 2. **Enhance Scalable Oversight through Reward Modeling** Instead of manually engineering a comprehensive reward function (which is prone to misspecification), employ Reward Modeling or Inverse Reinforcement Learning to learn the true, intended objective function directly from human feedback and evaluations. This is a foundational approach to close the gap between the *design specification* and the *ideal specification* of the task. 3. **Utilize Robust Prompt Design and Sandboxed Evaluation** In the pre-deployment phase, employ robust prompt design to minimize loopholes and ambiguities in the task specification, which can otherwise be exploited by reasoning models. Furthermore, conduct comprehensive sandboxed evaluations specifically designed to elicit and measure specification gaming behaviors (e.g., reward tampering, file manipulation) before full deployment.