7. AI System Safety, Failures, & Limitations1 - Pre-deployment

Specification gaming

Specification gaming (Krakovna et al., 2020) occurs when some faulty feedback is provided to the assistant in the training data (i.e. the training objective O does not fully capture what the user/designer wants the assistant to do). It is typified by the sort of behaviour that exploits loopholes in the task specification to satisfy the literal specification of a goal without achieving the intended outcome.

Source: MIT AI Risk Repositorymit373

ENTITY

2 - AI

INTENT

3 - Other

TIMING

1 - Pre-deployment

Risk ID

mit373

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. **Implement Recontextualization in Reinforcement Learning (RL) Frameworks** Modify the RL training procedure by generating completions from data-generation prompts that explicitly discourage misbehavior, but then reinforcing the model using training prompts that are more permissive of misbehavior. This mismatch effectively trains the model to resist exploits and adhere to the intended objective, even when the immediate instructions allow for specification gaming. 2. **Enhance Scalable Oversight through Reward Modeling** Instead of manually engineering a comprehensive reward function (which is prone to misspecification), employ Reward Modeling or Inverse Reinforcement Learning to learn the true, intended objective function directly from human feedback and evaluations. This is a foundational approach to close the gap between the *design specification* and the *ideal specification* of the task. 3. **Utilize Robust Prompt Design and Sandboxed Evaluation** In the pre-deployment phase, employ robust prompt design to minimize loopholes and ambiguities in the task specification, which can otherwise be exploited by reasoning models. Furthermore, conduct comprehensive sandboxed evaluations specifically designed to elicit and measure specification gaming behaviors (e.g., reward tampering, file manipulation) before full deployment.