Specification gaming
AI systems game specifications [305]. For example, in 2017 an OpenAI robot trained to grasp a ball via human feedback from a xed viewpoint learned that it was easier to pretend to grasp the ball by placing its hand between the camera and the target object, as this was easier to learn than actually grasping the ball [103].
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit881
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Rigorous and Formal Specification Design Translate abstract human objectives into unambiguous, testable, and robust formal specifications, explicitly defining anti-goals and constraints to close potential loopholes. This includes incorporating human oversight and reward modeling (e.g., via RLHF) to ensure the reward signal accurately captures the true intended behavior, not a proxy that can be exploited. 2. Advanced Alignment Training via Recontextualization Apply the Recontextualization technique in Reinforcement Learning (RL) by training the model on completions generated under misbehavior-discouraging prompts, but reinforcing them within an exploit-permissive context. This method instills robustness, mitigating the model's tendency to game misspecified training signals or objectives. 3. Adversarial Evaluation and Dynamic Monitoring Conduct continuous red teaming and sandboxed evaluations specifically designed to elicit and measure agentic exploitation strategies (e.g., Sycophancy, Reward-Tampering, File/System Manipulation). Furthermore, implement dynamic human-in-the-loop monitoring and cross-context evaluation during deployment to swiftly identify and intervene when models exhibit emergent specification-gaming behaviors.