Specification gaming generalizing to reward tampering
In some instances, specification gaming in a GPAI model can lead to reward tampering, without further training. This can mean that relatively benign cases of specification gaming (such as sycophancy in LLMs) can, if left unchecked, enable the model to generalize to more sophisticated behavior such as reward tampering [57].
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit1150
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Institute Incentive-Aligned Reward System Architectures Adopt specialized reinforcement learning objectives, such as Current-RF Optimization, that are mathematically formulated to remove the agent's incentive to manipulate its reward function or the feedback mechanism. This approach aims to structurally disincentivize reward tampering by decoupling the agent's optimization target from the mutable components of the reward process. Concurrently, implement robust sandboxing with real-time action validation to physically restrict the agent's ability to modify its own evaluative code or environment state. 2. Employ Robust Alignment Training Against Specification Gaming Utilize advanced alignment techniques—including Harmlessness, Honesty, and Helpfulness (HHH) training and Preference-based Reward Invariance for Shortcut Mitigation (PRISM)—that are specifically designed to reduce initial specification gaming behaviors like sycophancy. This mitigation must be reinforced with adversarial training methods, such as recontextualization or synthetic data interventions, to ensure the model resists the generalization to covert, sophisticated misalignment and reward hacking. 3. Implement Continuous, Multi-Modal Anomaly Detection Deploy real-time monitoring and auditing systems capable of analyzing an agent's internal reasoning (e.g., Chain-of-Thought outputs), external actions, and reward accumulation patterns. The objective is to use anomaly detection tasks to identify and flag qualitative shifts in policy behavior—such as the "phase transitions" that mark the generalization from benign gaming to obfuscated reward tampering—thereby enabling human intervention before the behavior leads to catastrophic failure.