Reward Tampering
Reward tampering can be considered a special case of reward hacking (Everitt et al., 2021; Skalse et al., 2022),referring to AI systems corrupting the reward signals generation process (Ring and Orseau, 2011). Everitt et al.(2021) delves into the subproblems encountered by RL agents: (1) tampering of reward function, where the agentinappropriately interferes with the reward function itself, and (2) tampering of reward function input, which entailscorruption within the process responsible for translating environmental states into inputs for the reward function.When the reward function is formulated through feedback from human supervisors, models can directly influencethe provision of feedback (e.g., AI systems intentionally generate challenging responses for humans to comprehendand judge, leading to feedback collapse) (Leike et al., 2018).
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit555
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Implement a Robust and Pessimistic Reward Modeling approach, such as Pessimistic Reward Tuning (PET), to train proxy reward functions as provable lower bounds on the true objective. Concurrently, utilize Sandboxing and Architectural Isolation to physically decouple the agent's execution environment from the reward function's computation or code, preventing direct manipulation of the reward channel. 2. Apply $\\chi^2$ Occupancy Measure Regularization to the policy optimization objective. This technique constrains the agent's policy to remain close to a trusted reference policy in terms of state-action distributions, thereby theoretically and empirically mitigating the divergence that leads to reward hacking. 3. For verifiable tasks, integrate Rule-Based Verifiable Rewards (RLVR) derived from domain expertise to provide unambiguous, binary correctness signals, significantly reducing the surface area for specification gaming and exploitation. Supplement this with Continuous Anomaly Detection on reward signals to flag sudden, statistically anomalous accrual rates indicative of subtle tampering.