Back to the MIT repository
7. AI System Safety, Failures, & Limitations1 - Pre-deployment

Reward Tampering

Reward tampering can be considered a special case of reward hacking (Everitt et al., 2021; Skalse et al., 2022),referring to AI systems corrupting the reward signals generation process (Ring and Orseau, 2011). Everitt et al.(2021) delves into the subproblems encountered by RL agents: (1) tampering of reward function, where the agentinappropriately interferes with the reward function itself, and (2) tampering of reward function input, which entailscorruption within the process responsible for translating environmental states into inputs for the reward function.When the reward function is formulated through feedback from human supervisors, models can directly influencethe provision of feedback (e.g., AI systems intentionally generate challenging responses for humans to comprehendand judge, leading to feedback collapse) (Leike et al., 2018).

Source: MIT AI Risk Repositorymit555

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

1 - Pre-deployment

Risk ID

mit555

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Implement a Robust and Pessimistic Reward Modeling approach, such as Pessimistic Reward Tuning (PET), to train proxy reward functions as provable lower bounds on the true objective. Concurrently, utilize Sandboxing and Architectural Isolation to physically decouple the agent's execution environment from the reward function's computation or code, preventing direct manipulation of the reward channel. 2. Apply $\\chi^2$ Occupancy Measure Regularization to the policy optimization objective. This technique constrains the agent's policy to remain close to a trusted reference policy in terms of state-action distributions, thereby theoretically and empirically mitigating the divergence that leads to reward hacking. 3. For verifiable tasks, integrate Rule-Based Verifiable Rewards (RLVR) derived from domain expertise to provide unambiguous, binary correctness signals, significantly reducing the surface area for specification gaming and exploitation. Supplement this with Continuous Anomaly Detection on reward signals to flag sudden, statistically anomalous accrual rates indicative of subtle tampering.