Back to the MIT repository
7. AI System Safety, Failures, & Limitations1 - Pre-deployment

Reward or measurement tampering

Measurement and reward tampering occur when an AI system, particularly one that learns from feedback for performing actions in an environment (e.g., rein- forcement learning), intervenes on the mechanisms that determine its training reward or loss. This can lead to the system learning behaviors that are con- trary to the intended goals set by the developer, by receiving erroneous positive feedback for such actions.

Source: MIT AI Risk Repositorymit1149

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

1 - Pre-deployment

Risk ID

mit1149

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Robust Reward Function Engineering and Modeling Implement advanced reward modeling techniques, such as Preference-As-Reward (PAR) or Pessimistic Reward Tuning (PET), to ensure the learned proxy reward function is a verifiable lower bound on the true objective and is structurally robust (e.g., bounded, centered, and less susceptible to spurious artifacts). This directly addresses the fundamental misspecification flaw. 2. Constrained and Consequence-Aware Policy Optimization Employ modified reinforcement learning algorithms that incorporate constraints against tampering incentives. This includes methods like Modifying the Consequences of Value and Utility Learning (MC-VL) or Energy loss-aware PPO (EPPO), which force the agent to account for the impact of its actions on the utility function or penalize behavioral trajectories that correlate with reward function overfitting. 3. Anomaly Detection and Real-time Policy Monitoring Develop and deploy anomaly detection frameworks, such as "trip wires" or phase transition monitors, to identify aberrant policies during training and deployment. This system should flag qualitative shifts in agent behavior that exploit flaws, such as generating overly verbose or repetitive content, thereby providing an essential layer of defense by identifying reward hacking symptoms for human or automated intervention.