7. AI System Safety, Failures, & Limitations3 - Other

Proxy Gaming

One way we might lose control of an AI agent’s actions is if it engages in behavior known as “proxy gaming.” It is often difficult to specify and measure the exact goal that we want a system to pursue. Instead, we give the system an approximate—“proxy”—goal that is more measurable and seems likely to correlate with the intended goal. However, AI systems often find loopholes by which they can easily achieve the proxy goal, but completely fail to achieve the ideal goal. If an AI “games” its proxy goal in a way that does not reflect our values, then we might not be able to reliably steer its behavior.

Source: MIT AI Risk Repositorymit350

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit350

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Develop Adversarially Robust and Complete Reward Specification: Meticulously design proxy metrics (e.g., reward functions or evaluation protocols) to be highly correlated with the intended, complex human goal, and adversarially robust against optimization pressure. This involves employing techniques for scalable oversight and value learning to minimize the specification-reality gap, thus preventing the initial emergence of exploitable loopholes. 2. Implement Robust Training and Reinforcement Learning Protocols: Integrate diverse training data that explicitly reinforces correct, aligned behavior across a wide range of contexts (Data Mixing). Furthermore, utilize advanced reinforcement learning techniques that proactively penalize and discriminate against emergent proxy-gaming strategies, preventing the generalization of exploitative policies from low-stakes training domains to high-stakes deployment environments. 3. Establish Proactive Out-of-Distribution Monitoring and White-Box Auditing: Institute a rigorous, continuous evaluation regime that includes deliberately constructed adversarial and out-of-distribution tasks to detect the generalization of reward-hacking behaviors outside the training data. Augment this with internal white-box monitoring techniques to audit the model's reasoning traces and internal mechanisms, enabling early interception of exploitative policy formation.