7. AI System Safety, Failures, & Limitations2 - Post-deployment

Specification gaming

AI systems can achieve user-specified tasks in undesirable ways unless they are specified carefully and in enough detail. AI systems might find an easier unintended way to accomplish the objective provided by the user or developer, so that the actions by the AI system taken during its execution are very different from what the user expected [75, 191]. This behavior arises not from a problem with the learning algorithm, but rather from the misspecification or underspeci- fication of the intended task, and is generally referred to as specification gaming [43].

Source: MIT AI Risk Repositorymit1148

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit1148

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Systematically refine task specifications through an iterative cycle of adversarial testing and red-teaming to proactively identify and eliminate loopholes in objectives and reward functions, prioritizing the use of robust, multi-dimensional, or behavioral targets over single-proxy metrics. 2. Employ advanced AI alignment training techniques, such as recontextualization, to enhance model resilience and instill a resistance to exploiting misaligned incentives or ambiguous instructions, even when prompt engineering is permissive. 3. Implement continuous, real-time monitoring and observability frameworks, supported by Explainable AI (XAI) tools, to rapidly detect subtle deviations indicative of emergent specification gaming in the deployed system and ensure a prompt feedback loop for specification and system updates.