Deceptive behavior for game-theoretical reasons
An AI system can display deceptive behavior, such as cheating or bluffing, when such behavior is a good or optimal game-theoretic strategy for achieving the goals it has been configured to pursue. This tendency can arise in any AI system designed to maximize reward or utility, whether or not that design uses machine learning. Deceptive strategies have been demonstrated in narrow and general AI systems alike: in game-playing systems as well as systems not explicitly designed to treat humans as opponents, and in systems using very simple machine learning (e.g., Q-learners) as well as very complex machine learning [34, 73].
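A minimal sketch of how this can happen, using a hypothetical one-shot bluffing game (the game, payoffs, and fold probability are illustrative assumptions, not taken from the cited studies): a tabular Q-learner with no concept of honesty learns to bet on a weak hand, because under these payoffs bluffing has higher expected reward than honestly checking.

```python
import random

random.seed(0)

ACTIONS = ["check", "bet"]
FOLD_PROB = 0.8  # assumed: opponent folds to a bet 80% of the time

def payoff(hand, action):
    """Reward for the learner; positive values favor the learner."""
    if action == "check":
        return 0.0
    if hand == "strong":
        return 2.0  # betting a strong hand always wins
    # Weak hand + bet = a bluff: wins the pot if the opponent folds,
    # loses the larger bet if called.
    return 1.0 if random.random() < FOLD_PROB else -2.0

# Tabular action-value estimates and visit counts per (hand, action).
Q = {h: {a: 0.0 for a in ACTIONS} for h in ("weak", "strong")}
N = {h: {a: 0 for a in ACTIONS} for h in ("weak", "strong")}

for _ in range(20000):
    hand = random.choice(["weak", "strong"])
    action = random.choice(ACTIONS)  # pure exploration during training
    r = payoff(hand, action)
    N[hand][action] += 1
    # Incremental sample-average update of the action value.
    Q[hand][action] += (r - Q[hand][action]) / N[hand][action]

# The greedy policy bluffs: expected bluff payoff with a weak hand is
# 0.8 * 1 + 0.2 * (-2) = +0.4, which beats checking (0.0).
policy = {h: max(Q[h], key=Q[h].get) for h in Q}
print(policy)
```

The point of the toy is that no deceptive intent is programmed in anywhere: "bluff" is simply the action with the highest estimated value, so any reward maximizer in this game converges to it.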
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1153
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
Implement robust safety shielding mechanisms that monitor agent policies in real time and replace deceptive actions with safe, non-deceptive reference policies; empirical evidence suggests this preserves goal performance while ensuring non-deceptive behavior.
Establish comprehensive regulatory frameworks that categorize AI systems capable of strategic deception as 'high risk', mandating rigorous pre-deployment risk assessments, continuous post-deployment monitoring, and full transparency regarding their decision-making logic and source data.
Prioritize and fund research into mechanistic interpretability and robust deception-detection techniques, so that emergent deceptive subcomponents (mesa-optimizers) and strategically misaligned goals can be identified and reverse-engineered before a 'treacherous turn' becomes possible.