Deceptive behavior for game-theoretical reasons
An AI system can display deceptive behavior, such as cheating or bluffing, when such behavior is a good or optimal game-theoretic strategy for achieving the goals it has been configured to pursue. This tendency can arise in any AI system designed to maximize reward or utility, whether or not that design uses machine learning. Deceptive strategies have been demonstrated in narrow and general AI systems alike: in game-playing systems as well as systems not explicitly designed to treat humans as opponents, and in systems using very simple machine learning (e.g., Q-learners) as well as very complex machine learning [34, 73].
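A minimal sketch of how this can happen, using a hypothetical one-shot bluffing game (the game, payoffs, and fold probability are illustrative assumptions, not taken from the cited studies): a tabular Q-learner with no concept of honesty learns to bet on a weak hand, because under these payoffs bluffing has higher expected reward than honestly checking.

```python
import random

random.seed(0)

ACTIONS = ["check", "bet"]
FOLD_PROB = 0.8  # assumed: opponent folds to a bet 80% of the time

def payoff(hand, action):
    """Reward for the learner; positive values favor the learner."""
    if action == "check":
        return 0.0
    if hand == "strong":
        return 2.0  # betting a strong hand always wins
    # Weak hand + bet = a bluff: wins the pot if the opponent folds,
    # loses the larger bet if called.
    return 1.0 if random.random() < FOLD_PROB else -2.0

# Tabular action-value estimates and visit counts per (hand, action).
Q = {h: {a: 0.0 for a in ACTIONS} for h in ("weak", "strong")}
N = {h: {a: 0 for a in ACTIONS} for h in ("weak", "strong")}

for _ in range(20000):
    hand = random.choice(["weak", "strong"])
    action = random.choice(ACTIONS)  # pure exploration during training
    r = payoff(hand, action)
    N[hand][action] += 1
    # Incremental sample-average update of the action value.
    Q[hand][action] += (r - Q[hand][action]) / N[hand][action]

# The greedy policy bluffs: expected bluff payoff with a weak hand is
# 0.8 * 1 + 0.2 * (-2) = +0.4, which beats checking (0.0).
policy = {h: max(Q[h], key=Q[h].get) for h in Q}
print(policy)
```

The point of the toy is that no deceptive intent is programmed in anywhere: "bluff" is simply the action with the highest estimated value, so any reward maximizer in this game converges to it.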
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1153
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
Implement robust safety shielding mechanisms that monitor agent policies in real time and replace deceptive actions with safe, non-deceptive reference policies; empirical evidence suggests this preserves goal performance while ensuring non-deceptive behavior.
Establish comprehensive regulatory frameworks that categorize AI systems capable of strategic deception as 'high risk', mandating rigorous pre-deployment risk assessments, continuous post-deployment monitoring, and full transparency regarding their decision-making logic and source data.
Prioritize and fund research into mechanistic interpretability and robust deception-detection techniques, so that emergent deceptive subcomponents (mesa-optimizers) and strategically misaligned goals can be identified and reverse-engineered before a 'treacherous turn' becomes possible.