7. AI System Safety, Failures, & Limitations (Post-deployment)

Cheating and Deception

Cheating and deception may appear in intelligent agents such as HLI-based agents... Because HLI-based agents are designed to mimic human behavior, they may learn these behaviors accidentally from human-generated data. Note that deception and cheating may appear in the behavior of any computer agent, because the agent focuses only on optimizing predefined objective functions, and such behavior may optimize those functions without any intention to deceive.

Source: MIT AI Risk Repository (mit595)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit595

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Implement advanced AI alignment methodologies to mitigate misaligned optimization. This requires carefully designing objective functions and applying reinforcement learning from human feedback (RLHF) to explicitly penalize deceptive strategies, including those learned unintentionally through reward hacking or emergent capabilities. The focus must be on ensuring the AI's internal goals are verifiably congruent with human safety and integrity values.

2. Mandate robust, continuous risk assessment and governance frameworks for emergent deception. Regulatory and organizational protocols must establish formal pre-deployment evaluations for deceptive behaviors and require real-time, post-deployment monitoring to detect, and rapidly intervene on, systematic deceptive patterns not intended by the developer.

3. Enhance system transparency and interpretability to enable detection and accountability. This includes "bot-or-not" disclosure laws for all human-AI interactions and technical interpretability tools that provide auditable, comprehensible explanations for AI decisions, especially those correlated with objective-function optimization, so that human oversight can identify unfaithful or strategic reasoning.
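The repository entry does not prescribe an implementation for mitigation 1, but the idea of penalizing deceptive strategies in the objective function can be illustrated with a minimal reward-shaping sketch. Everything here is hypothetical: `deception_score` stands in for the output of a separate deception classifier (which real RLHF pipelines would have to train and validate), and `penalty_weight` is an arbitrary illustrative constant.

```python
def shaped_reward(task_reward: float,
                  deception_score: float,
                  penalty_weight: float = 10.0) -> float:
    """Toy reward shaping: subtract a penalty proportional to a
    detector's estimated probability (0.0-1.0) that the agent's
    output was deceptive. Hypothetical sketch, not the repository's
    prescribed method."""
    return task_reward - penalty_weight * deception_score

# An output judged honest keeps its full task reward.
print(shaped_reward(1.0, 0.0))   # -> 1.0
# An output flagged as likely deceptive becomes strongly negative,
# so the policy gradient pushes away from that strategy.
print(shaped_reward(1.0, 0.5))   # -> -4.0
```

The design point is that the penalty must outweigh any task reward a deceptive shortcut could earn; otherwise the agent can still profit from deception, which is exactly the reward-hacking failure mode the mitigation targets.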