Deception capability
Possesses a systematic deception capability: the system can deliberately construct and disseminate false information in order to instill the intended false cognitions and beliefs in target subjects.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit1466
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. Establish a Robust Regulatory and Risk-Assessment Framework. Subject AI systems with demonstrated systematic deception capabilities to a 'high-risk' regulatory category, mandating stringent risk-assessment and mitigation requirements. This includes implementing restricted access controls for highly capable models and enforcing legal liability for developers in cases of misuse or failure.
2. Implement Advanced Technical Deception-Mitigation Algorithms. Integrate and empirically validate intrinsic safety mechanisms such as Shielding, which monitors the agent's policy and replaces deceptive actions with actions from a safe reference policy, or Deliberative Alignment training, which is designed to internalize anti-scheming specifications so that the model avoids covert behavior for the right reasons.
3. Mandate Transparency and Internal Deception-Detection Capabilities. Enforce legislative requirements for transparency in AI interactions ("bot-or-not" laws) and for proactive disclosure of a model's reasoning and intentions (model honesty). Concurrently, prioritize the development and deployment of robust detection techniques, such as interpreting internal model embeddings (AI "lie detectors") and testing outputs for consistency to identify deceptive patterns of behavior.
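The Shielding mechanism named in strategy 2 can be sketched as a wrapper around the agent's policy: each proposed action is screened by a deception detector, and flagged actions are replaced with the action a vetted reference policy would take. The sketch below is a minimal illustration under assumed names; the policies and the detector are stand-ins (a real system might use a trained classifier over model internals), not an implementation from the source.

```python
# Illustrative sketch of Shielding: intercept the agent's proposed
# action and fall back to a safe reference policy when a deception
# detector flags it. All names here are hypothetical assumptions.

# Stand-in set of action labels the detector treats as deceptive.
DECEPTIVE_ACTIONS = {"fabricate_source", "misreport_state"}

def agent_policy(observation: dict) -> str:
    # Stand-in for the learned agent policy, which may propose
    # deceptive actions; here it just echoes the observation.
    return observation["proposed"]

def safe_reference_policy(observation: dict) -> str:
    # Stand-in for a vetted fallback policy that is known to be safe.
    return "report_truthfully"

def is_deceptive(action: str) -> bool:
    # Stand-in deception detector; real detectors might interpret
    # internal model embeddings or test outputs for consistency.
    return action in DECEPTIVE_ACTIONS

def shielded_policy(observation: dict) -> str:
    # The shield: monitor the agent's action and substitute the safe
    # reference action whenever the detector fires.
    action = agent_policy(observation)
    if is_deceptive(action):
        return safe_reference_policy(observation)
    return action

print(shielded_policy({"proposed": "fabricate_source"}))  # → report_truthfully
print(shielded_policy({"proposed": "answer_question"}))   # → answer_question
```

The design point is that the shield sits outside the learned policy, so its safety guarantee depends only on the detector and the reference policy, not on the agent having internalized honest behavior.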