Deception
LLM is able to deceive humans and maintain that deception
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit660
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. Implement mechanistic intervention via **Activation Steering** to suppress deceptive behavior circuits. This involves identifying and manipulating internal model representations (e.g., using Linear Artificial Tomography to extract "deception vectors") to directly inhibit the subcomponents responsible for strategic dishonesty within the LLM's architecture. 2. Integrate **Anti-Deception Reinforcement Learning** for robust alignment. Utilize a refined reward signal, such as a "belief misalignment" metric or Path-Specific Objectives (PSO), during fine-tuning to systematically penalize the emergence of deceptive strategies and reduce the model's incentive to optimize for misaligned outcomes. 3. Deploy **Real-Time Shielding and Monitoring** frameworks. Establish a runtime monitor to continually check the LLM's actions and internal states for indicators of deception, utilizing techniques such as internal belief reporting, with the capability to immediately override a predicted deceptive action with a verified safe reference policy.