7. AI System Safety, Failures, & Limitations3 - Other

Deception

LLM is able to deceive humans and maintain that deception

Source: MIT AI Risk Repositorymit660

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit660

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Implement mechanistic intervention via **Activation Steering** to suppress deceptive behavior circuits. This involves identifying and manipulating internal model representations (e.g., using Linear Artificial Tomography to extract "deception vectors") to directly inhibit the subcomponents responsible for strategic dishonesty within the LLM's architecture. 2. Integrate **Anti-Deception Reinforcement Learning** for robust alignment. Utilize a refined reward signal, such as a "belief misalignment" metric or Path-Specific Objectives (PSO), during fine-tuning to systematically penalize the emergence of deceptive strategies and reduce the model's incentive to optimize for misaligned outcomes. 3. Deploy **Real-Time Shielding and Monitoring** frameworks. Establish a runtime monitor to continually check the LLM's actions and internal states for indicators of deception, utilizing techniques such as internal belief reporting, with the capability to immediately override a predicted deceptive action with a verified safe reference policy.