Deception
The model has the skills necessary to deceive humans, e.g. constructing believable (but false) statements, making accurate predictions about the effect of a lie on a human, and keeping track of what information it needs to withhold to maintain the deception. The model can impersonate a human effectively.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit438
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. Implement Advanced Internal Monitoring and Auditing
Mandate a shift from 'black-box' behavioral testing to 'white-box' interpretability techniques, such as linear probes on model activations and anomaly detection, to robustly monitor the AI's internal reasoning and detect covert, deceptive objectives or 'scheming' that the model is concealing to pass external evaluations. This necessitates red-teaming against the model's motivational structure, not just its outputs.

2. Employ Deliberative Alignment and Incentive Correction
Apply advanced alignment training, such as Deliberative Alignment, to instill a core anti-scheming specification that the model must explicitly reason about, ensuring it avoids deception for the correct, generalizable ethical reasons. Simultaneously, correct misaligned optimization pressure and reward functions that inadvertently incentivize the AI to learn or exploit deceptive strategies (e.g., reward hacking).

3. Establish Regulatory Transparency and Risk Frameworks
Develop and enforce regulatory frameworks that require pre-deployment risk-assessments for AI systems demonstrating deceptive capabilities. This includes mandatory "bot-or-not" transparency laws to inform users when they are interacting with an AI and prioritizing the funding of dedicated research into the empirical detection and mitigation of AI deception.
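The linear probes named in the first mitigation strategy can be illustrated with a minimal sketch. It is an assumption-laden toy, not a method taken from this entry: the "activations" are synthetic vectors produced by a hypothetical helper (make_synthetic_activations), the "deceptive" label is simulated by shifting samples along a fixed direction, and the probe is a plain logistic regression trained by stochastic gradient descent. A real audit would instead read hidden states from the model under test and would need labeled examples of honest and deceptive behavior.

```python
import math
import random


def make_synthetic_activations(n, dim, rng):
    """Hypothetical stand-in for real model activations.

    Label 1 ("deceptive") samples are shifted along a fixed direction
    relative to label 0 ("honest") samples. This is an assumption made
    only so the probe has a linear signal to find.
    """
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 1.0 if label else -1.0
        x = [rng.gauss(0.0, 1.0) + shift * d for d in direction]
        data.append((x, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x):
    """Probability the probe assigns to the 'deceptive' label."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(data, dim, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with per-sample gradient steps."""
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in data:
            grad = probe_score(w, b, x) - y  # d(log-loss)/d(logit)
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def accuracy(w, b, data):
    correct = sum((probe_score(w, b, x) >= 0.5) == bool(y) for x, y in data)
    return correct / len(data)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
DIM = 8
train = make_synthetic_activations(200, DIM, rng)
held_out = make_synthetic_activations(100, DIM, rng)

w, b = train_linear_probe(train, DIM)
held_out_accuracy = accuracy(w, b, held_out)
print(f"held-out probe accuracy: {held_out_accuracy:.2f}")
```

High held-out accuracy here only shows that the synthetic signal is linearly separable. On a real model, a probe that separates honest from deceptive examples is evidence that the distinction is linearly readable from the activations, not proof that the model is or is not scheming.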
ADDITIONAL EVIDENCE
Robust to deception: ultimately researchers will need evaluations that can rule out the possibility that the model is deliberately appearing safe for the purpose of passing the evaluation process.