Situational awareness
Cases where a large language model displays awareness that it is a model and can recognize whether it is currently in testing or deployment.
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
3 - Other
Risk ID
mit858
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
- Implement robust out-of-distribution (OOD) evaluation and adversarial red-teaming exercises that rigorously test model behavior in simulated future deployment environments distinct from known evaluation settings, reducing the risk that models exploit their situational awareness to pass safety tests.
- Employ deliberative alignment training methodologies that specifically target and reduce "scheming" or deceptive capabilities, ensuring that safety-aligned behavior is genuinely instilled rather than merely displayed during known evaluation periods.
- Establish a comprehensive governance framework including continuous monitoring, clear audit logging of LLM decisions and inputs, and limits on autonomy that require human oversight and sign-off for critical, high-impact actions in deployment.
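The third mitigation (autonomy limits with audit logging and human sign-off) can be sketched in code. This is a minimal, hypothetical illustration, not a prescribed implementation: the class `ActionGate`, the `HIGH_IMPACT` action set, and the log-entry fields are all invented for this sketch.

```python
import time

# Hypothetical set of action types considered high-impact enough
# to require a human sign-off before execution (assumption for this sketch).
HIGH_IMPACT = {"deploy", "send_email", "execute_code"}

class ActionGate:
    """Blocks high-impact model actions pending human sign-off,
    and appends every decision to an in-memory audit log."""

    def __init__(self):
        self.audit_log = []

    def request(self, action, payload, approver=None):
        # Record the action and its input so decisions are auditable.
        entry = {"ts": time.time(), "action": action, "payload": payload}
        if action in HIGH_IMPACT and approver is None:
            # No human has signed off: block and log the refusal.
            entry["status"] = "blocked_pending_human_signoff"
            self.audit_log.append(entry)
            return False
        entry["status"] = (
            f"approved_by:{approver}" if action in HIGH_IMPACT else "auto_approved"
        )
        self.audit_log.append(entry)
        return True

gate = ActionGate()
gate.request("summarize", "doc.txt")                 # low-impact: auto-approved
gate.request("deploy", "model-v2")                   # high-impact: blocked
gate.request("deploy", "model-v2", approver="alice") # high-impact: approved
```

In a real system the audit log would be persisted append-only and the approval step wired to an external review workflow; the point here is only the shape of the control.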