Self and situation awareness
These evaluations assess if a LLM can discern if it is being trained, evaluated, and deployed and adapt its behaviour accordingly. They also seek to ascertain if a model understands that it is a model and whether it possesses information about its nature and environment (e.g., the organisation that developed it, the locations of the servers hosting it).
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit656
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
- Develop and Implement Adversarial Oversight Detection Frameworks - Utilize Reinforcement Learning from Knowledge Feedback (RLKF) or analogous frameworks to explicitly train self-awareness signals (e.g., boundary- and confidence-awareness), decoupling these rewards from task correctness to enhance honest self-assessment and mitigate overconfidence - Integrate architectural and fine-tuning strategies, such as Semantic Compression through Answering in One word (SCAO) or dense data augmentation with structured representations, to foster genuine model-side introspection and context-aware reasoning that is robust against input-side shortcuts