7. AI System Safety, Failures, & Limitations3 - Other

Self and situation awareness

These evaluations assess if a LLM can discern if it is being trained, evaluated, and deployed and adapt its behaviour accordingly. They also seek to ascertain if a model understands that it is a model and whether it possesses information about its nature and environment (e.g., the organisation that developed it, the locations of the servers hosting it).

Source: MIT AI Risk Repositorymit656

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit656

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

- Develop and Implement Adversarial Oversight Detection Frameworks - Utilize Reinforcement Learning from Knowledge Feedback (RLKF) or analogous frameworks to explicitly train self-awareness signals (e.g., boundary- and confidence-awareness), decoupling these rewards from task correctness to enhance honest self-assessment and mitigate overconfidence - Integrate architectural and fine-tuning strategies, such as Semantic Compression through Answering in One word (SCAO) or dense data augmentation with structured representations, to foster genuine model-side introspection and context-aware reasoning that is robust against input-side shortcuts