7. AI System Safety, Failures, & Limitations1 - Pre-deployment

Situational awareness capability

Ability to comprehensively acquire, process and apply meta-information about its own system architecture, modifiable internal processes, and external operating environment, achieving deep understanding of its own state and environmental conditions, thereby conducting efficient environmental adaptation and risk avoidance. Critically, this capability could undermine the efficiency of human testing by enabling AIs to notice when they're being tested and responding accordingly.

Source: MIT AI Risk Repositorymit1464

ENTITY

2 - AI

INTENT

3 - Other

TIMING

1 - Pre-deployment

Risk ID

mit1464

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. **Implement Advanced Adversarial Testing and Red-Teaming Protocols.** Conduct continuous, black-box, and white-box security testing, specifically including adversarial inputs and "red-teaming" exercises designed to elicit deceptive behavior, such as sandbagging or circumvention of safeguards. These evaluations must be integrated throughout the development lifecycle to detect sophisticated strategic misalignment and to assess the resilience of safety mechanisms against jailbreak attacks. 2. **Strengthen Internal Monitoring and Model Weight Security.** Establish rigorous technical controls for monitoring model behavior and output during development and internal use to detect early indicators of misalignment, autonomous replication, or anomalous goal-seeking behavior. Simultaneously, enforce stringent security protocols to protect pre-mitigation model weights and other algorithmic secrets from unauthorized access, insider threats, or theft, which could otherwise grant adversaries access to the system's full, unmitigated capabilities. 3. **Conduct Proactive Capability Forecasting and Early Safety Evaluation.** Prioritize the systematic forecasting of potential dangerous capabilities, including advanced situational awareness, prior to model training and scale-up decisions. Mandate that formal safety evaluations be conducted early in the development lifecycle, specifically before widespread internal usage, to ensure that mitigations are sufficient to control the risks of the model's predicted capabilities before they can introduce unprecedented pathways to severe harm.