7. AI System Safety, Failures, & Limitations - Post-deployment

Emergent functionality

Capabilities and novel functionality can spontaneously emerge... even though these capabilities were not anticipated by system designers. If we do not know what capabilities systems possess, they become harder to control or to deploy safely. Indeed, unintended latent capabilities may only be discovered during deployment. If any of these capabilities are hazardous, the effect may be irreversible.

Source: MIT AI Risk Repository (mit574)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit574

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. **Architectural Safety Measures and Pre-Deployment Testing.** Implement a comprehensive, emergence-aware risk categorization and testing strategy, including **scenario planning** and **adversarial testing**, to systematically explore potential novel and unpredicted behaviors before deployment. Architecturally, integrate **output filtering** and **behavioral constraints** (hard limits) to prevent the expression of problematic emergent capabilities, and deploy **tripwire systems** that alert when the system's behavior approaches critical safety thresholds.

2. **Robust Post-Deployment Control and Human Oversight.** Establish rigorous **continuous monitoring** and auditing of the deployed system for deviations or other signs of emergent behavior. Maintain a **human-in-the-loop (HITL) oversight** model, particularly for high-stakes decision-making. Develop and thoroughly test **containment protocols** and **regular reversion** capabilities so that any hazardous, unintended latent capability can be immediately restricted, or the system safely reset to a known good state.

3. **System Structure and Authority Limitation.** Apply the cybersecurity principle of **least privilege** (least authority) to AI agents, granting only the minimum permissions required for their intended function and requiring human review or sign-off for all sensitive actions. Where system requirements allow, prioritize **simplified model architectures** and carefully constrain the complexity of the training data to mitigate the conditions that facilitate unpredictable strong emergence.
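To make the architectural measures in strategy 1 concrete, the following is a minimal, hypothetical sketch of an output filter combined with a tripwire alert and a hard behavioral constraint. The pattern list, the risk scorer, and the threshold are all illustrative assumptions, not part of any real safety framework; a production system would use far richer detectors and route alerts to human reviewers (the HITL model of strategy 2).

```python
from dataclasses import dataclass, field

# Illustrative assumptions: a tiny blocklist and a naive risk scorer.
BLOCKED_PATTERNS = ("rm -rf", "DROP TABLE")  # assumed hazardous outputs
TRIPWIRE_THRESHOLD = 0.5                     # assumed risk-score limit

@dataclass
class SafetyWrapper:
    """Wraps model outputs with a filter, a hard limit, and a tripwire."""
    alerts: list = field(default_factory=list)

    def risk_score(self, output: str) -> float:
        # Stand-in scorer: fraction of blocked patterns present in the output.
        hits = sum(p in output for p in BLOCKED_PATTERNS)
        return hits / len(BLOCKED_PATTERNS)

    def filter_output(self, output: str) -> str:
        score = self.risk_score(output)
        if score >= TRIPWIRE_THRESHOLD:
            # Tripwire: record an alert for human review (HITL oversight).
            self.alerts.append(f"tripwire fired: score={score:.2f}")
        if any(p in output for p in BLOCKED_PATTERNS):
            # Behavioral constraint: hard-block the problematic output.
            return "[output withheld pending human review]"
        return output

wrapper = SafetyWrapper()
print(wrapper.filter_output("hello world"))       # → hello world
print(wrapper.filter_output("run rm -rf / now"))  # → [output withheld pending human review]
print(len(wrapper.alerts))                        # → 1
```

The same wrapper shape also accommodates strategy 3's least-privilege idea: the gate that withholds output is the natural place to require an explicit human sign-off before any sensitive action is executed.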