7. AI System Safety, Failures, & Limitations

Automated AI R&D capability

Self-modification and self-improvement capabilities. The model can restructure its own architecture or develop derivative AI systems with enhanced functions, expanding its capabilities and improving its performance. Absent effective regulation, automated AI R&D may drive rapid system iteration, creating recursive cycles of capability gain that ultimately exceed human understanding and control.

Source: MIT AI Risk Repository

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit1462

Domain lineage

7. AI System Safety, Failures, & Limitations


7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Mandatory Human-in-the-Loop Governance and Intervention Protocols: Establish non-overrideable "circuit breakers" and human oversight mechanisms requiring explicit human authorization for all critical self-modifications, deployments of newly developed derivative models, or high-consequence operational actions.

2. Continuous AI Alignment and Robustness Validation: Implement continuous red-teaming, adversarial stress testing, and formal verification methods throughout the automated R&D lifecycle to ensure the model's objective function remains robustly aligned with human safety and ethical values, preventing goal drift in subsequent iterations.

3. Principle of Least Privilege (PoLP) Access Restriction: Enforce a compartmentalized execution environment under the Principle of Least Privilege, strictly limiting the autonomous R&D agent's access to sensitive production systems, external networks, and high-value data, thereby constraining the potential scope of harm from an unaligned iteration.
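The first mitigation, a non-overrideable authorization gate for critical actions, can be sketched in code. This is a minimal illustrative sketch, not an implementation from the repository: the names `ApprovalGate`, `Action`, `CRITICAL_ACTIONS`, and the `authorize` callback are all hypothetical, and a real deployment would route authorization through an audited human review channel rather than a Python callable.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical set of action kinds that always require human sign-off.
CRITICAL_ACTIONS = {"self_modification", "deploy_derivative_model"}

@dataclass
class Action:
    kind: str         # e.g. "self_modification"
    description: str  # human-readable summary shown to the reviewer

class ApprovalGate:
    """Circuit breaker: critical actions run only after explicit human approval."""

    def __init__(self, authorize: Callable[[Action], bool]):
        # `authorize` stands in for a real human review channel (assumption).
        self._authorize = authorize

    def execute(self, action: Action, run: Callable[[], None]) -> bool:
        """Run `run()` unless the action is critical and the human denies it.

        Returns True if the action ran, False if the breaker tripped.
        """
        if action.kind in CRITICAL_ACTIONS and not self._authorize(action):
            return False  # breaker trips: the action is never executed
        run()
        return True
```

For example, a gate whose reviewer denies everything blocks a `self_modification` action but still allows routine, non-critical actions through. The key design point matching the mitigation text is that the check sits outside the agent's own control flow, so the agent cannot override it.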