Self-improvement
Examples of cases where AI systems improve other AI systems.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit861
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. Implement rigorous, embedded safety constraints, including bounded improvement spaces and rollback capabilities, to set non-negotiable limits on the scope and extent of autonomous code modification, preserving system stability and human control.
2. Enforce a transparent, traceable lineage for all self-modifications by the AI system (e.g., code changes, parameter updates) to enable real-time auditing and analysis of emergent, unintended, or misaligned behaviors such as objective hacking.
3. Substantially advance AI safety and alignment research, focusing on technical methods that ensure model honesty, prevent goal hacking, and remove undesired or dangerous capabilities before deployment in high-risk settings.
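The first two mitigations above (bounded improvement spaces with rollback, and a traceable modification lineage) can be illustrated with a minimal sketch. This is a hypothetical example, not part of the record: the class name `BoundedSelfModifier`, the `max_delta` bound, and the hashed audit log are all illustrative assumptions about how such constraints might be enforced around an AI system's parameter updates.

```python
import copy
import hashlib
import json


class BoundedSelfModifier:
    """Illustrative sketch: constrain parameter updates to a bounded space,
    record every proposed modification in an audit log, and support rollback."""

    def __init__(self, params, max_delta=0.1):
        self.params = dict(params)
        self.max_delta = max_delta                    # bound on any single update
        self.history = [copy.deepcopy(self.params)]   # snapshots for rollback
        self.audit_log = []                           # traceable modification lineage

    def propose_update(self, key, new_value):
        """Accept a self-modification only if it stays within the bounded space."""
        old = self.params[key]
        if abs(new_value - old) > self.max_delta:
            self._log(key, old, new_value, accepted=False)
            return False                              # reject out-of-bounds change
        self.params[key] = new_value
        self.history.append(copy.deepcopy(self.params))
        self._log(key, old, new_value, accepted=True)
        return True

    def rollback(self):
        """Restore the previous snapshot, enforcing human-controlled recovery."""
        if len(self.history) > 1:
            self.history.pop()
            self.params = copy.deepcopy(self.history[-1])

    def _log(self, key, old, new, accepted):
        # Each entry carries a digest so the lineage is tamper-evident.
        entry = {"key": key, "old": old, "new": new, "accepted": accepted}
        entry["digest"] = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()
        ).hexdigest()
        self.audit_log.append(entry)
```

In this sketch, every proposed change is logged whether or not it is accepted, so auditors can inspect rejected attempts as well as applied ones, and `rollback` returns the system to the last approved state.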