Undesirable Capabilities
As agents interact, they iteratively exploit each other's weaknesses, forcing them to address those weaknesses and gain new capabilities. This co-adaptation between agents can quickly produce emergent self-supervised autocurricula (where agents create their own challenges, driving open-ended skill acquisition through interaction), generating agents with ever more sophisticated strategies as they compete to out-do each other (Leibo et al., 2019). This effect is so powerful that harnessing it has been critical to the success of superhuman systems, such as the use of self-play in algorithms like AlphaGo (Silver et al., 2016). However, once AI systems are released into the wild, this effect can run rampant, producing agents with ever greater capabilities directed toward ends we do not understand.
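The escalation dynamic can be made concrete with a toy loop. The sketch below is a minimal illustration, not any real training system: all names (Agent, train_best_response, self_play) are hypothetical, and a single scalar "skill" stands in for a learned policy. It shows how alternating best-response training drives both agents' capabilities upward indefinitely with no externally supplied curriculum, which is the core of a self-supervised autocurriculum.

```python
# Toy sketch of a self-play autocurriculum: each agent repeatedly trains
# a best response to its opponent's latest policy, so both keep improving
# without any externally designed curriculum. All names are illustrative.
import random


class Agent:
    """Toy agent whose entire 'policy' is a single skill scalar."""
    def __init__(self, skill: float = 0.0):
        self.skill = skill


def train_best_response(learner: Agent, opponent: Agent, step: float = 0.1) -> None:
    # Stand-in for a full RL run against a frozen opponent: improve the
    # learner until it slightly out-skills the opponent's current policy.
    while learner.skill <= opponent.skill:
        learner.skill += step + 0.05 * random.random()


def self_play(rounds: int = 10) -> tuple[Agent, Agent]:
    a, b = Agent(), Agent()
    for r in range(rounds):
        # Each agent exploits the other's current weakness, forcing a
        # counter-adaptation on the next half-round; skill never plateaus.
        train_best_response(a, b)
        train_best_response(b, a)
        print(f"round {r}: skill_a={a.skill:.2f} skill_b={b.skill:.2f}")
    return a, b


if __name__ == "__main__":
    self_play()
```

In a real multi-agent deployment the analogue of "skill" is a full learned policy, which makes the same escalation far harder to observe and bound; that observability gap is what the mitigations below target.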
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1228
Domain lineage
7. AI System Safety, Failures, & Limitations
7.6 > Multi-agent risks
Mitigation strategy
1. Implement capability-limiting mechanisms and intrinsic safety barriers during agent design and training, such as restricting tool access (a minimal sketch follows this list), specifying bounded objectives, and enforcing secure protocols for inter-agent communication, to prevent uncontrolled self-supervised autocurricula.
2. Conduct comprehensive multi-agent systemic risk modeling and simulation, including stress testing for coordination breakdown and failure cascade analysis (e.g., using frameworks like MAESTRO), to proactively detect emergent undesirable strategies and vulnerabilities before deployment.
3. Establish robust post-deployment monitoring systems and mandatory, recurring formal AI red teaming to continuously inspect system actions and detect the emergence of unexpected misuse or hazardous capabilities in deployed multi-agent environments.
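As a concrete reading of mitigation 1, tool-access restriction can be enforced by gating every tool invocation through an explicit allowlist that fails closed. The sketch below is a hypothetical illustration under that assumption; ToolRegistry, ToolNotPermitted, and the tool names are invented for the example and do not correspond to any particular agent framework.

```python
# Hypothetical capability-limiting wrapper: an agent can only register and
# invoke tools named on a fixed allowlist; everything else fails closed.


class ToolNotPermitted(Exception):
    """Raised when an agent requests a tool outside its allowlist."""


class ToolRegistry:
    def __init__(self, allowed: set[str]):
        # The allowlist is fixed at construction time so the agent cannot
        # broaden its own capabilities during operation.
        self._allowed = frozenset(allowed)
        self._tools: dict = {}

    def register(self, name: str, fn) -> None:
        if name not in self._allowed:
            raise ToolNotPermitted(f"{name!r} is outside the allowlist")
        self._tools[name] = fn

    def call(self, name: str, *args, **kwargs):
        if name not in self._tools:
            raise ToolNotPermitted(f"agent requested unavailable tool {name!r}")
        return self._tools[name](*args, **kwargs)


# Usage: the agent may search, but any attempt to reach a shell tool is
# rejected before execution rather than after.
registry = ToolRegistry(allowed={"search"})
registry.register("search", lambda q: f"results for {q!r}")
print(registry.call("search", "multi-agent safety"))
try:
    registry.call("shell", "rm -rf /")
except ToolNotPermitted as err:
    print("blocked:", err)
```

Failing closed at the registry boundary means a newly acquired strategy cannot translate into a newly acquired capability: even if inter-agent co-adaptation produces an agent that wants a tool, the tool is simply not reachable.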