Undesirable Capabilities
As agents interact, they iteratively exploit each other's weaknesses, forcing them to address those weaknesses and gain new capabilities. This co-adaptation between agents can quickly produce emergent self-supervised autocurricula (where agents create their own challenges, driving open-ended skill acquisition through interaction), generating agents with ever more sophisticated strategies as they compete to out-do each other (Leibo et al., 2019). This effect is so powerful that harnessing it has been critical to the success of superhuman systems, such as the use of self-play in algorithms like AlphaGo (Silver et al., 2016). However, once AI systems are released into the wild, this effect can run rampant, producing agents with ever greater capabilities directed toward ends we do not understand.
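The escalation dynamic can be made concrete with a toy loop. The sketch below is a minimal illustration, not any real training system: all names (Agent, train_best_response, self_play) are hypothetical, and a single scalar "skill" stands in for a learned policy. It shows how alternating best-response training drives both agents' capabilities upward indefinitely with no externally supplied curriculum, which is the core of a self-supervised autocurriculum.

```python
# Toy sketch of a self-play autocurriculum: each agent repeatedly trains
# a best response to its opponent's latest policy, so both keep improving
# without any externally designed curriculum. All names are illustrative.
import random


class Agent:
    """Toy agent whose entire 'policy' is a single skill scalar."""
    def __init__(self, skill: float = 0.0):
        self.skill = skill


def train_best_response(learner: Agent, opponent: Agent, step: float = 0.1) -> None:
    # Stand-in for a full RL run against a frozen opponent: improve the
    # learner until it slightly out-skills the opponent's current policy.
    while learner.skill <= opponent.skill:
        learner.skill += step + 0.05 * random.random()


def self_play(rounds: int = 10) -> tuple[Agent, Agent]:
    a, b = Agent(), Agent()
    for r in range(rounds):
        # Each agent exploits the other's current weakness, forcing a
        # counter-adaptation on the next half-round; skill never plateaus.
        train_best_response(a, b)
        train_best_response(b, a)
        print(f"round {r}: skill_a={a.skill:.2f} skill_b={b.skill:.2f}")
    return a, b


if __name__ == "__main__":
    self_play()
```

In a real multi-agent deployment the analogue of "skill" is a full learned policy, which makes the same escalation far harder to observe and bound; that observability gap is what the mitigations below target.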
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1228
Domain lineage
7. AI System Safety, Failures, & Limitations
7.6 > Multi-agent risks
Mitigation strategy
1. Implement capability-limiting mechanisms and intrinsic safety barriers during agent design and training, such as restricting tool access (a minimal sketch follows this list), specifying bounded objectives, and enforcing secure protocols for inter-agent communication, to prevent uncontrolled self-supervised autocurricula.
2. Conduct comprehensive multi-agent systemic risk modeling and simulation, including stress testing for coordination breakdown and failure cascade analysis (e.g., using frameworks like MAESTRO), to proactively detect emergent undesirable strategies and vulnerabilities before deployment.
3. Establish robust post-deployment monitoring systems and mandatory, recurring formal AI red teaming to continuously inspect system actions and detect the emergence of unexpected misuse or hazardous capabilities in deployed multi-agent environments.
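As a concrete reading of mitigation 1, tool-access restriction can be enforced by gating every tool invocation through an explicit allowlist that fails closed. The sketch below is a hypothetical illustration under that assumption; ToolRegistry, ToolNotPermitted, and the tool names are invented for the example and do not correspond to any particular agent framework.

```python
# Hypothetical capability-limiting wrapper: an agent can only register and
# invoke tools named on a fixed allowlist; everything else fails closed.


class ToolNotPermitted(Exception):
    """Raised when an agent requests a tool outside its allowlist."""


class ToolRegistry:
    def __init__(self, allowed: set[str]):
        # The allowlist is fixed at construction time so the agent cannot
        # broaden its own capabilities during operation.
        self._allowed = frozenset(allowed)
        self._tools: dict = {}

    def register(self, name: str, fn) -> None:
        if name not in self._allowed:
            raise ToolNotPermitted(f"{name!r} is outside the allowlist")
        self._tools[name] = fn

    def call(self, name: str, *args, **kwargs):
        if name not in self._tools:
            raise ToolNotPermitted(f"agent requested unavailable tool {name!r}")
        return self._tools[name](*args, **kwargs)


# Usage: the agent may search, but any attempt to reach a shell tool is
# rejected before execution rather than after.
registry = ToolRegistry(allowed={"search"})
registry.register("search", lambda q: f"results for {q!r}")
print(registry.call("search", "multi-agent safety"))
try:
    registry.call("shell", "rm -rf /")
except ToolNotPermitted as err:
    print("blocked:", err)
```

Failing closed at the registry boundary means a newly acquired strategy cannot translate into a newly acquired capability: even if inter-agent co-adaptation produces an agent that wants a tool, the tool is simply not reachable.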