Groups of LLM-Agents May Show Emergent Functionality
Multi-agent learning, whether through explicit fine-tuning or implicit in-context learning, may enable LLM-agents to influence one another during their interactions (Foerster et al., 2018). In some environmental settings, this can create feedback loops that produce novel, emergent behaviors which would not manifest in the absence of multi-agent interaction (Hammond et al., 2024, Section 3.6). Emergent functionality poses a safety risk in two ways. First, it may itself be dangerous (Shevlane et al., 2023). Second, it makes assurance harder, since emergent behaviors are difficult to predict and guard against in advance (Ecoffet et al., 2020).
ENTITY
3 - Other
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit1486
Domain lineage
7. AI System Safety, Failures, & Limitations
7.6 > Multi-agent risks
Mitigation strategy
1. Conduct Comprehensive System-Level Validation and Adversarial Red Teaming: Systematically employ staged testing, simulations, and red-teaming protocols pre-deployment to uncover novel and dangerous emergent capabilities arising from multi-agent interactions, ensuring residual risk is acceptable.
2. Enforce Runtime Guardrails and Least-Privilege Access Controls: Implement strict technical controls, such as runtime LLM firewalls, validation layers, and least-privilege permissions (access to APIs and tools), to confine agent autonomy and constrain emergent behavior within predefined safety boundaries.
3. Establish Continuous Behavioral Monitoring and Governance: Deploy continuous oversight mechanisms and behavioral analytics to track agent interactions, detect deviations from intended performance, and monitor for signs of emergent misalignment or in-context reward hacking during post-deployment operation.
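The least-privilege control in mitigation 2 can be sketched as a simple permission gate between an agent and its tool registry. This is a minimal illustrative example, not tied to any specific agent framework; the role names, tool names, and the `guarded_call` helper are all hypothetical assumptions introduced here.

```python
# Hypothetical least-privilege tool gate for LLM agents (illustrative only).
# Each agent role is mapped to an explicit allowlist of tools; any call to a
# tool outside that allowlist is rejected before it can execute.

ALLOWED_TOOLS = {
    "researcher": {"web_search", "read_file"},  # assumed example roles/tools
    "executor": {"read_file"},
}

def guarded_call(agent_role, tool_name, tool_registry, **kwargs):
    """Invoke a tool only if the agent's role explicitly permits it."""
    permitted = ALLOWED_TOOLS.get(agent_role, set())
    if tool_name not in permitted:
        raise PermissionError(f"{agent_role!r} may not call {tool_name!r}")
    return tool_registry[tool_name](**kwargs)

# Stub tools standing in for real API integrations.
tools = {
    "web_search": lambda query: f"results for {query}",
    "read_file": lambda path: f"contents of {path}",
}

print(guarded_call("researcher", "web_search", tools, query="emergent behavior"))
try:
    guarded_call("executor", "web_search", tools, query="emergent behavior")
except PermissionError as err:
    print("blocked:", err)
```

Keeping the allowlist as static, auditable data (rather than letting agents negotiate permissions at runtime) is the point of the pattern: it bounds what any emergent multi-agent behavior can actually do.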