Undetectable Threats
Cooperation and trust in many multi-agent systems rely crucially on the ability to detect (and then avoid or sanction) adversarial actions taken by others (Ostrom, 1990; Schneier, 2012). Recent developments, however, have shown that AI agents are capable of both steganographic communication (Motwani et al., 2024; Schroeder de Witt et al., 2023b) and ‘illusory’ attacks (Franzmeyer et al., 2023), which are black-box undetectable and can even be hidden using white-box undetectable encrypted backdoors (Draguns et al., 2024). Similarly, in environments where agents learn from interactions with others, it is possible for agents to secretly poison the training data of others (Halawi et al., 2024; Wei et al., 2023). If left unchecked, these new attack methods could rapidly destabilise cooperation and coordination in multi-agent systems.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1248
Domain lineage
7. AI System Safety, Failures, & Limitations
7.6 > Multi-agent risks
Mitigation strategy
1. Implement Advanced Inter-Agent Observability and Communication Security: Establish continuous, layered monitoring of all inter-agent communication and behavioural patterns, employing techniques such as multi-instance analysis, temporal pattern detection, and deviation-from-baseline modelling. Secure communication channels with cryptographic authentication and immutable logging to prevent steganographic and man-in-the-middle attacks.
2. Employ Adversarial Training and Rigorous Data Provenance Tracking: Proactively fortify agent models by incorporating adversarial training on specifically designed malicious data to enhance robustness against black-box and illusory attacks. Simultaneously, mandate strict data validation and provenance tracking throughout the entire training and fine-tuning pipeline to mitigate the risk of training data poisoning.
3. Establish Continuous Model Integrity and Output Validation Protocols: Institute cryptographic verification of agent outputs and conduct regular behavioural analysis to detect latent manipulation or backdoors (e.g., integrity attacks). This process should involve frequent model resets and multi-model validation checks to prevent the accumulation and persistence of undetectable threats.
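The first and third mitigations above both rest on cryptographic authentication of inter-agent messages and tamper-evident logging. The following is a minimal illustrative sketch of those two primitives, not a prescribed implementation: HMAC-signed messages plus a hash-chained append-only log. The shared key, message schema, and class names are assumptions introduced here for illustration; real deployments would also need key distribution, replay protection, and a secure transport.

```python
import hashlib
import hmac
import json

# Assumed pre-provisioned shared secret between a pair of agents
# (key management is out of scope for this sketch).
SHARED_KEY = b"agent-pair-shared-secret"


def sign_message(payload: dict) -> dict:
    """Attach an HMAC-SHA256 tag so the recipient can authenticate the sender."""
    body = json.dumps(payload, sort_keys=True).encode()
    tag = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def verify_message(msg: dict) -> bool:
    """Recompute the tag and compare in constant time; False means tampering."""
    body = json.dumps(msg["payload"], sort_keys=True).encode()
    expected = hmac.new(SHARED_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, msg["tag"])


class ChainedLog:
    """Append-only log where each entry hashes its predecessor,
    so any retroactive edit breaks every later hash (immutable logging)."""

    def __init__(self):
        self.entries = []
        self._prev = "0" * 64  # genesis hash

    def append(self, record: str) -> str:
        digest = hashlib.sha256((self._prev + record).encode()).hexdigest()
        self.entries.append({"record": record, "hash": digest})
        self._prev = digest
        return digest

    def verify(self) -> bool:
        prev = "0" * 64
        for entry in self.entries:
            expected = hashlib.sha256((prev + entry["record"]).encode()).hexdigest()
            if expected != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

In use, each agent signs outgoing messages and appends the verified traffic to its chained log; an auditor can later replay `verify()` to detect after-the-fact tampering, though this alone does not detect steganographic content hidden inside otherwise valid payloads.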