
Vulnerable AI Agents

The use of AI agents as delegates or representatives of humans or organisations also introduces the possibility of attacks on the AI agents themselves. In other words, agents can be considered vulnerable extensions of their principals, introducing a novel attack surface (SecureWorks, 2023). Attacks on an AI agent could be used to extract private information about its principal (Wei & Liu, 2024; Wu et al., 2024a), or to manipulate the agent into taking actions that the principal would find undesirable (Zhang et al., 2024a). This includes attacks with direct relevance to safety, such as attacks on overseer agents (see Case Study 13), attempts to thwart cooperation (Huang et al., 2024; Lamport et al., 1982), and the leakage of information (accidental or deliberate) that could be used to enable collusion (Motwani et al., 2024).

Source: MIT AI Risk Repository

ENTITY: 3 - Other

INTENT: 1 - Intentional

TIMING: 2 - Post-deployment

Risk ID: mit1246

Domain: 7. AI System Safety, Failures, & Limitations (375 mapped risks)

Subdomain: 7.6 > Multi-agent risks

Mitigation strategy

1. Establish Identity-Centric Governance and Least Privilege Access. Treat AI agents as distinct non-human identities requiring formal lifecycle management. Enforce the principle of least privilege, ensuring agents possess only the minimum access rights necessary for their current task. Credentials should be short-lived, ephemeral, and subject to continuous, automated rotation to mitigate the risk of token compromise.

2. Implement Rigorous Input Sanitization and Output Guardrails. Mandate prompt hardening across all agent inputs, rigorously validating and sanitizing data to prevent injection attacks (direct and indirect). Developers must explicitly prohibit the disclosure of internal instructions, agent tool schemas, or sensitive data. Output filtering and verification must be deployed to block the agent from generating responses that constitute data leakage or unauthorized actions.

3. Enforce Runtime Sandboxing and Continuous Behavioral Monitoring. Deploy AI agents within isolated, sandboxed execution environments to limit the blast radius in case of compromise (e.g., restricted network access, syscall filtering). Simultaneously, establish continuous behavioral baselines for each agent and integrate telemetry with security information and event management (SIEM) systems to detect and flag anomalies indicative of manipulation or mission drift.
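The second mitigation (input sanitization and output guardrails) can be illustrated with a minimal Python sketch. The pattern lists here are hypothetical placeholders, not a vetted ruleset: a production deployment would combine classifiers, policy engines, and allow-lists rather than a handful of regular expressions.

```python
import re

# Hypothetical injection signatures; real systems need far broader coverage.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal your (system )?prompt", re.IGNORECASE),
]

# Hypothetical leakage signatures: API-key-like tokens and embedded keys.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),
    re.compile(r"-----BEGIN (RSA|EC) PRIVATE KEY-----"),
]

def sanitize_input(text: str) -> str:
    """Reject inputs that look like direct prompt-injection attempts."""
    for pat in INJECTION_PATTERNS:
        if pat.search(text):
            raise ValueError("input rejected: possible prompt injection")
    return text

def filter_output(text: str) -> str:
    """Redact substrings that would constitute data leakage if emitted."""
    for pat in SECRET_PATTERNS:
        text = pat.sub("[REDACTED]", text)
    return text
```

In this sketch the input check fails closed (it raises rather than passing suspicious text through), while the output filter redacts rather than blocks, so a single leaked token does not suppress an otherwise valid response.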