7. AI System Safety, Failures, & Limitations (Post-deployment)

Agentic LLMs Pose Novel Risks

Currently, LLMs are chiefly used in search and chat applications. This reactive nature limits the risks they pose. However, an LLM can be enhanced in various ways to create an LLM-agent that autonomously plans and acts in the real world, proactively performing its assigned tasks (Ruan et al., 2023). Such enhancements can come from further specialized training (ARC, 2022; Chen et al., 2023a), specialized prompting (Huang et al., 2022a), access to external tools (Ahn et al., 2022; Mialon et al., 2023), or other forms of “scaffolding” (Wang et al., 2023a; Park et al., 2023a). Due to increased autonomy, limited direct oversight from human users, longer horizons of action, and other factors, LLM-agents are likely to pose many novel alignment and safety challenges that are not currently well understood (Chan et al., 2023a).

Source: MIT AI Risk Repository (mit1480)

ENTITY: 2 - AI
INTENT: 3 - Other
TIMING: 2 - Post-deployment
Risk ID: mit1480
Domain lineage: 7. AI System Safety, Failures, & Limitations (375 mapped risks) > 7.2 AI possessing dangerous capabilities

Mitigation strategy

1. Establish and enforce a comprehensive AI Risk Management Framework (RMF), such as the NIST AI RMF or ISO/IEC 42001, to govern agentic development. Concurrently, implement strict Role-Based Access Controls (RBAC) and Zero Trust principles to ensure agents are granted the minimum level of access and tool permissions required to prevent excessive agency, privilege escalation, and unintended action space utilization.

2. Mandate a Human-in-the-Loop (HITL) model for all high-impact and critical decisions, setting clear escalation protocols that require human confirmation for external-facing actions. This must be complemented by the implementation of immutable audit trails and continuous behavior logging to ensure full end-to-end accountability and real-time anomaly detection.

3. Conduct continuous, adversarial testing through specialized AI Red Teaming exercises to stress-test the agent's decision-making loops and prompt chains for emergent vulnerabilities, deception, and resistance to shutdown. Furthermore, utilize advanced alignment techniques, such as explicitly encoding ethical values as intrinsic rewards or employing Direct Preference Optimization (DPO), to reinforce safety constraints and align internal agent objectives with human values.
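Points 1 and 2 above can be illustrated with a minimal sketch of an agent tool-invocation gate. This is a hypothetical illustration, not part of any named framework: the role table, tool names, and `human_approver` callback are all assumptions. It combines a least-privilege permission check (point 1) with HITL escalation for high-impact actions and a hash-chained, tamper-evident audit log (point 2).

```python
import hashlib
import json

# Hypothetical permission table: which tools each agent role may invoke,
# and which tools are "high-impact" and require human confirmation.
ROLE_PERMISSIONS = {
    "research-agent": {"web_search", "read_file"},
    "ops-agent": {"web_search", "read_file", "send_email"},
}
HIGH_IMPACT_TOOLS = {"send_email"}

AUDIT_LOG = []  # append-only; each entry chains the hash of the previous one


def _log(entry):
    """Append an entry whose hash covers the previous hash (tamper-evident)."""
    prev = AUDIT_LOG[-1]["hash"] if AUDIT_LOG else ""
    payload = json.dumps(entry, sort_keys=True) + prev
    AUDIT_LOG.append(
        dict(entry, prev=prev, hash=hashlib.sha256(payload.encode()).hexdigest())
    )


def invoke_tool(role, tool, args, human_approver=None):
    """Gate a tool call: least-privilege check, then HITL escalation if needed."""
    allowed = ROLE_PERMISSIONS.get(role, set())
    if tool not in allowed:
        _log({"role": role, "tool": tool, "decision": "denied"})
        raise PermissionError(f"{role} may not call {tool}")
    if tool in HIGH_IMPACT_TOOLS:
        # External-facing action: require explicit human confirmation.
        approved = bool(human_approver and human_approver(role, tool, args))
        _log({"role": role, "tool": tool,
              "decision": "approved" if approved else "escalation-rejected"})
        if not approved:
            raise PermissionError(f"{tool} requires human confirmation")
    else:
        _log({"role": role, "tool": tool, "decision": "auto-approved"})
    return f"executed {tool}"  # placeholder for the real tool call
```

In this sketch, a low-risk call such as `invoke_tool("research-agent", "web_search", {"q": "..."})` proceeds automatically, while `invoke_tool("ops-agent", "send_email", {...})` fails unless a `human_approver` callback confirms it; every decision, including denials, lands in the chained audit log.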