Natural Language Underspecifies Goals
For LLM agents, both the goal and environment observations are typically specified in the prompt through natural language. While natural language offers a richer and more natural means of specifying goals than alternatives such as hand-engineered objective functions, it still suffers from underspecification (Grice, 1975; Piantadosi et al., 2012). Furthermore, in practice, users may neglect to fully specify their goals, especially information pertaining to elements of the environment that ought not to be changed (the classic frame problem (Shanahan, 2016)). Such underspecification (D'Amour et al., 2020), if not accounted for, can result in negative side-effects (Amodei et al., 2016), i.e., the agent succeeding at the given task while also changing the environment in undesirable ways.
ENTITY: 3 - Other
INTENT: 2 - Unintentional
TIMING: 1 - Pre-deployment
Risk ID: mit1481
Domain lineage: 7. AI System Safety, Failures, & Limitations > 7.1 AI pursuing its own goals in conflict with human goals or values
Mitigation strategy:
1. Implement a proactive interaction framework to detect underspecified instructions and prompt the user for clarification before task execution, significantly improving performance and reducing reliance on the agent to infer missing requirements.
2. Establish formal safety specifications and behavioral guardrails within the agent architecture to prevent undesirable side-effects or environment changes (addressing the classic frame problem), providing a fail-safe against goal underspecification.
3. Employ a systematic requirements management process during development, utilizing requirements-aware prompt optimization to ensure all critical constraints and desired behaviors are explicitly specified, thus reducing the initial likelihood of underspecified inputs reaching the agent.
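The first mitigation, detecting underspecified instructions before execution, can be sketched in code. The snippet below is a minimal, hypothetical illustration (not an implementation of any named framework): it checks an instruction against a set of required specification "slots" (what to act on, how much to change, what to preserve) and surfaces clarification questions instead of silently inferring the missing pieces. In practice an LLM-based classifier would replace the keyword cues used here.

```python
# Hypothetical clarification-before-execution check. The slot names and cue
# words are illustrative assumptions, not part of any specific framework.
REQUIRED_SLOTS = {
    "target": ["file", "directory", "table", "page"],       # what to act on
    "scope": ["only", "all", "every", "just"],              # how much to change
    "preserve": ["keep", "preserve", "leave", "don't touch"],  # frame constraints
}

def find_underspecified_slots(instruction: str) -> list[str]:
    """Return the slot names for which no cue word appears in the instruction."""
    text = instruction.lower()
    return [slot for slot, cues in REQUIRED_SLOTS.items()
            if not any(cue in text for cue in cues)]

def clarify_or_execute(instruction: str) -> str:
    """Ask the user about missing slots instead of acting on guesses."""
    missing = find_underspecified_slots(instruction)
    if missing:
        # Proactive interaction: clarify before any environment change.
        return "CLARIFY: please specify " + ", ".join(missing)
    return "EXECUTE: " + instruction

print(clarify_or_execute("rename the report"))
print(clarify_or_execute("rename only the report file and keep its contents"))
```

The design choice worth noting is that the check runs pre-execution: an instruction with missing constraints never reaches the action stage, which is what makes this a guard against negative side-effects rather than a post-hoc repair.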