Natural Language Underspecifies Goals
For LLM agents, both the goal and environment observations are typically specified in the prompt through natural language. While natural language offers a richer and more natural means of specifying goals than alternatives such as hand-engineered objective functions, it still suffers from underspecification (Grice, 1975; Piantadosi et al., 2012). Furthermore, in practice, users may neglect to fully specify their goals, especially information pertaining to elements of the environment that ought not to be changed (the classic frame problem (Shanahan, 2016)). Such underspecification (D'Amour et al., 2020), if not accounted for, can result in negative side-effects (Amodei et al., 2016), i.e., the agent succeeding at the given task while also changing the environment in undesirable ways.
ENTITY: 3 - Other
INTENT: 2 - Unintentional
TIMING: 1 - Pre-deployment
Risk ID: mit1481
Domain lineage: 7. AI System Safety, Failures, & Limitations > 7.1 AI pursuing its own goals in conflict with human goals or values
Mitigation strategy:
1. Implement a proactive interaction framework to detect underspecified instructions and prompt the user for clarification before task execution, significantly improving performance and reducing reliance on the agent to infer missing requirements.
2. Establish formal safety specifications and behavioral guardrails within the agent architecture to prevent undesirable side-effects or environment changes (addressing the classic frame problem), providing a fail-safe against goal underspecification.
3. Employ a systematic requirements management process during development, utilizing requirements-aware prompt optimization to ensure all critical constraints and desired behaviors are explicitly specified, thus reducing the initial likelihood of underspecified inputs reaching the agent.
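The first mitigation, detecting underspecified instructions before execution, can be sketched in code. The snippet below is a minimal, hypothetical illustration (not an implementation of any named framework): it checks an instruction against a set of required specification "slots" (what to act on, how much to change, what to preserve) and surfaces clarification questions instead of silently inferring the missing pieces. In practice an LLM-based classifier would replace the keyword cues used here.

```python
# Hypothetical clarification-before-execution check. The slot names and cue
# words are illustrative assumptions, not part of any specific framework.
REQUIRED_SLOTS = {
    "target": ["file", "directory", "table", "page"],       # what to act on
    "scope": ["only", "all", "every", "just"],              # how much to change
    "preserve": ["keep", "preserve", "leave", "don't touch"],  # frame constraints
}

def find_underspecified_slots(instruction: str) -> list[str]:
    """Return the slot names for which no cue word appears in the instruction."""
    text = instruction.lower()
    return [slot for slot, cues in REQUIRED_SLOTS.items()
            if not any(cue in text for cue in cues)]

def clarify_or_execute(instruction: str) -> str:
    """Ask the user about missing slots instead of acting on guesses."""
    missing = find_underspecified_slots(instruction)
    if missing:
        # Proactive interaction: clarify before any environment change.
        return "CLARIFY: please specify " + ", ".join(missing)
    return "EXECUTE: " + instruction

print(clarify_or_execute("rename the report"))
print(clarify_or_execute("rename only the report file and keep its contents"))
```

The design choice worth noting is that the check runs pre-execution: an instruction with missing constraints never reaches the action stage, which is what makes this a guard against negative side-effects rather than a post-hoc repair.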