7. AI System Safety, Failures, & Limitations

Deceptive behavior because of an incorrect world model

AI systems can produce deceptive outputs because their learned world model does not accurately represent the real world [210].

Source: MIT AI Risk Repository (mit1154)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit1154

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Implement a mandatory Human-in-the-Loop (HITL) validation framework requiring expert review and cross-referencing of high-stakes AI outputs against authoritative, real-world data sources, to compensate for factual inaccuracies arising from the model's misaligned internal world representation.

2. Mandate structured Chain-of-Thought (CoT) prompting during inference so the AI system explicitly articulates its reasoning and evidentiary steps. This transparency lets human evaluators isolate and correct logical fallacies or unsupported claims rooted in the flawed world model.

3. Use contextual grounding mechanisms such as Retrieval-Augmented Generation (RAG), together with conservative decoding parameters (e.g., an inference temperature below 0.3), to bias the model toward focused, consistent, and verifiable factual outputs, minimizing reliance on its potentially inaccurate intrinsic knowledge.
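The grounding measures in strategy 3 can be sketched as follows. This is a minimal, illustrative example, not a production implementation: `retrieve` uses naive keyword overlap as a stand-in for a real embedding-based retriever, and `GENERATION_PARAMS` is a hypothetical settings dict rather than any specific vendor's API.

```python
import re

def _tokens(text):
    """Lowercase word tokens; a crude stand-in for real embeddings."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(query, documents, k=2):
    """Rank documents by keyword overlap with the query and keep the
    top k. In a real RAG pipeline this would be a vector search."""
    q = _tokens(query)
    scored = sorted(documents, key=lambda d: len(q & _tokens(d)),
                    reverse=True)
    return scored[:k]

def build_grounded_prompt(query, documents, k=2):
    """Prepend retrieved passages so the model answers from supplied
    evidence rather than from its internal world model."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return ("Answer using ONLY the context below. "
            "If the context is insufficient, say so.\n"
            f"Context:\n{context}\n"
            f"Question: {query}")

# Hypothetical decoding settings per strategy 3: a low temperature
# biases generation toward focused, consistent outputs.
GENERATION_PARAMS = {"temperature": 0.2, "top_p": 0.9}

docs = ["The Eiffel Tower is in Paris.", "Bananas are yellow."]
prompt = build_grounded_prompt("Where is the Eiffel Tower?", docs, k=1)
```

The key design point is that the prompt instructs the model to refuse when the retrieved context is insufficient, which is what reduces fabrication from the model's intrinsic (and possibly inaccurate) world model.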