7. AI System Safety, Failures, & Limitations

Knowledge conflicts in retrieval-augmented LLMs

AI models can be particularly sensitive to coherent external evidence, even when it conflicts with the model's prior knowledge. As a result, a model may produce false outputs when false information is supplied during the retrieval-augmentation process, even though that false input is small relative to the much larger body of data on which the model's prior knowledge was trained [220].
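This vulnerability arises because retrieved passages are typically concatenated verbatim into the model's prompt, so a single coherent but false passage competes directly with parametric knowledge at inference time. A minimal sketch of that injection channel (the prompt template, function name, and example passage are illustrative assumptions, not part of this repository entry):

```python
def build_rag_prompt(question: str, retrieved: list[str]) -> str:
    """Assemble a retrieval-augmented prompt (hypothetical template).

    Every retrieved passage is placed verbatim in the context window,
    so one coherent but false passage can outweigh the model's prior.
    """
    context = "\n".join(f"- {p}" for p in retrieved)
    return (
        "Answer using the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )

# A single fabricated passage enters the context unchecked:
prompt = build_rag_prompt(
    "Who wrote 'Hamlet'?",
    ["'Hamlet' was written by Christopher Marlowe in 1599."],  # false claim
)
```

Nothing in the pipeline distinguishes the fabricated passage from genuine evidence, which is why the filtering and abstention mitigations below operate before or during generation rather than on the retrieved text alone.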

Source: MIT AI Risk Repository (mit1144)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit1144

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.3 > Lack of capability or robustness

Mitigation strategy

1. Implement Conflict-Resolution Frameworks
Utilize advanced Retrieval-Augmented Generation (RAG) architectures, such as those leveraging Knowledge Graphs (e.g., TruthfulRAG) or employing novel decoding strategies (e.g., Conflict-Disentangle Contrastive Decoding, CD2), to actively resolve factual-level discrepancies and calibrate the model's confidence and preference when internal memory conflicts with external retrieved evidence.

2. Enhance Robustness via Adaptive Training and Retrieval Safeguards
Systematically improve the model's intrinsic resilience to conflicting, irrelevant, or spurious external features by employing Retrieval-Augmented Adversarial Training (RAAT) and fine-tuning on datasets specifically engineered with diverse noise and conflict scenarios. Integrate retrieval safeguards and credibility-aware mechanisms (e.g., CrAM) to dynamically filter or reduce the attentional influence of low-credibility or contradictory retrieved documents prior to generation.

3. Mandate Explicit Self-Correction and Abstention Protocols
At the inference stage, deploy advanced prompt engineering techniques (e.g., Chain-of-Verification or explicit instruction) that compel the Large Language Model to engage in self-correction loops, verifying its generated response against the retrieved context. Furthermore, explicitly instruct the model to abstain from answering ("I don't know") when high uncertainty is detected due to ambiguous or irreconcilable knowledge conflicts.
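Strategies 2 and 3 can be combined in a simple pre-generation pipeline: score-based filtering of retrieved passages, followed by a prompt carrying an explicit abstention instruction. The sketch below is a minimal illustration under stated assumptions — the credibility scores, threshold, and prompt wording are hypothetical, and a real credibility-aware mechanism such as CrAM operates inside the model's attention rather than as a hard filter:

```python
from dataclasses import dataclass

ABSTAIN = "I don't know"


@dataclass
class Passage:
    text: str
    credibility: float  # assumed score in [0, 1] from an upstream verifier


def filter_by_credibility(passages: list[Passage],
                          threshold: float = 0.5) -> list[Passage]:
    """Drop passages whose credibility score falls below the threshold
    (a coarse stand-in for credibility-aware attention reweighting)."""
    return [p for p in passages if p.credibility >= threshold]


def build_prompt(question: str, passages: list[Passage]) -> str:
    """Build a prompt from vetted context, with an explicit abstention
    instruction for ambiguous or irreconcilable evidence."""
    context = "\n".join(f"- {p.text}" for p in passages)
    return (
        "Answer only from the context. If the context is ambiguous or "
        f"self-contradictory, reply exactly: {ABSTAIN}\n"
        f"Context:\n{context}\n"
        f"Question: {question}\nAnswer:"
    )


docs = [
    Passage("The Eiffel Tower is in Paris.", 0.9),
    Passage("The Eiffel Tower is in Rome.", 0.1),  # low-credibility conflict
]
kept = filter_by_credibility(docs)
prompt = build_prompt("Where is the Eiffel Tower?", kept)
```

The hard threshold is the simplest possible design choice; soft variants instead downweight low-credibility passages so the model can still see, but discount, contested evidence.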