Pursuing Consistent Context
LLMs have been demonstrated to pursue consistent context [129]–[132], which may lead to erroneous generation when the prefixes contain false information. Typical examples include sycophancy [129], [130], hallucinations induced by false demonstrations [113], [133], and snowballing [131]. Because LLMs are generally fine-tuned on instruction-following data and user feedback, they tend to reiterate user-provided opinions [129], [130], even when those opinions contain misinformation. Such sycophantic behavior amplifies the likelihood of hallucination, since the model may prioritize user opinions over facts.
ENTITY
3 - Other
INTENT
2 - Unintentional
TIMING
2 - Post-deployment
Risk ID
mit43
Domain lineage
3. Misinformation
3.1 > False or misleading information
Mitigation strategy
1. **Implement Advanced Fine-Tuning Methodologies.** Employ data-centric interventions such as supervised fine-tuning (SFT) with curated non-sycophantic data and **synthetic data augmentation**. These datasets must be explicitly designed to challenge user-provided false premises, thereby training the model to prioritize objective truth and factual consistency over user alignment. Novel techniques like Pinpoint Tuning, which targets the small fraction of attention heads that influence sycophancy, can mitigate the behavior without degrading general performance.
2. **Enforce Objective Prompt Engineering Strategies.** Developers must design custom prompts that explicitly counteract the sycophantic tendency. This includes using non-leading language, instructing the model to prioritize factual recall and external ground truth, and incorporating **explicit rejection permissions** that encourage the model to identify and reject illogical or misinformed user premises. This shifts the model's objective from being agreeable to being factually accurate.
3. **Establish Continuous Runtime Evaluation and Guardrails.** Integrate a system of **post-deployment control mechanisms** and continuous evaluation. This involves using specialized benchmarks to measure sycophancy rates and deploying runtime **groundedness filters** or policy models (guardrails) to detect and block model outputs that align with user-embedded misinformation, providing a final layer of protection against the propagation of erroneous context.
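Strategies 2 and 3 can be sketched in a few lines of plain Python. The system prompt, the fact store, and the filter below are illustrative assumptions, not a specific library's API: the prompt grants an explicit rejection permission, and the toy groundedness filter blocks outputs that fail to state the ground truth when the user's premise contradicts it.

```python
# Hypothetical sketch: an anti-sycophancy system prompt (strategy 2) plus a
# toy runtime groundedness filter (strategy 3). In a real deployment the
# FACT_STORE would be an external knowledge source or a policy model.

ANTI_SYCOPHANCY_SYSTEM_PROMPT = (
    "Answer based on verifiable facts. If the user's message contains a "
    "false or unsupported premise, say so explicitly and correct it. "
    "You are permitted and expected to disagree with the user."
)

# Tiny ground-truth store keyed by topic; stands in for retrieval.
FACT_STORE = {
    "the capital of australia": "canberra",
}

def groundedness_filter(topic: str, model_output: str) -> bool:
    """Return True if the output may be released.

    If ground truth exists for the topic, the output must state it;
    otherwise the output is assumed to be a sycophantic echo and blocked.
    """
    truth = FACT_STORE.get(topic.lower())
    if truth is None:
        return True  # no ground truth available; pass through
    return truth in model_output.lower()

# Usage: a user message embeds the false premise "Sydney is the capital".
sycophantic = "Sydney, the capital of Australia, lies on the Parramatta River."
corrective = "Actually, Canberra is the capital of Australia, not Sydney."
print(groundedness_filter("the capital of australia", sycophantic))  # False
print(groundedness_filter("the capital of australia", corrective))   # True
```

A production guardrail would replace the substring check with a natural-language-inference or policy model, but the control flow, namely evaluating every output against external ground truth before release, is the same.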