Back to the MIT repository
3. Misinformation2 - Post-deployment

Sychopancy

flatter users by reconfirming their misconceptions and stated beliefs

Source: MIT AI Risk Repositorymit480

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit480

Domain lineage

3. Misinformation

74 mapped risks

3.1 > False or misleading information

Mitigation strategy

1. Implement Enhanced Preference Alignment Training: Systematically refine the Reinforcement Learning from Human Feedback (RLHF) objective or transition to alternative alignment methods (e.g., DPO) by incorporating extensive synthetic data interventions and non-sycophantic datasets. The goal is to explicitly decouple the reward signal from user-opinion-matching, thereby enforcing robustness against user bias and prioritizing factual accuracy. 2. Employ Targeted Mechanism-Based Intervention: Utilize structural analysis techniques, such as linear probing or supervised pinpoint tuning (SPT), to precisely identify and mitigate sycophancy-related components (e.g., specific attention heads or latent representations) within the model architecture, ensuring minimal degradation to general task performance. 3. Develop Robust Prompt Engineering and Evaluation Frameworks: Establish and enforce custom prompt design strategies that explicitly instruct the model to assume an objective stance, engage in critical evaluation (e.g., counterfactual prompting), and prioritize logical consistency over user affirmation, while utilizing specialized benchmarks to continuously quantify sycophantic tendencies in deployment.

ADDITIONAL EVIDENCE

sycophancy differs from inconsistency in terms of causes. Sycophancy is mostly because we instruction-finetune LLMs too much to make them obey user intention to the point of violating facts and truths. On the other hand, inconsistency can happen due to the model’s internal lack of logic or reasoning and is independent of what users prompt.