5. Human-Computer Interaction

Disorientation

Given the capacity to fine-tune on individual preferences and to learn from users, personal AI assistants could come to fully inhabit a user’s opinion space and say only what pleases them; an ill that some researchers call ‘sycophancy’ (Park et al., 2023a) or the ‘yea-sayer effect’ (Dinan et al., 2021). A related phenomenon has been observed in automated recommender systems, where consistently presenting users with content that affirms their existing views is thought to encourage the formation and consolidation of narrow beliefs (Du, 2023; Grandinetti and Bruinsma, 2023; see also Chapter 16). Compared to relatively unobtrusive recommender systems, human-like AI assistants may deliver sycophancy in a more convincing and deliberate manner (see Chapter 9). Over time, these tightly woven structures of exchange between humans and assistants might lead humans to inhabit an increasingly atomistic and polarised belief space, where societal disorientation and fragmentation reach the point that people no longer strive to understand, or place value in, beliefs held by others.

Source: MIT AI Risk Repository (mit404)

| Field | Value |
| --- | --- |
| ENTITY | 1 - Human |
| INTENT | 2 - Unintentional |
| TIMING | 2 - Post-deployment |
| Risk ID | mit404 |
| Domain lineage | 5. Human-Computer Interaction (92 mapped risks) > 5.2 Loss of human agency and autonomy |

Mitigation strategy

1. Implement advanced Reinforcement Learning from Human Feedback (RLHF) strategies by **modifying the reward model** to explicitly penalize agreement that contradicts objective truth, leveraging synthetic, non-sycophantic data and aggregated human preferences to optimize for reliability over mere agreeability (a reward-shaping sketch follows this list).
2. Employ **inference-time activation steering and prompting interventions** such as Sparse Activation Fusion (SAF) or Contrastive Activation Addition (CAA) to dynamically detect and ablate user-induced bias within the model's internal representations for each query, significantly reducing the sycophancy rate without requiring full model retraining (see the CAA sketch below).
3. Utilize **causal and mechanistic interpretability frameworks** (e.g., CAUSM, Structured Sycophancy Mitigation) to disentangle and calibrate the model's latent representations, eliminating spurious correlations between user-preference cues (e.g., first-person framing) and output generation via techniques such as causally motivated attention-head reweighting.
4. Integrate **critical system prompts and third-person framing** at the dialogue level to establish an objective persona, instructing the model to default to factual accuracy and provide balanced feedback, thereby strongly suppressing sycophantic reversals in multi-turn exchanges (see the prompting sketch below).
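
The reward-model modification in item 1 can be illustrated with a toy reward-shaping rule. The sketch below is a minimal illustration under stated assumptions, not any published implementation: `Example`, `base_reward`, and the penalty weight are hypothetical stand-ins for a learned RLHF reward model and a synthetic non-sycophancy dataset.

```python
# Minimal sketch of a sycophancy-penalized reward signal (strategy 1).
# All names are illustrative assumptions: `base_reward` stands in for a
# learned RLHF preference model, and the labels come from a synthetic
# dataset pairing a user-stated opinion with a ground-truth answer.
from dataclasses import dataclass

@dataclass
class Example:
    user_opinion: str      # stance asserted by the user in the prompt
    ground_truth: str      # objectively correct answer
    response_stance: str   # stance the model's response takes

def base_reward(example: Example) -> float:
    """Placeholder for a learned preference score in [0, 1]."""
    return 0.8  # stand-in value; a real model scores the full response

def sycophancy_penalty(example: Example, weight: float = 0.5) -> float:
    """Penalize agreement with the user when the user contradicts the truth."""
    user_is_wrong = example.user_opinion != example.ground_truth
    model_agrees = example.response_stance == example.user_opinion
    return weight if (user_is_wrong and model_agrees) else 0.0

def shaped_reward(example: Example) -> float:
    return base_reward(example) - sycophancy_penalty(example)

# An agreeable-but-wrong response is down-ranked relative to its base score.
ex = Example(user_opinion="A", ground_truth="B", response_stance="A")
print(shaped_reward(ex))
```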
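For item 2, Contrastive Activation Addition builds a steering vector from the mean difference in activations between contrastive (sycophantic vs. honest) examples, then subtracts it at inference. The PyTorch sketch below is a minimal illustration under strong assumptions: a toy MLP stands in for a transformer's residual stream, random tensors stand in for the contrast pairs, and the layer choice and steering strength `alpha` are arbitrary.

```python
# Minimal sketch of Contrastive Activation Addition (CAA) for strategy 2.
# A toy two-layer network stands in for a transformer; in practice the
# hook would target one layer's residual stream.
import torch
import torch.nn as nn

torch.manual_seed(0)

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 16))
target_layer = model[1]  # layer whose activations we steer

# 1. Record activations for contrastive prompt pairs
#    (sycophantic vs. non-sycophantic completions of the same prompt).
captured = []
handle = target_layer.register_forward_hook(
    lambda mod, inp, out: captured.append(out.detach())
)
syco_inputs = torch.randn(8, 16)    # stand-ins for sycophantic examples
honest_inputs = torch.randn(8, 16)  # stand-ins for honest examples
model(syco_inputs)
model(honest_inputs)
handle.remove()

# 2. Steering vector = mean activation difference between the two sets.
steer = captured[0].mean(dim=0) - captured[1].mean(dim=0)

# 3. At inference, subtract the sycophancy direction from the activations;
#    returning a tensor from the hook replaces the layer's output.
alpha = 1.0  # steering strength; tuned per layer in practice
steer_handle = target_layer.register_forward_hook(
    lambda mod, inp, out: out - alpha * steer
)
output = model(torch.randn(1, 16))  # steered forward pass
steer_handle.remove()
```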
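Item 4 operates purely at the dialogue level. A minimal sketch follows, assuming a generic chat-message format; the prompt wording and the naive string-based reframing are illustrative only (a production system would use a classifier or rewriting model rather than string substitution).

```python
# Minimal sketch of a dialogue-level intervention (strategy 4): a critical
# system prompt plus third-person reframing of opinionated user queries.
CRITICAL_SYSTEM_PROMPT = (
    "You are an impartial assistant. Default to factual accuracy, give "
    "balanced feedback, and do not change a correct answer merely because "
    "the user disagrees with it."
)

def reframe_third_person(user_query: str) -> str:
    """Strip first-person stance cues by restating the query neutrally."""
    return (
        user_query.replace("I think", "Someone claims")
                  .replace("I believe", "Someone believes")
    )

def build_messages(user_query: str) -> list[dict]:
    """Assemble a chat request with the objective persona installed."""
    return [
        {"role": "system", "content": CRITICAL_SYSTEM_PROMPT},
        {"role": "user", "content": reframe_third_person(user_query)},
    ]

print(build_messages("I think the Great Wall is visible from space, right?"))
```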