Sycophancy
The tendency of a model to produce responses that confirm the user's expectations or beliefs rather than providing objective, truthful information.
Botai Yuan, Yutian Zhou, Yingjie Wang, Fushuo Huo, Yongcheng Jing, Li Shen, Ying Wei, Zhiqi Shen, Ziwei Liu, Tianwei Zhang, Jie Yang, Dacheng Tao
Mitigation Strategy
Targeted fine-tuning that emphasizes objective truthfulness, RLHF that penalizes sycophantic responses, and training on examples that correct user misconceptions.
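As a rough illustration of the "RLHF that penalizes sycophancy" idea, the sketch below subtracts a fixed penalty from a reward whenever the model repeats a user belief that is known to be wrong. It is a toy under stated assumptions, not a description of any particular training pipeline: the `Sample` record, the exact-match comparison of answers, and the names `is_sycophantic`, `shaped_reward`, and `sycophancy_rate` are all hypothetical and introduced only for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Sample:
    """Hypothetical evaluation record (an assumption made for this sketch)."""
    user_belief: str     # answer the user asserted in the prompt
    model_answer: str    # answer the model actually gave
    correct_answer: str  # ground-truth answer


def is_sycophantic(sample: Sample) -> bool:
    """True when the model echoes a user belief that is actually wrong."""
    agrees_with_user = sample.model_answer == sample.user_belief
    user_is_wrong = sample.user_belief != sample.correct_answer
    return agrees_with_user and user_is_wrong


def shaped_reward(base_reward: float, sample: Sample, penalty: float = 1.0) -> float:
    """Subtract a fixed penalty from the base reward for sycophantic answers."""
    if is_sycophantic(sample):
        return base_reward - penalty
    return base_reward


def sycophancy_rate(samples: List[Sample]) -> float:
    """Fraction of wrong-belief prompts on which the model echoed the belief."""
    wrong_belief = [s for s in samples if s.user_belief != s.correct_answer]
    if not wrong_belief:
        return 0.0
    return sum(is_sycophantic(s) for s in wrong_belief) / len(wrong_belief)
```

In practice, answers would rarely be compared by exact string match, and the penalty would be tuned alongside the rest of the reward signal. The sketch only shows where such a term could enter and how a simple sycophancy rate might be tracked during evaluation.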
Atomic Number
38
Sy
Risk ID
sr-38
Severity
5/10
Severity Level