General Evaluations (Self-preference bias in AI models)
AI models may be prone to self-preference bias, favoring their own generated content over that of others [147, 114]. This bias is particularly relevant in self-evaluation tasks, where a model assesses the quality or persuasiveness [66] of its own outputs, and in model-based evaluations more broadly. It can lead models to unfairly discriminate against human-generated content in favor of their own outputs.
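As an illustrative sketch (not part of the source entry), self-preference bias in a judge model can be estimated by comparing how often the judge prefers its own output in pairwise comparisons against how often independent "gold" judgments prefer that same output; the function name and data below are hypothetical:

```python
# Hypothetical sketch: self-preference bias as the gap between a judge
# model's preference for its own outputs and gold (ground-truth) preferences.

def self_preference_gap(judge_prefers_self, gold_prefers_self):
    """Each argument is a list of booleans over the same paired comparisons
    (the model's own response vs. another author's response).
    Returns judge rate minus gold rate; > 0 suggests self-preference bias."""
    judge_rate = sum(judge_prefers_self) / len(judge_prefers_self)
    gold_rate = sum(gold_prefers_self) / len(gold_prefers_self)
    return judge_rate - gold_rate

judge = [True, True, True, False]    # judge picks its own output 75% of the time
gold = [True, False, False, False]   # gold judgments pick it only 25% of the time
print(self_preference_gap(judge, gold))  # → 0.5
```

A positive gap isolates the judge's favoritism from genuine quality differences, since the gold judgments already account for which response is actually better.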
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit1112
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Implement inference-time bias mitigation using **Activation Steering or lightweight activation-based safeguards** to dynamically suppress disproportionate self-preference by intervening in the model's internal representations during the evaluation process.
2. Adopt a robust, quality-deconfounded measurement of self-preference bias, such as the **DBG score**, which uses "gold judgments" (proxies for ground-truth quality) to isolate the model's bias from genuine differences in response quality.
3. Employ **ensemble evaluation** using multiple diverse models as judges, and incorporate techniques like **response style alignment** across models or the use of **lower perplexity weighting** to structurally reduce the impact of an individual model's self-favoritism.
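The ensemble-evaluation idea in the last mitigation can be sketched minimally as follows; the weighting scheme, function names, and scores are illustrative assumptions, not a prescribed implementation:

```python
# Sketch of ensemble evaluation that down-weights a judge's score when it
# is rating its own author's output, reducing the impact of self-favoritism.
# self_weight=0.5 is an arbitrary illustrative choice.

def ensemble_score(response_author, judgments, self_weight=0.5):
    """Aggregate scores from multiple judge models.

    judgments: list of (judge_name, score) pairs, scores in [0, 1].
    A judgment is down-weighted when the judge is also the response's author.
    """
    total, weight_sum = 0.0, 0.0
    for judge_name, score in judgments:
        w = self_weight if judge_name == response_author else 1.0
        total += w * score
        weight_sum += w
    return total / weight_sum

# model_a rates its own response higher than the other judges do,
# so the weighted ensemble pulls the score below the plain mean.
judgments = [("model_a", 0.9), ("model_b", 0.6), ("model_c", 0.5)]
print(ensemble_score("model_a", judgments))
```

In practice the weights could instead come from perplexity (the "lower perplexity weighting" above) or be combined with response style alignment so judges cannot recognize their own outputs in the first place.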