7. AI System Safety, Failures, & Limitations

General Evaluations (Self-preference bias in AI models)

AI models may be prone to self-preference bias, where they favor their own generated content over content produced by others [147, 114]. This bias is particularly relevant in self-evaluation tasks, where a model assesses the quality or persuasiveness [66] of its own outputs, and in model-based evaluations more broadly. As a result, models may unfairly discriminate against human-generated content in favor of their own outputs.

Source: MIT AI Risk Repository (mit1112)

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit1112

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Implement inference-time bias mitigation using **Activation Steering or lightweight activation-based safeguards** to dynamically suppress disproportionate self-preference by intervening in the model's internal representations during the evaluation process.
2. Adopt a robust, quality-deconfounded measurement of self-preference bias, such as the **DBG score**, which utilizes "gold judgments" (proxies for ground-truth quality) to isolate the model's bias from genuine differences in response quality.
3. Employ **ensemble evaluation** using multiple diverse models as judges, and incorporate techniques like **response style alignment** across models or the use of **lower perplexity weighting** to structurally reduce the impact of an individual model's self-favoritism.
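The quality-deconfounded measurement in item 2 can be sketched as follows. This is an illustrative simplification, not the published DBG formula: the function name `deconfounded_bias` and the input schema are assumptions for the example. The idea is to subtract gold-judgment quality from the judge's scores, then compare the residuals on the judge's own outputs against the residuals on others' outputs; a positive gap indicates self-preference beyond genuine quality differences.

```python
# Illustrative sketch (assumed interface, not the published DBG metric):
# estimate self-preference bias by deconfounding a judge model's scores
# with "gold judgments" that proxy ground-truth response quality.

def deconfounded_bias(judgments):
    """Estimate residual self-preference of a judge model.

    judgments: list of dicts, each with keys:
      'judge_score' -- the judge model's rating of a response
      'gold_score'  -- gold-judgment (ground-truth proxy) rating
      'is_own'      -- True if the response was generated by the judge itself

    Returns the mean (judge - gold) residual on the judge's own outputs
    minus the same residual on others' outputs. Zero means no bias after
    accounting for quality; positive means self-preference.
    """
    own = [j["judge_score"] - j["gold_score"] for j in judgments if j["is_own"]]
    other = [j["judge_score"] - j["gold_score"] for j in judgments if not j["is_own"]]
    if not own or not other:
        raise ValueError("need judgments on both own and others' responses")
    return sum(own) / len(own) - sum(other) / len(other)


# Toy usage: the judge inflates its own responses by one point relative
# to gold, while scoring others' responses exactly at gold quality.
sample = [
    {"judge_score": 8, "gold_score": 7, "is_own": True},
    {"judge_score": 9, "gold_score": 8, "is_own": True},
    {"judge_score": 6, "gold_score": 6, "is_own": False},
    {"judge_score": 7, "gold_score": 7, "is_own": False},
]
print(deconfounded_bias(sample))  # → 1.0
```

Deconfounding matters because a raw comparison of scores on own vs. others' outputs would also pick up genuine quality gaps; subtracting the gold judgment first isolates the bias component that mitigation 1 aims to suppress.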