7. AI System Safety, Failures, & Limitations

General Evaluations (Incorrect outputs of GPAI evaluating other AI models)

When an LLM is configured to evaluate the performance of another model or AI system, it may produce incorrect evaluation outputs [122, 147]. For example, it may rate a more verbose answer more highly, or favor answers expressing a particular political stance. If such an LLM-based evaluation is integrated into the training of a new model, the trained model could learn to find and exploit the limitations of the evaluator's metrics.
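One way to probe the verbosity bias described above is to check whether the judge's scores correlate with answer length independently of quality. A minimal sketch follows; the `verbosity_bias` helper and its inputs are illustrative and not part of the repository entry:

```python
def verbosity_bias(scores, answers):
    """Pearson correlation between judge scores and answer word count.

    A strong positive correlation that does not appear in human
    ratings of the same answers suggests the judge is rewarding
    verbosity rather than underlying quality.
    """
    lengths = [len(a.split()) for a in answers]
    n = len(scores)
    mean_s = sum(scores) / n
    mean_l = sum(lengths) / n
    cov = sum((s - mean_s) * (l - mean_l) for s, l in zip(scores, lengths))
    std_s = sum((s - mean_s) ** 2 for s in scores) ** 0.5
    std_l = sum((l - mean_l) ** 2 for l in lengths) ** 0.5
    return cov / (std_s * std_l)
```

In practice the scores would come from the LLM judge over a held-out answer set; a correlation near +1 is a red flag worth investigating before the judge's outputs feed into training.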

Source: MIT AI Risk Repository (risk ID mit1109)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

1 - Pre-deployment

Risk ID

mit1109

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Systematically audit and calibrate the evaluator: Prioritize auditing the LLM judge's reliability against a limited, high-quality, human-annotated 'gold-standard' dataset. Quantify the evaluator's intrinsic biases (e.g., positional, verbosity, or self-enhancement) using metrics such as True Positive Rate (TPR) and True Negative Rate (TNR). Subsequently, apply calibration or correction techniques, such as adjusting observed scores based on the measured bias or employing contrastive training, to ensure the evaluation reflects underlying quality rather than superficial attributes.

2. Enforce robust evaluation protocols and prompt engineering: Implement structural safeguards to mitigate common biases at runtime. This includes swapping the presentation order of candidate responses in pairwise comparisons to neutralize positional bias, and employing detailed, explicit scoring rubrics (e.g., defining criteria for extreme score values) to reduce ambiguity. Furthermore, incorporate Chain-of-Thought (CoT) prompting to compel the evaluator model to generate a transparent, step-by-step rationale before assigning a final score.

3. Utilize multi-agent or reference-guided evaluation architectures: Employ advanced evaluation paradigms to enhance the objectivity and consistency of judgments. This may involve using a Multi-Agent Debate system, where diverse LLM agents critique and reach a synthesized consensus score, or implementing Reference-Guided Judging, which requires the evaluator to use external, verifiable ground truth or reference answers to ground its assessment.
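The order-swapping safeguard in strategy 2 can be sketched as follows. Here `judge_fn` is a hypothetical stand-in for a real LLM judge call that returns "A" or "B"; the interface is an assumption, not part of the repository entry:

```python
def pairwise_verdict(prompt, answer_1, answer_2, judge_fn):
    """Query the judge twice with the candidate order swapped.

    judge_fn(prompt, shown_as_a, shown_as_b) is assumed to return
    "A" or "B". Returns 1 if answer_1 wins under both orderings,
    2 if answer_2 wins under both, and 0 otherwise. An inconsistent
    pair of verdicts indicates positional bias rather than a real
    quality difference, so it is treated as a tie.
    """
    first = judge_fn(prompt, answer_1, answer_2)   # answer_1 shown in slot A
    second = judge_fn(prompt, answer_2, answer_1)  # answer_1 shown in slot B
    if first == "A" and second == "B":
        return 1
    if first == "B" and second == "A":
        return 2
    return 0
```

A judge that always picks slot A, for instance, never produces a consistent winner here, so its positional bias is neutralized instead of silently contaminating the scores.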