Benchmark Inaccuracy (Benchmarks may not accurately evaluate capabilities)
Benchmarks of AI systems can both underestimate and overestimate the capabilities of those systems. Underestimates can occur if an evaluation is not comprehensive enough, if the benchmark is saturated by existing models, or if the capabilities in question depend on a complicated setup, such as realistic computer programming tasks. Overestimates can occur if an AI system is trained or fine-tuned on the contents of the benchmark, leading to overfitting.
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit1122
Domain lineage
6. Socioeconomic and Environmental
6.5 > Governance failure
Mitigation strategy
1. Implement rigorous separation of training, validation, and test datasets through methods such as k-fold cross-validation or dedicated hold-out evaluation to systematically detect and mitigate benchmark overfitting and data contamination, thereby ensuring generalization error is accurately estimated.
2. Establish dynamic and automated benchmark generation frameworks to continuously refresh the evaluation suite, prevent benchmark saturation, and ensure comprehensive, controllable coverage of the targeted capability space, including complex or realistic scenarios.
3. Mandate the establishment of independent oversight and governance mechanisms for the entire benchmark determination and administration process to ensure impartiality, methodological rigor, and adherence to ethical and procedural standards, thereby promoting stakeholder confidence in the evaluation results.
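The contamination check in strategy 1 can be sketched as a word-level n-gram overlap test between a training corpus and the benchmark items. This is a minimal illustrative sketch, not a standard tool: the function names, the choice of n = 8, and the word-level tokenization are all assumptions; production contamination checks typically use more robust tokenization and fuzzy matching.

```python
# Hypothetical sketch of n-gram-based contamination detection.
# All names and the n=8 threshold are illustrative assumptions.

def ngrams(text, n=8):
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(train_texts, benchmark_items, n=8):
    """Fraction of benchmark items sharing any n-gram with the training data."""
    train_ngrams = set()
    for t in train_texts:
        train_ngrams |= ngrams(t, n)
    flagged = sum(1 for item in benchmark_items
                  if ngrams(item, n) & train_ngrams)
    return flagged / len(benchmark_items) if benchmark_items else 0.0
```

A benchmark item that reproduces a long span of training text verbatim is flagged, while items with no shared 8-gram are not; a nonzero rate signals that reported scores may overestimate true capability.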