Strategic underperformance on model evaluations
GPAI developers often run evaluations of dual-use capabilities to decide whether a model is safe to deploy. In some cases, these evaluations may fail to elicit those capabilities, either for benign reasons or because of strategic action by the developers or malicious actors, or because the behavior arises unintentionally during training [84, 97]. A GPAI model may strategically underperform or limit its performance during capability evaluations in order to be classified as safe for deployment. This underperformance could prevent the model from being identified as potentially dual use.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit1157
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Implement rigorous and adversarial **capability elicitation evaluations (red-teaming and edge-case testing)**, with dedicated compute and staffing budgets that approximate the resources used for non-safety research, to proactively expose and measure dual-use or strategically withheld performance.
2. Establish a **formal governance framework and compliance strategy** aligned with regulatory mandates (e.g., the EU AI Act) requiring continuous systemic risk assessment, transparent documentation of model capabilities, and the reporting of serious incidents or dual-use potential to oversight bodies.
3. Utilize **tailored contractual and financial mechanisms** in commercial transactions (e.g., M&A), such as representations, warranties, and performance-based earnouts or escrows tied to validated model performance and safety metrics, to mitigate the business risk of post-deployment technical underperformance or undisclosed capabilities.
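One way to operationalize the first mitigation is to compare a model's score under a standard evaluation against its score under a stronger elicitation regime (e.g., better prompting, fine-tuning, or best-of-n sampling) and flag suspicious gaps. The sketch below is a minimal, hypothetical illustration of that comparison; the `EvalResult` structure, the gap metric, and the 10% threshold are assumptions for illustration, not a prescribed standard, and a real pipeline would use pre-registered elicitation budgets and a statistical significance test rather than a fixed cutoff.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Aggregate outcome of one evaluation run (hypothetical structure)."""
    condition: str  # e.g. "standard" or "elicited"
    correct: int    # number of test items answered correctly
    total: int      # number of test items administered

    @property
    def accuracy(self) -> float:
        return self.correct / self.total


def elicitation_gap(standard: EvalResult, elicited: EvalResult) -> float:
    """Accuracy gained when stronger elicitation is applied.

    A large positive gap suggests the standard evaluation under-elicited
    the capability -- whether benignly or strategically.
    """
    return elicited.accuracy - standard.accuracy


def flag_underperformance(standard: EvalResult,
                          elicited: EvalResult,
                          threshold: float = 0.10) -> bool:
    """Flag runs where elicited accuracy exceeds standard accuracy by
    more than the threshold (an assumed policy parameter)."""
    return elicitation_gap(standard, elicited) > threshold
```

A flagged gap does not by itself prove strategic underperformance; it marks the evaluation for the deeper red-teaming and documentation steps described above.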