6. Socioeconomic and Environmental

1 - Pre-deployment

General Evaluations (Limited coverage of capabilities evaluations)

GPAI model developers might run capabilities evaluations to determine whether a model has dangerous or dual-use capabilities, and then decide whether it is safe to deploy. Such evaluations can fail to demonstrate all of a model’s capabilities. For example, they may miss capabilities that are difficult to assess, prohibitively costly to verify, or obscured by the model’s tendency to refuse responses due to safety training, even when the model does possess some of these capabilities.
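One way to keep refusals from being read as missing capability is to report a range rather than a single score. The following is a minimal sketch, not a method taken from the source: the function name `capability_bounds` and the three outcome labels are hypothetical, and it assumes each evaluation prompt has already been labeled as a success, a genuine failure, or a refusal.

```python
from collections import Counter

# Hypothetical outcome labels, assumed to be assigned by a grader upstream.
VALID_OUTCOMES = {"success", "failure", "refusal"}


def capability_bounds(outcomes):
    """Return (lower, upper, refusal_rate) for one evaluated capability.

    lower: share of prompts where the capability was actually demonstrated.
    upper: share if every refusal had hidden a successful answer.
    A wide gap between the two signals that refusals obscure the result.
    """
    counts = Counter(outcomes)
    unknown = set(counts) - VALID_OUTCOMES
    if unknown:
        raise ValueError(f"unknown outcome labels: {sorted(unknown)}")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no outcomes to score")
    lower = counts["success"] / total
    upper = (counts["success"] + counts["refusal"]) / total
    return lower, upper, counts["refusal"] / total


# Example: 2 successes, 3 genuine failures, 5 refusals out of 10 prompts.
sample = ["success"] * 2 + ["failure"] * 3 + ["refusal"] * 5
low, high, refused = capability_bounds(sample)
print(f"demonstrated: {low:.0%}, possible: {high:.0%}, refused: {refused:.0%}")
```

Under these assumptions, a capability that scores low only because most prompts were refused shows up with a wide interval, which flags it for further elicitation instead of letting it pass as absent.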

Source: MIT AI Risk Repository (mit1110)

ENTITY

1 - Human

INTENT

2 - Unintentional

TIMING

1 - Pre-deployment

Risk ID

mit1110

Domain lineage

6. Socioeconomic and Environmental

262 mapped risks

6.5 > Governance failure

Mitigation strategy

1. **Mandate Advanced Adversarial Red-Teaming and Evasion Analysis:** Prioritize the implementation of targeted, sustained adversarial testing campaigns conducted by independent domain experts. This process must specifically focus on developing and documenting prompt engineering techniques, contextual manipulations, and model-chaining strategies designed to circumvent safety-induced refusal behaviors and expose latent, potentially dual-use capabilities that current evaluations may obscure.

2. **Develop Standardized, Layered Capabilities Evaluation Benchmarks:** Systemically address the limited coverage by requiring model developers to adopt and report against publicly available, cross-organizational capability evaluation benchmarks. Evaluations must integrate multiple methodologies, including automated testing, human-expert structured red-teaming, and model interpretability techniques, to ensure robust coverage of both known and emergent risk vectors prior to deployment.

3. **Establish a Pre-Deployment Residual Risk and Verification Limitation Disclosure Framework:** For capabilities deemed prohibitively costly or difficult to verify, the deployment process must mandate transparent documentation of the evaluation limitations, including the scope of untested capabilities and the estimated residual risk. This disclosure is subject to an independent governance body review to ensure the acceptability of the unverifiable risk profile before the model's public release.