Fine-tuning related (Unexpected competence in fine-tuned versions of the upstream model)
Downstream deployers often fine-tune a GPAI model on deployment-specific datasets to better suit their task. Fine-tuned models can gain new or unexpected capabilities that the underlying upstream model did not exhibit [202, 126, 137]. These new capabilities may be unanticipated by the original model developer.
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit1103
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. Integrate Safety-Preserving Mechanisms During Fine-Tuning
Employ data and process controls, such as strategically mixing safety data (mimicking the task-specific format) or using gradient-based safety-aware probes (e.g., SAP), to restrict parameter updates that could erode the model's existing alignment and cause unintentional safety degradation during fine-tuning optimization.
2. Implement Robust Inference-Time Safety Guardrails
Deploy dynamic post-fine-tuning safeguards as a final defense layer against newly revealed, unexpected capabilities. These include methods such as Prefix INjection Guard (PING) or advanced system-prompting techniques that guide the fine-tuned model to increase refusal rates for harmful requests while maintaining efficacy on benign, agentic tasks.
3. Mandate Systemic Transparency and Continuous Risk Assessment
Establish a governance framework that requires downstream deployers to conduct rigorous pre-deployment evaluations for the emergence of unanticipated, dangerous capabilities. Couple this with continuous post-deployment monitoring for anomalous or harmful outputs, including mechanisms for transparent reporting of safety failures back to the upstream model provider.
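The safety-data mixing idea in strategy 1 can be sketched as a simple dataset-preparation step. The function below is a minimal, illustrative sketch, not part of any cited method: the function name, the dict-based example format, and the `safety_ratio` parameter are all assumptions for illustration.

```python
import random

def mix_safety_data(task_examples, safety_examples, safety_ratio=0.1, seed=0):
    """Interleave safety examples into a task-specific fine-tuning set.

    Each example is assumed to be a dict like {"prompt": ..., "response": ...}.
    `safety_ratio` controls what fraction of the final mixed set is safety data,
    so fine-tuning sees alignment-preserving examples alongside the task data.
    """
    rng = random.Random(seed)
    # Number of safety examples needed so they make up ~safety_ratio of the mix.
    n_safety = max(1, int(len(task_examples) * safety_ratio / (1 - safety_ratio)))
    # Sample with replacement in case the safety pool is smaller than needed.
    sampled = [rng.choice(safety_examples) for _ in range(n_safety)]
    mixed = task_examples + sampled
    rng.shuffle(mixed)
    return mixed
```

In practice the safety examples would be reformatted to mimic the task-specific prompt/response style, as the strategy notes, before being mixed in.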