7. AI System Safety, Failures, & Limitations
1 - Pre-deployment

Fine-tuning related (Unexpected competence in fine-tuned versions of the upstream model)

Downstream deployers often fine-tune a GPAI model with specific deployment-related datasets to better suit the task. Fine-tuned upstream models can gain new or unexpected capabilities that the underlying upstream models did not exhibit [202, 126, 137]. These new capabilities may be unanticipated by the original model developer.

Source: MIT AI Risk Repository (mit1103)

ENTITY

1 - Human

INTENT

2 - Unintentional

TIMING

1 - Pre-deployment

Risk ID

mit1103

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Integrate Safety-Preserving Mechanisms During Fine-Tuning
Employ sophisticated data and process controls, such as the strategic mixing of safety data (mimicking the task-specific format) or the use of gradient-based safety-aware probes (e.g., SAP), to actively restrict parameter updates that could erode the model's existing alignment and cause unintentional safety degradation during the fine-tuning optimization phase.

2. Implement Robust Inference-Time Safety Guardrails
Deploy dynamic post-fine-tuning safeguards that serve as a final defense layer against revealed unexpected capabilities. This includes methods like Prefix INjection Guard (PING) or advanced system-prompting techniques that guide the fine-tuned model to increase refusal rates for harmful requests while maintaining efficacy on benign, agentic tasks.

3. Mandate Systemic Transparency and Continuous Risk Assessment
Establish a governance framework that requires downstream deployers to conduct rigorous pre-deployment evaluations for the emergence of unanticipated, dangerous capabilities. This must be coupled with continuous post-deployment monitoring for anomalous or harmful outputs, including mechanisms for transparent reporting of safety failures back to the upstream model provider.
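
Illustrative sketch

Two of the mitigation steps above can be pictured with a small Python sketch: mixing safety examples into a fine-tuning set (step 1) and comparing benchmark scores of the base and fine-tuned model to flag unexpected capability gains before deployment (step 3). This is only a minimal illustration under stated assumptions. The function names, the 10% mixing fraction, the 0.1 gain threshold, and the benchmark names are hypothetical and do not come from the repository entry or from the cited methods (it is not an implementation of SAP or PING).

```python
import random
from typing import Dict, List, Sequence


def mix_safety_data(
    task_examples: Sequence[str],
    safety_examples: Sequence[str],
    safety_fraction: float = 0.1,  # assumed value, not from the source
    seed: int = 0,
) -> List[str]:
    """Hypothetical helper: return a shuffled fine-tuning set in which
    roughly `safety_fraction` of all examples are safety examples."""
    if not 0.0 <= safety_fraction < 1.0:
        raise ValueError("safety_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n / (len(task) + n) = safety_fraction for n.
    n_safety = round(safety_fraction * len(task_examples) / (1.0 - safety_fraction))
    if n_safety > 0 and not safety_examples:
        raise ValueError("no safety examples available to mix in")
    # Sample with replacement so a small safety pool still works.
    sampled = [rng.choice(safety_examples) for _ in range(n_safety)]
    mixed = list(task_examples) + sampled
    rng.shuffle(mixed)
    return mixed


def flag_capability_gains(
    base_scores: Dict[str, float],
    tuned_scores: Dict[str, float],
    threshold: float = 0.1,  # assumed value, not from the source
) -> List[str]:
    """Hypothetical helper: list benchmarks on which the fine-tuned model
    beats the base model by more than `threshold`."""
    return sorted(
        name
        for name, base in base_scores.items()
        if name in tuned_scores and tuned_scores[name] - base > threshold
    )


if __name__ == "__main__":
    task = [f"task-{i}" for i in range(90)]
    safety = ["refusal-a", "refusal-b", "refusal-c"]
    mixed = mix_safety_data(task, safety, safety_fraction=0.1)
    print(len(mixed))  # 100: 90 task examples plus 10 sampled safety examples

    base = {"coding": 0.20, "hazard-knowledge": 0.10}    # made-up scores
    tuned = {"coding": 0.25, "hazard-knowledge": 0.40}   # made-up scores
    print(flag_capability_gains(base, tuned))  # ['hazard-knowledge']
```

In practice the flagged benchmarks would feed the pre-deployment review and the reporting channel back to the upstream provider described in step 3. The threshold and the choice of benchmarks are judgment calls for the deployer.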