Fine-tuning related (Degrading safety training due to benign fine-tuning)
When downstream providers fine-tune an AI model to better suit their needs, the resulting model can be more likely to produce undesired or harmful outputs than the original model, even when the fine-tuning uses harmless, commonly used data [154].
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
2 - Post-deployment
Risk ID
mit1107
Domain lineage
7. AI System Safety, Failures, & Limitations
7.0 > AI system safety, failures, & limitations
Mitigation strategy
1. **Implement Safety-Aware Optimization During Fine-Tuning.** Employ specialized frameworks, such as Safety-Aware Probing (SAP) or similar gradient-based techniques, that actively incorporate safety-relevant data and constraints into the fine-tuning process. This intervention identifies and mitigates detrimental gradient directions, preserving the model's safety alignment while optimizing for task-specific performance.
2. **Mandate Rigorous Post-Deployment Red Teaming and Evaluation.** Conduct extensive, systematic post-fine-tuning safety assessments, including red teaming with adversarial prompts and multi-run experiments, to empirically quantify safety degradation. Evaluations must span the full range of accessible generation parameters to capture the realistic safety profile and behavioral variability of the customized model.
3. **Establish Transparency Mandates for Downstream Users.** Implement clear, explicit, and proactive disclosure policies that notify downstream providers that fine-tuning, even with non-harmful data, can inadvertently compromise the base model's safety guardrails. This shifts the regulatory and operational focus toward continuous risk mitigation and oversight of the subsequent deployment and usage context.
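One way to realize mitigation 1 in practice is to interleave held-out safety examples into every fine-tuning batch, so that alignment-preserving gradients remain present throughout training. The sketch below is illustrative only: `task_data`, `safety_data`, and the batch-mixing ratio are assumptions, not part of any specific framework named above.

```python
import random

def mixed_batches(task_data, safety_data, batch_size, safety_frac=0.25, seed=0):
    """Yield fine-tuning batches where a fixed fraction of each batch
    is drawn from a held-out safety dataset (a sketch of mitigation 1).

    task_data    -- list of task-specific training examples (assumed format)
    safety_data  -- list of safety-alignment examples (assumed format)
    safety_frac  -- fraction of each batch reserved for safety examples
    """
    rng = random.Random(seed)
    n_safety = max(1, int(batch_size * safety_frac))
    n_task = batch_size - n_safety
    for start in range(0, len(task_data), n_task):
        batch = task_data[start:start + n_task]
        # Sample fresh safety examples for every batch so the safety
        # signal is present at each optimization step.
        batch += rng.sample(safety_data, n_safety)
        rng.shuffle(batch)
        yield batch
```

In a real training loop, each yielded batch would be passed to the optimizer step of whatever fine-tuning library is in use; the key design choice is that safety examples appear in every step, not only in a separate phase.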
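Mitigation 2 calls for evaluating safety across the full range of accessible generation parameters. A minimal harness for that idea is sketched below: it sweeps sampling temperatures and records the refusal rate on a red-team prompt set. The prompts, the `generate` stub, and the refusal check are all hypothetical stand-ins for a real model API and a real safety classifier.

```python
# Hypothetical red-team prompts; a real evaluation would use a curated
# adversarial prompt set.
RED_TEAM_PROMPTS = ["prompt-a", "prompt-b", "prompt-c"]

def generate(prompt, temperature):
    """Stub standing in for the fine-tuned model's generation API.
    For illustration, it pretends the model stops refusing at high
    temperature -- the kind of parameter-dependent behavior the sweep
    is meant to surface."""
    return "refused" if temperature < 0.8 else "complied"

def is_safe(output):
    """Stub safety judgment; a real harness would use a classifier
    or human review."""
    return output == "refused"

def safety_sweep(prompts, temperatures):
    """Return the refusal rate at each sampling temperature."""
    results = {}
    for t in temperatures:
        safe = sum(is_safe(generate(p, t)) for p in prompts)
        results[t] = safe / len(prompts)
    return results

rates = safety_sweep(RED_TEAM_PROMPTS, [0.2, 0.7, 1.0])
```

Reporting only the default-temperature refusal rate would hide the degradation the sweep exposes at higher temperatures, which is why the mitigation insists on covering all accessible generation parameters.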