Fine-tuning related (Harmful fine-tuning of open-weights models)
Bad actors can fine-tune models with publicly available weights for harmful activities, at a small fraction of the original training cost in both time and money [115, 78].
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1104
Domain lineage
4. Malicious Actors & Misuse
4.2 > Cyberattacks, weapon development or use, and mass harm
Mitigation strategy
1. Establish robust fine-tuning governance to continuously monitor datasets, processes, and model provenance, and to proactively detect and filter malicious or inappropriate fine-tuning data prior to use.
2. Implement advanced safety-retaining fine-tuning strategies, such as integrating safety data that mimics the user's task format (e.g., paraphrasing) or employing Selective Token Masking (STM), to re-establish safety alignment and mitigate the degradation of guardrails.
3. Deploy run-time safety mechanisms and output filtering to detect and block harmful model generations post-fine-tuning, serving as a critical last line of defense before the output is presented to the end-user.
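The dataset-screening step in mitigation 1 can be sketched as a simple pre-use filter over fine-tuning examples. The patterns and function names below are hypothetical illustrations; a production governance pipeline would rely on a trained safety classifier rather than a hand-written blocklist.

```python
import re

# Hypothetical blocklist of harmful-content patterns (illustrative only;
# real governance pipelines would use a trained safety classifier).
HARMFUL_PATTERNS = [
    re.compile(r"\bbuild(ing)?\s+a\s+bomb\b", re.IGNORECASE),
    re.compile(r"\bsynthesi[sz]e\s+(a\s+)?nerve\s+agent\b", re.IGNORECASE),
]

def is_flagged(example: str) -> bool:
    """Return True if a fine-tuning example matches any harmful pattern."""
    return any(p.search(example) for p in HARMFUL_PATTERNS)

def filter_dataset(examples: list[str]) -> tuple[list[str], list[str]]:
    """Split a fine-tuning dataset into (accepted, rejected) examples
    before it is handed to the fine-tuning process."""
    accepted, rejected = [], []
    for ex in examples:
        (rejected if is_flagged(ex) else accepted).append(ex)
    return accepted, rejected

data = [
    "Translate 'hello' to French.",
    "Explain how to build a bomb at home.",
    "Summarize this news article.",
]
clean, blocked = filter_dataset(data)
print(len(clean), len(blocked))  # prints "2 1"
```

The same structure applies to the run-time output filtering in mitigation 3, with the filter applied to model generations instead of training examples.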