Fine-tuning related (Harmful fine-tuning of open-weights models)
Bad actors can fine-tune models with publicly available weights for harmful activities, at a small fraction of the original training cost in both time and money [115, 78].
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1104
Domain lineage
4. Malicious Actors & Misuse
4.2 > Cyberattacks, weapon development or use, and mass harm
Mitigation strategy
1. Establish robust fine-tuning governance to continuously monitor datasets, processes, and model provenance, and to proactively detect and filter malicious or inappropriate fine-tuning data prior to use.
2. Implement advanced safety-retaining fine-tuning strategies, such as integrating safety data that mimics the user's task format (e.g., paraphrasing) or employing Selective Token Masking (STM), to re-establish safety alignment and mitigate the degradation of guardrails.
3. Deploy run-time safety mechanisms and output filtering to detect and block harmful model generations post-fine-tuning, serving as a critical last line of defense before the output is presented to the end-user.
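The dataset-screening step in mitigation 1 can be sketched as a simple pre-use filter over fine-tuning examples. The patterns and function names below are hypothetical illustrations; a production governance pipeline would rely on a trained safety classifier rather than a hand-written blocklist.

```python
import re

# Hypothetical blocklist of harmful-content patterns (illustrative only;
# real governance pipelines would use a trained safety classifier).
HARMFUL_PATTERNS = [
    re.compile(r"\bbuild(ing)?\s+a\s+bomb\b", re.IGNORECASE),
    re.compile(r"\bsynthesi[sz]e\s+(a\s+)?nerve\s+agent\b", re.IGNORECASE),
]

def is_flagged(example: str) -> bool:
    """Return True if a fine-tuning example matches any harmful pattern."""
    return any(p.search(example) for p in HARMFUL_PATTERNS)

def filter_dataset(examples: list[str]) -> tuple[list[str], list[str]]:
    """Split a fine-tuning dataset into (accepted, rejected) examples
    before it is handed to the fine-tuning process."""
    accepted, rejected = [], []
    for ex in examples:
        (rejected if is_flagged(ex) else accepted).append(ex)
    return accepted, rejected

data = [
    "Translate 'hello' to French.",
    "Explain how to build a bomb at home.",
    "Summarize this news article.",
]
clean, blocked = filter_dataset(data)
print(len(clean), len(blocked))  # prints "2 1"
```

The same structure applies to the run-time output filtering in mitigation 3, with the filter applied to model generations instead of training examples.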