Fine-tuning related (Fine-tuning dataset poisoning)
A deployer can poison the dataset used during the fine-tuning process [98] to induce specific, often malicious, behaviors in a model. This can be done without access to the model's weights. Such poisoning can be difficult to detect through direct inspection of the dataset, as the manipulations may be subtle and targeted.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit1105
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Establish Data Provenance and Integrity Vetting: Implement robust mechanisms, such as ML-BOM, to track data origins and transformations, and rigorously vet all third-party data sources against trusted baselines to prevent the ingestion of corrupted or malicious samples into the fine-tuning corpus.
2. Perform Adversarial Red Teaming and Continuous Behavior Monitoring: Execute structured adversarial testing campaigns (red teaming) to proactively identify potential backdoors and triggered behaviors. Simultaneously, continuously monitor model performance and training loss during and after fine-tuning, applying anomaly detection to flag and quarantine suspicious model outputs.
3. Enforce Strict Access Controls and Sandboxing: Utilize granular access controls to restrict unauthorized modification of fine-tuning datasets and training environments. Furthermore, implement sandboxing to limit the model's exposure to unverified or potentially compromised external data sources during the fine-tuning process.
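Mitigations 1 and 2 can be sketched in a few lines: a hash-based check of incoming samples against a trusted baseline (a minimal stand-in for ML-BOM-style provenance vetting), and a z-score flag on training loss as a crude form of anomaly detection. Function names, the window size, and the threshold are illustrative assumptions, not a standard API.

```python
import hashlib

def sha256_digest(record: bytes) -> str:
    """Content hash used to compare a sample against a trusted baseline."""
    return hashlib.sha256(record).hexdigest()

def vet_dataset(samples, trusted_hashes):
    """Split incoming fine-tuning samples into accepted and quarantined
    sets by checking each record's digest against a vetted baseline."""
    accepted, quarantined = [], []
    for s in samples:
        (accepted if sha256_digest(s) in trusted_hashes else quarantined).append(s)
    return accepted, quarantined

def loss_is_anomalous(losses, window=50, z_threshold=4.0):
    """Flag the latest training-loss value if it deviates sharply from
    the recent window; a toy version of the monitoring in mitigation 2."""
    if len(losses) <= window:
        return False
    recent = losses[-window - 1:-1]
    mean = sum(recent) / len(recent)
    var = sum((x - mean) ** 2 for x in recent) / len(recent)
    std = var ** 0.5 or 1e-9  # avoid division by zero on a flat window
    return abs(losses[-1] - mean) / std > z_threshold
```

In practice the trusted-hash set would be derived from the data supplier's signed manifest, and the loss monitor would run alongside the fine-tuning job so flagged steps can be quarantined for review.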
1. Establish Data Provenance and Integrity Vetting: Implement robust mechanisms, such as ML-BOM, to track data origins and transformations, and rigorously vet all third-party data sources against trusted baselines to prevent the ingestion of corrupted or malicious samples into the fine-tuning corpus. 2. Perform Adversarial Red Teaming and Continuous Behavior Monitoring: Execute structured adversarial testing campaigns (red teaming) to proactively identify potential backdoors and triggered behaviors. Simultaneously, continuously monitor model performance and training loss during and after fine-tuning, applying anomaly detection to flag and quarantine suspicious model outputs. 3. Enforce Strict Access Controls and Sandboxing: Utilize granular access controls to restrict unauthorized modification of fine-tuning datasets and training environments. Furthermore, implement sandboxing to limit the model's exposure to unverified or potentially compromised external data sources during the fine-tuning process.