Fine-tuning related (Fine-tuning dataset poisoning)
A deployer can poison the dataset used during the fine-tuning process [98] to induce specific, often malicious, behaviors in a model. This can be done without access to the model's weights. Such poisoning can be difficult to detect through direct inspection of the dataset, as the manipulations may be subtle and targeted.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit1105
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Establish Data Provenance and Integrity Vetting: Implement robust mechanisms, such as ML-BOM, to track data origins and transformations, and rigorously vet all third-party data sources against trusted baselines to prevent the ingestion of corrupted or malicious samples into the fine-tuning corpus.
2. Perform Adversarial Red Teaming and Continuous Behavior Monitoring: Execute structured adversarial testing campaigns (red teaming) to proactively identify potential backdoors and triggered behaviors. Simultaneously, continuously monitor model performance and training loss during and after fine-tuning, applying anomaly detection to flag and quarantine suspicious model outputs.
3. Enforce Strict Access Controls and Sandboxing: Utilize granular access controls to restrict unauthorized modification of fine-tuning datasets and training environments. Furthermore, implement sandboxing to limit the model's exposure to unverified or potentially compromised external data sources during the fine-tuning process.
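Mitigations 1 and 2 can be sketched in a few lines: a hash-based check of incoming samples against a trusted baseline (a minimal stand-in for ML-BOM-style provenance vetting), and a z-score flag on training loss as a crude form of anomaly detection. Function names, the window size, and the threshold are illustrative assumptions, not a standard API.

```python
import hashlib

def sha256_digest(record: bytes) -> str:
    """Content hash used to compare a sample against a trusted baseline."""
    return hashlib.sha256(record).hexdigest()

def vet_dataset(samples, trusted_hashes):
    """Split incoming fine-tuning samples into accepted and quarantined
    sets by checking each record's digest against a vetted baseline."""
    accepted, quarantined = [], []
    for s in samples:
        (accepted if sha256_digest(s) in trusted_hashes else quarantined).append(s)
    return accepted, quarantined

def loss_is_anomalous(losses, window=50, z_threshold=4.0):
    """Flag the latest training-loss value if it deviates sharply from
    the recent window; a toy version of the monitoring in mitigation 2."""
    if len(losses) <= window:
        return False
    recent = losses[-window - 1:-1]
    mean = sum(recent) / len(recent)
    var = sum((x - mean) ** 2 for x in recent) / len(recent)
    std = var ** 0.5 or 1e-9  # avoid division by zero on a flat window
    return abs(losses[-1] - mean) / std > z_threshold
```

In practice the trusted-hash set would be derived from the data supplier's signed manifest, and the loss monitor would run alongside the fine-tuning job so flagged steps can be quarantined for review.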
1. Establish Data Provenance and Integrity Vetting: Implement robust mechanisms, such as ML-BOM, to track data origins and transformations, and rigorously vet all third-party data sources against trusted baselines to prevent the ingestion of corrupted or malicious samples into the fine-tuning corpus. 2. Perform Adversarial Red Teaming and Continuous Behavior Monitoring: Execute structured adversarial testing campaigns (red teaming) to proactively identify potential backdoors and triggered behaviors. Simultaneously, continuously monitor model performance and training loss during and after fine-tuning, applying anomaly detection to flag and quarantine suspicious model outputs. 3. Enforce Strict Access Controls and Sandboxing: Utilize granular access controls to restrict unauthorized modification of fine-tuning datasets and training environments. Furthermore, implement sandboxing to limit the model's exposure to unverified or potentially compromised external data sources during the fine-tuning process.