Fine-tuning related (Poisoning models during instruction tuning)
AI models can be poisoned during instruction tuning when models are tuned using pairs of instructions and desired outputs. Poisoning in instruction tuning can be achieved with a lower number of compromised samples, as instruction tuning requires a relatively small number of samples for fine-tuning [155, 211]. Anonymous crowdsourcing efforts may be employed in collecting instruction tuning datasets and can further contribute to poisoning attacks [187]. These attacks might be harder to detect than traditional data poisoning attacks.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit1106
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Implement rigorous, pre-deployment data integrity checks on instruction-tuning datasets by leveraging advanced techniques such as influence function analysis or high-loss sample flagging to systematically detect and remove malicious, poisoned examples, thereby ensuring the foundational cleanliness of the training corpus. 2. Employ active defense training pipelines, such as the MB-Defense framework, which integrates complementary stages of defensive poisoning and weight recovery to neutralize both attacker-injected and defensive backdoor representations during the fine-tuning process. 3. Integrate post-training mechanisms, specifically utilizing In-Context Learning (ICL) during inference with clean, trusted examples to prompt correct model behavior, and apply Reinforcement Learning from Human Feedback (RLHF) with clean demonstrations to mitigate residual backdoor vulnerabilities.