Back to the MIT repository
2. Privacy & Security1 - Pre-deployment

Fine-tuning related (Poisoning models during instruction tuning)

AI models can be poisoned during instruction tuning when models are tuned using pairs of instructions and desired outputs. Poisoning in instruction tuning can be achieved with a lower number of compromised samples, as instruction tuning requires a relatively small number of samples for fine-tuning [155, 211]. Anonymous crowdsourcing efforts may be employed in collecting instruction tuning datasets and can further contribute to poisoning attacks [187]. These attacks might be harder to detect than traditional data poisoning attacks.

Source: MIT AI Risk Repositorymit1106

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

1 - Pre-deployment

Risk ID

mit1106

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Implement rigorous, pre-deployment data integrity checks on instruction-tuning datasets by leveraging advanced techniques such as influence function analysis or high-loss sample flagging to systematically detect and remove malicious, poisoned examples, thereby ensuring the foundational cleanliness of the training corpus. 2. Employ active defense training pipelines, such as the MB-Defense framework, which integrates complementary stages of defensive poisoning and weight recovery to neutralize both attacker-injected and defensive backdoor representations during the fine-tuning process. 3. Integrate post-training mechanisms, specifically utilizing In-Context Learning (ICL) during inference with clean, trusted examples to prompt correct model behavior, and apply Reinforcement Learning from Human Feedback (RLHF) with clean demonstrations to mitigate residual backdoor vulnerabilities.