Vulnerability to Poisoning and Backdoors
The previous section explored jailbreaks and other forms of adversarial prompts as ways to elicit harmful capabilities acquired during pretraining. These methods make no assumptions about the training data. Poisoning attacks (Biggio et al., 2012), by contrast, perturb the training data to introduce specific vulnerabilities, called backdoors, that the adversary can then exploit at inference time. Poisoning is a pressing concern for current large language models because they are trained on data gathered from untrusted sources (e.g., the internet), which an adversary can easily poison (Carlini et al., 2023b).
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit1506
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Implement rigorous data provenance and input sanitization: Establish data governance protocols that track the origin, lineage, and transformation history of all training and fine-tuning datasets. Apply statistical outlier and anomaly detection to sanitize and validate inputs, preventing maliciously perturbed or backdoored data from entering the model corpus before deployment.
2. Utilize model hardening and repair techniques: Apply post-training model repair, such as structured pruning (e.g., gradient-based or layer-wise pruning), to remove model components (neurons or attention heads) empirically associated with backdoor-activated pathways. Concurrently, incorporate adversarial training into the fine-tuning process to harden the model against known and novel adversarial examples.
3. Deploy runtime consistency and output verification: Institute an inference-time defense that verifies the internal consistency between the user's prompt, the LLM's generated reasoning or plan, and its ultimate execution or action. Layer this two-level check with continuous output monitoring and filtering guardrails to detect and mitigate anomalous or policy-violating behavior triggered by backdoors that evaded training-time defenses.
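To illustrate the first strategy, the following is a minimal sketch of statistical outlier filtering over candidate training examples. All names, the embedding input, and the z-score threshold are illustrative assumptions, not prescribed by this entry; production pipelines would combine several detectors and provenance signals.

```python
import numpy as np

def filter_outliers(features: np.ndarray, z_threshold: float = 3.0) -> np.ndarray:
    """Return a boolean mask of examples to KEEP.

    features: (n_examples, d) array, e.g. sentence embeddings of
    candidate training documents (hypothetical input representation).
    Examples whose distance from the dataset centroid lies more than
    z_threshold standard deviations above the mean distance are
    flagged as potential poison and dropped.
    """
    centroid = features.mean(axis=0)
    dists = np.linalg.norm(features - centroid, axis=1)
    z = (dists - dists.mean()) / (dists.std() + 1e-12)
    return z < z_threshold

# Usage: 200 benign examples near the origin plus 5 planted outliers.
rng = np.random.default_rng(0)
benign = rng.normal(0.0, 1.0, size=(200, 16))
poison = rng.normal(8.0, 0.5, size=(5, 16))   # far from the benign cluster
data = np.vstack([benign, poison])
mask = filter_outliers(data)
```

This catches only poison that is anomalous in the chosen feature space; stealthy poison crafted to sit inside the benign distribution motivates the complementary strategies below.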
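The pruning step of the second strategy can likewise be sketched in simplified form: rank hidden units by how much more strongly they activate on suspected trigger inputs than on clean inputs, then disconnect the top offenders. The layer shapes, probe sets, and function name are hypothetical; real repair methods score units with gradients or layer-wise statistics rather than a raw activation gap.

```python
import numpy as np

def prune_suspect_neurons(W: np.ndarray,
                          clean_acts: np.ndarray,
                          trigger_acts: np.ndarray,
                          k: int) -> np.ndarray:
    """Zero out the k hidden units most associated with the trigger.

    W: (hidden, out) weight matrix of the layer being repaired.
    clean_acts / trigger_acts: (n, hidden) activations recorded on
    clean inputs and on inputs carrying the suspected backdoor trigger.
    """
    gap = trigger_acts.mean(axis=0) - clean_acts.mean(axis=0)
    suspects = np.argsort(gap)[-k:]      # units with the largest activation gap
    W_pruned = W.copy()
    W_pruned[suspects, :] = 0.0          # disconnect those units downstream
    return W_pruned

# Usage: unit 3 fires strongly only when the trigger is present.
hidden, out = 8, 4
W = np.ones((hidden, out))
clean = np.full((10, hidden), 0.1)
trig = clean.copy()
trig[:, 3] = 5.0
W_new = prune_suspect_neurons(W, clean, trig, k=1)
```

Pruning trades a small amount of clean accuracy for removing the backdoor pathway, which is why the entry pairs it with adversarial fine-tuning to recover robustness.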