Jailbreak in LLM Malicious Use - Poisoning Training Data
In the data collecting and pre-training phase, malicious adversaries can Jailbreak LLMs through poisoning their training data to make the model to output harmful content.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit1515
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Implement rigorous Data Provenance, Validation, and Curation. Enforce automated, repeatable validation protocols, including schema checks and statistical outlier detection on all training and fine-tuning datasets, prior to model ingestion. Establish a clear chain of data lineage (e.g., using ML-BOM) and prioritize curated, version-controlled sources to reduce the exposure to upstream manipulation. 2. Conduct continuous Adversarial Testing and Real-Time Behavioral Monitoring. Employ structured red-teaming simulations to proactively test for and confirm the efficacy of defenses against stealthy data poisoning and backdoor attacks. In production, utilize real-time observability to track model performance, detect unexpected drift, and employ "canary" prompts designed to reveal hidden compromised behaviors upon trigger activation. 3. Establish a structured Data and Model Integrity Incident Response Plan. Maintain versioned backups of verified clean datasets and model checkpoints to enable the immediate and automated rollback to a known-good state upon detection of a poisoning event. This must be followed by a full-scale incident investigation, root cause analysis, and the re-sanitization of data pipelines to patch vulnerabilities and prevent recurrence.