2. Privacy & Security1 - Pre-deployment

Jailbreak in LLM Malicious Use - Poisoning Training Data

In the data collecting and pre-training phase, malicious adversaries can Jailbreak LLMs through poisoning their training data to make the model to output harmful content.

Source: MIT AI Risk Repositorymit1515

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

1 - Pre-deployment

Risk ID

mit1515

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Implement rigorous Data Provenance, Validation, and Curation. Enforce automated, repeatable validation protocols, including schema checks and statistical outlier detection on all training and fine-tuning datasets, prior to model ingestion. Establish a clear chain of data lineage (e.g., using ML-BOM) and prioritize curated, version-controlled sources to reduce the exposure to upstream manipulation. 2. Conduct continuous Adversarial Testing and Real-Time Behavioral Monitoring. Employ structured red-teaming simulations to proactively test for and confirm the efficacy of defenses against stealthy data poisoning and backdoor attacks. In production, utilize real-time observability to track model performance, detect unexpected drift, and employ "canary" prompts designed to reveal hidden compromised behaviors upon trigger activation. 3. Establish a structured Data and Model Integrity Incident Response Plan. Maintain versioned backups of verified clean datasets and model checkpoints to enable the immediate and automated rollback to a known-good state upon detection of a poisoning event. This must be followed by a full-scale incident investigation, root cause analysis, and the re-sanitization of data pipelines to patch vulnerabilities and prevent recurrence.