
Toxicity in LLM Malicious Use

Toxicity in LLMs refers to the generation of harmful, offensive, or inappropriate content that can cause harm to individuals or groups. Both explicit and implicit forms of toxicity can be generated by LLMs, posing significant risks to society. Explicit toxicity encompasses a wide range of negative behaviors, including hate speech, harassment, cyberbullying, rude and disrespectful comments, and derogatory language, as well as allocational harms [2, 62, 90]. Implicit toxicity, by contrast, does not involve overtly harmful language but may manifest through subtle forms such as sarcasm, irony, and humor, making it more difficult to detect [103, 213].

Source: MIT AI Risk Repository (mit1514)

ENTITY: 2 - AI

INTENT: 3 - Other

TIMING: 2 - Post-deployment

Risk ID: mit1514

Domain lineage: 1. Discrimination & Toxicity > 1.2 Exposure to toxic content (156 mapped risks)

Mitigation strategy

1. Proactive Data Curation and Sanitization
Proactive curation and cleansing of the pre-training and fine-tuning datasets to eliminate toxic, biased, and low-quality content, thereby mitigating the model's propensity to learn and replicate harmful language. This foundational strategy may include leveraging differential privacy techniques to further sanitize sensitive information.

2. Advanced Model Alignment and Optimization
Implementation of advanced alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO), to fine-tune the LLM to adhere to the harmlessness criterion. This ensures the model's responses reliably align with ethical standards and robustly implement policy-compliant refusal strategies against malicious prompts.

3. Multi-Layered Run-Time Guardrails
Deployment of multi-layered run-time guardrails, incorporating external toxicity classifiers and continuous output monitoring, to detect and block malicious inputs and filter/rewrite toxic outputs in real time. This defense mechanism is crucial for identifying subtle, implicit forms of toxicity like sarcasm and coded language that are difficult to detect via surface-level analysis.
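Strategy 1 can be sketched as a filtering pass over the corpus. The sketch below is hypothetical: a real pipeline would score documents with a trained toxicity classifier, but a keyword blocklist stands in here so the example stays self-contained. `BLOCKLIST`, `toxicity_score`, and `sanitize` are illustrative names, not part of any real library.

```python
# Hypothetical data-sanitization pass (strategy 1).
# A trained classifier would replace this blocklist in practice.
BLOCKLIST = {"badword1", "badword2"}  # placeholder terms, not real data


def toxicity_score(text: str) -> float:
    """Fraction of tokens hitting the blocklist (stand-in metric)."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in BLOCKLIST)
    return hits / len(tokens)


def sanitize(corpus, threshold=0.0):
    """Keep only documents whose toxicity score is at or below threshold."""
    return [doc for doc in corpus if toxicity_score(doc) <= threshold]


corpus = ["a harmless training sentence", "text containing badword1 here"]
clean = sanitize(corpus)  # only the harmless document survives
```

The threshold would normally be tuned on a held-out sample to balance data loss against residual toxicity.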
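For strategy 2, the core of DPO is a single loss term comparing how much the fine-tuned policy prefers the harmless response over the harmful one, relative to a frozen reference model. The sketch below assumes per-sequence log-probabilities are already computed; the function name and inputs are illustrative.

```python
import math

# Sketch of the DPO objective (strategy 2). logp_w / logp_l are the
# policy's log-probabilities of the preferred (harmless) and rejected
# (toxic) responses; ref_* are the frozen reference model's values.
def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Implicit reward margin between preferred and rejected responses.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Negative log-sigmoid of the margin: small when the policy
    # already prefers the harmless response more than the reference.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))


# Policy has shifted toward the harmless response relative to the
# reference, so the loss falls below log(2) (the no-preference value).
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.0)
```

In training, this loss is averaged over a dataset of (prompt, preferred, rejected) triples and minimized with gradient descent on the policy's parameters only.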
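Strategy 3 amounts to wrapping the model call with checks on both sides. A minimal sketch, assuming a toy `classifier` stands in for an external toxicity classifier and `model` is any callable that returns text; both names are hypothetical.

```python
# Hypothetical run-time guardrail (strategy 3): screen the prompt,
# refuse if flagged, otherwise generate and post-filter the output.
def classifier(text: str) -> bool:
    """Toy toxicity check; production systems use trained classifiers."""
    return any(w in text.lower() for w in ("attack", "harass"))


def guarded_respond(prompt: str, model) -> str:
    if classifier(prompt):                       # input-side guardrail
        return "Refused: request violates the usage policy."
    output = model(prompt)
    if classifier(output):                       # output-side guardrail
        return "[response withheld by safety filter]"
    return output


# A benign prompt passes through; a flagged one is refused up front.
reply = guarded_respond("summarize this article", model=lambda p: "a summary")
```

Layering both checks matters because implicit toxicity may only surface in the generated output, not in the prompt itself.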