1. Discrimination & Toxicity

Toxicity generation

These evaluations assess whether an LLM generates toxic text when prompted. In this context, toxicity is an umbrella term that encompasses hate speech, abusive language, violent speech, and profane language (Liang et al., 2022).
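The evaluation described above can be sketched as a simple loop: prompt the model, score each completion with a toxicity classifier, and report the fraction of completions flagged as toxic. In this sketch, `generate` and `toxicity_score` are hypothetical stand-ins for an LLM API call and a trained toxicity classifier (a Perspective-style 0-to-1 scorer); they are not real library functions.

```python
def generate(prompt: str) -> str:
    """Stand-in for an LLM API call; a real evaluation would query the model under test."""
    return "placeholder completion for: " + prompt


def toxicity_score(text: str) -> float:
    """Stand-in for a trained toxicity classifier returning a score in [0, 1]."""
    return 0.0


def toxicity_rate(prompts: list[str], threshold: float = 0.5) -> float:
    """Fraction of model completions the classifier flags as toxic."""
    completions = [generate(p) for p in prompts]
    flagged = sum(toxicity_score(c) >= threshold for c in completions)
    return flagged / len(completions)
```

Benchmarks such as those surveyed by Liang et al. (2022) follow this general shape, varying the prompt sets (e.g., deliberately provocative prompts) and the classifier used for scoring.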

Source: MIT AI Risk Repository (mit647)

ENTITY: 2 - AI

INTENT: 3 - Other

TIMING: 3 - Other

Risk ID: mit647

Domain lineage

1. Discrimination & Toxicity (156 mapped risks) > 1.2 Exposure to toxic content

Mitigation strategy

1. Pre-training data curation and enhancement. Rigorously apply data cleansing and filtering protocols to remove toxic, offensive, and biased content from pre-training and fine-tuning corpora, since imperfect training data is the primary source of learned toxicity. Complement this by enriching datasets with safety-explicit contextual attributes that guide the model toward inherently safer conditional generation.

2. Advanced safety alignment and model intervention. Employ post-training alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF), to embed human safety preferences directly in the model's behavior. In addition, use mathematically principled, training-free intervention methods, such as eigen-decomposition of the final output layer (e.g., EigenShift), to steer the model's generation away from toxic linguistic components in real time without compromising linguistic competence.

3. Multi-layered runtime guardrails and adversarial validation. Deploy a multi-layered content moderation system at the inference stage, combining automated machine-learning classifiers, prompt refusal mechanisms, and keyword filtering, to block potentially harmful outputs before they reach the end user. Continuously evaluate and harden this system against sophisticated adversarial attacks (red teaming) to ensure robustness against evolving jailbreaking and prompt-injection techniques.
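The layered guardrail in point 3 can be sketched as a cheap keyword screen backed by a classifier check. Everything here is hypothetical: `BLOCKLIST` is a placeholder, and `classifier_score` stands in for a trained ML toxicity classifier (production systems use models such as Llama Guard or a Perspective-style scorer, not a keyword lookup).

```python
# Placeholder blocklist; a real deployment maintains a curated, audited list.
BLOCKLIST = {"slur1", "slur2"}


def keyword_filter(text: str) -> bool:
    """Layer 1: cheap keyword screen. Returns True if the text is flagged."""
    return any(token in BLOCKLIST for token in text.lower().split())


def classifier_score(text: str) -> float:
    """Layer 2: stand-in for an ML toxicity classifier returning a score in [0, 1]."""
    # A real deployment would call a trained model here; this stub just
    # mirrors the keyword screen so the sketch is self-contained.
    return 1.0 if keyword_filter(text) else 0.0


def moderate(output: str, threshold: float = 0.5) -> str:
    """Block the model output if either guardrail layer flags it."""
    if keyword_filter(output) or classifier_score(output) >= threshold:
        return "[response withheld by content filter]"
    return output
```

Layering a fast lexical filter before a heavier classifier is a common design choice: the keyword screen catches unambiguous cases cheaply, while the classifier handles toxicity that no fixed word list can capture. Both layers are exactly what red teaming (also point 3) probes for bypasses.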