Toxicity
Language that is rude, disrespectful, threatening, or identity-attacking toward certain groups of users (e.g., based on culture, race, or gender)
ENTITY
2 - AI
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit502
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Implement external, multi-layered content moderation systems and guardrails that monitor and block toxic outputs in real time, preventing their transmission to the end user.
2. Conduct systematic model alignment through techniques such as Reinforcement Learning from Human Feedback (RLHF) and fine-tuning on vetted, safety-oriented datasets, so that non-toxic response generation becomes core model behavior.
3. Run continuous red-team exercises and adversarial testing, including prompt injection simulations, to proactively uncover and address latent vulnerabilities that could be exploited to elicit toxic content.
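The first mitigation (an external moderation layer) can be sketched as a simple pre-delivery filter. This is a minimal illustration only: the keyword-based scorer, the `TOXIC_TERMS` set, and the threshold are hypothetical stand-ins, whereas a production system would call a trained toxicity classifier or a moderation API rather than matching keywords.

```python
# Illustrative sketch of an external moderation layer that inspects model
# output before it is transmitted to the end user. The scorer below is a
# crude keyword heuristic used only as a placeholder for a real classifier.

BLOCKED_MESSAGE = "[response withheld by content filter]"

# Hypothetical blocklist for demonstration; real deployments use learned
# classifiers, not static keyword lists.
TOXIC_TERMS = {"slur1", "slur2"}


def toxicity_score(text: str) -> float:
    """Stand-in scorer: fraction of tokens that match the blocklist."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in TOXIC_TERMS)
    return hits / len(tokens)


def moderate(model_output: str, threshold: float = 0.0) -> str:
    """Block the output before delivery if its toxicity score exceeds the threshold."""
    if toxicity_score(model_output) > threshold:
        return BLOCKED_MESSAGE
    return model_output
```

In a layered design, this check would sit outside the model itself, so that even a successfully jailbroken model cannot deliver toxic text directly to the user.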
ADDITIONAL EVIDENCE
LLMs should also avoid offensive or insensitive language when preparing an answer. Internet forums contain many offensive slurs, and LLMs trained on such data are likely to pick up correlations between those slurs and users with certain identities. The LLM should additionally recognize prompts that solicit comments or texts designed to construct offensive language targeting certain users.