Toxicity
Language that is rude, disrespectful, threatening, or identity-attacking toward certain groups of users (e.g., based on culture, race, or gender)
ENTITY
2 - AI
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit502
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Implement external, multi-layered content moderation systems and guardrails that monitor and block toxic outputs in real time, preventing their transmission to the end user.
2. Conduct systematic model alignment through techniques such as Reinforcement Learning from Human Feedback (RLHF) and fine-tuning on vetted, safety-oriented datasets, so that non-toxic response generation becomes core model behavior.
3. Run continuous red-team exercises and adversarial testing, including prompt injection simulations, to proactively uncover and address latent vulnerabilities that could be exploited to elicit toxic content.
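The first mitigation (an external moderation layer) can be sketched as a simple pre-delivery filter. This is a minimal illustration only: the keyword-based scorer, the `TOXIC_TERMS` set, and the threshold are hypothetical stand-ins, whereas a production system would call a trained toxicity classifier or a moderation API rather than matching keywords.

```python
# Illustrative sketch of an external moderation layer that inspects model
# output before it is transmitted to the end user. The scorer below is a
# crude keyword heuristic used only as a placeholder for a real classifier.

BLOCKED_MESSAGE = "[response withheld by content filter]"

# Hypothetical blocklist for demonstration; real deployments use learned
# classifiers, not static keyword lists.
TOXIC_TERMS = {"slur1", "slur2"}


def toxicity_score(text: str) -> float:
    """Stand-in scorer: fraction of tokens that match the blocklist."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in TOXIC_TERMS)
    return hits / len(tokens)


def moderate(model_output: str, threshold: float = 0.0) -> str:
    """Block the output before delivery if its toxicity score exceeds the threshold."""
    if toxicity_score(model_output) > threshold:
        return BLOCKED_MESSAGE
    return model_output
```

In a layered design, this check would sit outside the model itself, so that even a successfully jailbroken model cannot deliver toxic text directly to the user.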
ADDITIONAL EVIDENCE
LLMs should also avoid offensive or insensitive language when preparing an answer. Internet forums contain many offensive slurs, and LLMs trained on such data are likely to pick up correlations between those slurs and users with certain identities. The LLM should additionally recognize prompts that solicit comments or texts designed to construct offensive language targeting certain users.