Unintentional generation of harmful content
Generative models can produce harmful or discriminatory content even from benign user requests. A model may exhibit bias toward particular harmful styles of generation (e.g., sexualization of photos of women [87] in the case of image generation models), or it may generate toxic, misleading, or violent content (e.g., a joke-generating model relying on ethnic stereotypes or slurs to deliver humor).
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
2 - Post-deployment
Risk ID
mit1179
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Establish a robust, multi-layered content filtering and moderation system (guardrails) on the model's output, using neural classification models to detect and block content related to hate, toxicity, and discrimination in real time before it reaches the end user.
2. Employ Reinforcement Learning from Human Feedback (RLHF) and targeted fine-tuning to align the generative model's behavior with defined safety and ethical principles, systematically penalizing outputs flagged as toxic or discriminatory to reduce the propensity for unintentional harm.
3. Conduct continuous adversarial testing (red teaming) and bias audits post-deployment to proactively identify latent biases and novel prompt-based triggers for toxic content, feeding these findings back into the model alignment and content filter refinement loop.
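The first mitigation, an output guardrail, can be sketched as a check that runs every candidate response through a toxicity classifier before release. This is a minimal illustration only: the keyword-based scorer, the `score_toxicity` and `guardrail` names, and the threshold value are all hypothetical stand-ins for a real neural classification model and a production policy.

```python
# Hypothetical sketch of an output guardrail: score the model's candidate
# response for toxicity and block it before it reaches the end user.
# The keyword scorer is a toy stand-in for a neural classifier.

BLOCKLIST = {"slurword", "hateword"}  # placeholder terms, not a real lexicon
REFUSAL = "This response was blocked by the content filter."


def score_toxicity(text: str) -> float:
    """Return a toxicity score in [0, 1]. Stand-in for a neural classifier."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in BLOCKLIST)
    # Scale hit ratio so that even a few flagged terms push the score up.
    return min(1.0, (hits / len(tokens)) * 5)


def guardrail(candidate: str, threshold: float = 0.5) -> str:
    """Release the model output only if it passes the toxicity check."""
    if score_toxicity(candidate) >= threshold:
        return REFUSAL
    return candidate
```

In a real deployment the classifier would be a separately trained moderation model, and blocked outputs would typically be logged to feed the red-teaming and refinement loop described in item 3.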