Unintentional generation of harmful content
Generative models can produce harmful or discriminatory content even from benign user requests. A model may exhibit bias toward particular harmful styles of generation (e.g., sexualization of photos of women [87] in the case of image generation models), or it may generate toxic, misleading, or violent content (e.g., a joke-generating model relying on ethnic stereotypes or slurs to deliver humor).
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
2 - Post-deployment
Risk ID
mit1179
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Establish a robust, multi-layered content filtering and moderation system (guardrails) on the model's output, using neural classification models to detect and block content related to hate, toxicity, and discrimination in real time before it reaches the end user.
2. Employ Reinforcement Learning from Human Feedback (RLHF) and targeted fine-tuning to align the generative model's behavior with defined safety and ethical principles, systematically penalizing outputs flagged as toxic or discriminatory to reduce the propensity for unintentional harm.
3. Conduct continuous adversarial testing (red teaming) and bias audits post-deployment to proactively identify latent biases and novel prompt-based triggers for toxic content, feeding these findings back into the model alignment and content filter refinement loop.
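The first mitigation, an output guardrail, can be sketched as a check that runs every candidate response through a toxicity classifier before release. This is a minimal illustration only: the keyword-based scorer, the `score_toxicity` and `guardrail` names, and the threshold value are all hypothetical stand-ins for a real neural classification model and a production policy.

```python
# Hypothetical sketch of an output guardrail: score the model's candidate
# response for toxicity and block it before it reaches the end user.
# The keyword scorer is a toy stand-in for a neural classifier.

BLOCKLIST = {"slurword", "hateword"}  # placeholder terms, not a real lexicon
REFUSAL = "This response was blocked by the content filter."


def score_toxicity(text: str) -> float:
    """Return a toxicity score in [0, 1]. Stand-in for a neural classifier."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in BLOCKLIST)
    # Scale hit ratio so that even a few flagged terms push the score up.
    return min(1.0, (hits / len(tokens)) * 5)


def guardrail(candidate: str, threshold: float = 0.5) -> str:
    """Release the model output only if it passes the toxicity check."""
    if score_toxicity(candidate) >= threshold:
        return REFUSAL
    return candidate
```

In a real deployment the classifier would be a separately trained moderation model, and blocked outputs would typically be logged to feed the red-teaming and refinement loop described in item 3.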