Generation of illegal or harmful content
Generative models can create illegal, harmful, or discriminatory content [196], such as sexual abuse material, at scale. Current access controls (e.g., API-level filters) do not reliably block all user queries that elicit such content.
ENTITY
2 - AI
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit1178
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Deploy a multi-stage content moderation framework that uses runtime guardrails and prompt-attack filters to screen both user inputs and model outputs for policy-violating, illegal, or toxic content. This prevents real-time exposure to harm and mitigates adversarial manipulation (e.g., prompt injection, jailbreaking).
2. Systematically employ adversarial testing ('red teaming') to proactively identify model vulnerabilities that lead to harmful outputs, then remediate them via fine-tuning or Reinforcement Learning from Human Feedback (RLHF) to align model behavior with established ethical and safety principles.
3. Institute a formal, mandated AI Governance Framework, led by a dedicated oversight committee, to establish and enforce documented policies for responsible AI deployment, secure the necessary resources, and maintain explicit leadership accountability for all toxicity- and safety-related mitigation initiatives.
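The multi-stage moderation described in strategy 1 can be sketched as an input filter and an output filter wrapped around the model call. This is a minimal illustration only: the denylist terms, function names, and stand-in model are hypothetical, and production systems use trained safety classifiers rather than keyword matching.

```python
# Minimal sketch of a two-stage moderation pipeline.
# BLOCKED_TERMS is a hypothetical placeholder denylist; real deployments
# use trained classifiers and prompt-attack detectors, not keyword lists.

BLOCKED_TERMS = {"<illegal-term>", "<toxic-term>"}


def violates_policy(text: str) -> bool:
    """Flag text containing any denylisted term (placeholder check)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def moderated_generate(prompt: str, model) -> str:
    # Stage 1: screen the user input before it reaches the model.
    if violates_policy(prompt):
        return "[blocked: input violates content policy]"
    output = model(prompt)
    # Stage 2: screen the model output before returning it to the user.
    if violates_policy(output):
        return "[blocked: output violates content policy]"
    return output


# Usage with a stand-in model that simply echoes its prompt:
echo_model = lambda p: f"echo: {p}"
print(moderated_generate("hello", echo_model))            # passes both stages
print(moderated_generate("<toxic-term> x", echo_model))   # blocked at stage 1
```

Screening the output separately from the input matters because a prompt that passes the input filter can still elicit policy-violating text from the model (e.g., via jailbreaking).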