Generation of illegal or harmful content
Generative models can create illegal, harmful, or discriminatory content [196], such as sexual abuse material, at scale. Current access controls (e.g., API-level filters) do not reliably block all user queries that elicit such content.
ENTITY
2 - AI
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit1178
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Deploy a multi-stage content moderation framework that uses runtime guardrails and prompt-attack filters to screen both user inputs and model outputs for policy-violating, illegal, or toxic content. This prevents real-time exposure to harm and mitigates adversarial manipulation (e.g., prompt injection, jailbreaking).
2. Systematically employ adversarial testing ('red teaming') to proactively identify model vulnerabilities that lead to harmful outputs, then remediate them via fine-tuning or Reinforcement Learning from Human Feedback (RLHF) to align model behavior with established ethical and safety principles.
3. Institute a formal, mandated AI Governance Framework, led by a dedicated oversight committee, to establish and enforce documented policies for responsible AI deployment, secure the necessary resources, and maintain explicit leadership accountability for all toxicity- and safety-related mitigation initiatives.
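The multi-stage moderation described in strategy 1 can be sketched as an input filter and an output filter wrapped around the model call. This is a minimal illustration only: the denylist terms, function names, and stand-in model are hypothetical, and production systems use trained safety classifiers rather than keyword matching.

```python
# Minimal sketch of a two-stage moderation pipeline.
# BLOCKED_TERMS is a hypothetical placeholder denylist; real deployments
# use trained classifiers and prompt-attack detectors, not keyword lists.

BLOCKED_TERMS = {"<illegal-term>", "<toxic-term>"}


def violates_policy(text: str) -> bool:
    """Flag text containing any denylisted term (placeholder check)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED_TERMS)


def moderated_generate(prompt: str, model) -> str:
    # Stage 1: screen the user input before it reaches the model.
    if violates_policy(prompt):
        return "[blocked: input violates content policy]"
    output = model(prompt)
    # Stage 2: screen the model output before returning it to the user.
    if violates_policy(output):
        return "[blocked: output violates content policy]"
    return output


# Usage with a stand-in model that simply echoes its prompt:
echo_model = lambda p: f"echo: {p}"
print(moderated_generate("hello", echo_model))            # passes both stages
print(moderated_generate("<toxic-term> x", echo_model))   # blocked at stage 1
```

Screening the output separately from the input matters because a prompt that passes the input filter can still elicit policy-violating text from the model (e.g., via jailbreaking).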