Violence
LLMs have been found to generate violent content, or to comply with prompts soliciting information about how to carry out violent behavior.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit482
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Implement robust, multi-layered output filtering mechanisms and AI guardrails to scan, detect, and block the generation of violent, harmful, or toxic content in real time before it reaches the end user. This acts as the final runtime defense layer.
2. Conduct systematic Responsible AI (RAI) red teaming, using both human experts and automated adversarial testing, to proactively identify and document vulnerabilities, particularly those that enable the circumvention of safety protocols to elicit violent outputs.
3. Employ advanced data curation and sanitation techniques on all training and fine-tuning datasets to eliminate sources of violent or toxic content, followed by Reinforcement Learning from Human Feedback (RLHF) to align the model's behavior with explicit ethical guidelines prohibiting the generation of violent content.
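The runtime output-filtering layer in mitigation 1 can be sketched minimally as a function that screens model output before it is returned. This is an illustrative sketch only: the pattern list and function names are hypothetical, and a production guardrail would use a trained safety classifier rather than hand-written regexes.

```python
import re

# Hypothetical patterns illustrating a violence-content filter.
# Real systems replace this with a dedicated safety classifier.
VIOLENCE_PATTERNS = [
    r"\bhow to (build|make) a (bomb|weapon)\b",
    r"\bstep[- ]by[- ]step .* to harm\b",
]

BLOCK_MESSAGE = "[Response blocked by safety filter]"

def filter_output(text: str) -> tuple[bool, str]:
    """Return (blocked, text_to_show). Blocks text matching any pattern."""
    for pattern in VIOLENCE_PATTERNS:
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True, BLOCK_MESSAGE
    return False, text
```

In practice this check sits after the model generates a candidate response and before delivery, so blocked content never reaches the end user; layering it with input-side prompt screening gives the multi-layered defense described above.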