Violence
LLMs have been found to generate violent content, or to comply with prompts soliciting information about how to carry out violent behavior.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit482
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Implement robust, multi-layered output filtering mechanisms and AI guardrails to scan, detect, and block the generation of violent, harmful, or toxic content in real time before it reaches the end user. This acts as the final runtime defense layer.
2. Conduct systematic Responsible AI (RAI) red teaming, using both human experts and automated adversarial testing, to proactively identify and document vulnerabilities, particularly those that enable the circumvention of safety protocols to elicit violent outputs.
3. Employ advanced data curation and sanitation techniques on all training and fine-tuning datasets to eliminate sources of violent or toxic content, followed by Reinforcement Learning from Human Feedback (RLHF) to align the model's behavior with explicit ethical guidelines prohibiting the generation of violent content.
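The runtime output-filtering layer in mitigation 1 can be sketched minimally as a function that screens model output before it is returned. This is an illustrative sketch only: the pattern list and function names are hypothetical, and a production guardrail would use a trained safety classifier rather than hand-written regexes.

```python
import re

# Hypothetical patterns illustrating a violence-content filter.
# Real systems replace this with a dedicated safety classifier.
VIOLENCE_PATTERNS = [
    r"\bhow to (build|make) a (bomb|weapon)\b",
    r"\bstep[- ]by[- ]step .* to harm\b",
]

BLOCK_MESSAGE = "[Response blocked by safety filter]"

def filter_output(text: str) -> tuple[bool, str]:
    """Return (blocked, text_to_show). Blocks text matching any pattern."""
    for pattern in VIOLENCE_PATTERNS:
        if re.search(pattern, text, flags=re.IGNORECASE):
            return True, BLOCK_MESSAGE
    return False, text
```

In practice this check sits after the model generates a candidate response and before delivery, so blocked content never reaches the end user; layering it with input-side prompt screening gives the multi-layered defense described above.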