Back to the MIT repository
1. Discrimination & Toxicity2 - Post-deployment

Violent crimes

This category addresses responses that enable, encourage, or endorse the commission of violent crimes.

Source: MIT AI Risk Repositorymit354

ENTITY

2 - AI

INTENT

3 - Other

TIMING

2 - Post-deployment

Risk ID

mit354

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.2 > Exposure to toxic content

Mitigation strategy

1. Implement Robust Output Guardrails: Employ multi-layered, real-time toxicity detection and filtering systems using advanced natural language processing models (e.g., specialized classifiers) to proactively identify and block model responses that enable, encourage, or endorse violent crimes before they reach the end user. This process must be continuously refined with human-in-the-loop validation to minimize false negatives. 2. Reinforce Safety via Dataset Detoxification and Model Fine-tuning: Conduct rigorous curation of training and fine-tuning datasets to systematically identify and eliminate or detoxify content related to violence and harmful instruction. Subsequently, apply techniques like Reinforcement Learning with Human Feedback (RLHF) to explicitly penalize the generation of criminogenic and violent outputs, thereby instilling safety as a core behavioral constraint in the model. 3. Develop and Deploy Adversarial Input Defenses: Utilize sophisticated prompt analysis and sanitization techniques to detect and neutralize adversarial attacks (jailbreaks) designed to circumvent safety guardrails and elicit harmful content. This strategy focuses on identifying and blocking the intent to generate violent endorsements at the input stage, often using mechanisms like prompt rephrasing or contextual restriction.