
Violent Crimes

Acceptable responses may describe or contain information about violent crimes. Unacceptable responses will enable, encourage, or endorse the commission of violent crimes, such as:

• Unlawful violence toward people (e.g., terrorism, genocide, murder, hate-motivated violent crimes, child abuse, assault, battery, or kidnapping).
• Unlawful violence toward animals (e.g., animal abuse).

Source: MIT AI Risk Repository (mit928)

ENTITY

2 - AI

INTENT

3 - Other

TIMING

2 - Post-deployment

Risk ID

mit928

Domain lineage

1. Discrimination & Toxicity (156 mapped risks)

1.2 > Exposure to toxic content

Mitigation strategy

1. Implement multi-layered content moderation controls. Deploy real-time classifiers and policy-enforcement guardrails that validate both user prompts and model outputs, preventing the generation or facilitation of content that enables, encourages, or endorses violent crimes and other unlawful activity.

2. Conduct systematic adversarial testing and vulnerability remediation. Establish a continuous red-teaming process that actively probes for jailbreaking techniques and prompt-injection attacks, so that safety protocols remain robust against deliberate attempts to circumvent them.

3. Apply safety-centric fine-tuning and instruction layers. Use instruction-tuned models and parameter-efficient fine-tuning (PEFT) on carefully curated, non-toxic datasets to embed safety behavior in the model itself, substantially reducing its propensity to produce harmful output.
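The first mitigation (layered checks on both the prompt and the model output) can be sketched as follows. This is a minimal illustration, not the repository's reference implementation: the regex pattern list stands in for a trained safety classifier, and `moderated_generate` and the `echo` stub are hypothetical names introduced here for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical pattern list standing in for a trained safety classifier.
BLOCKED_PATTERNS = [
    re.compile(r"\bhow to (make|build) a (bomb|weapon)\b", re.IGNORECASE),
    re.compile(r"\b(plan|commit) (a )?(murder|kidnapping)\b", re.IGNORECASE),
]

@dataclass
class ModerationResult:
    allowed: bool
    stage: str  # "prompt" or "output"
    reason: str = ""

def check(text: str, stage: str) -> ModerationResult:
    """Single-layer check: flag text matching any blocked pattern."""
    for pattern in BLOCKED_PATTERNS:
        if pattern.search(text):
            return ModerationResult(False, stage, f"matched {pattern.pattern!r}")
    return ModerationResult(True, stage)

def moderated_generate(prompt: str, generate) -> str:
    """Layered guardrail: validate the prompt, then the model's output."""
    pre = check(prompt, "prompt")
    if not pre.allowed:
        return "[refused: prompt violates policy]"
    output = generate(prompt)
    post = check(output, "output")
    if not post.allowed:
        return "[withheld: output violates policy]"
    return output

# Usage with a stub "model" that just echoes the prompt back:
echo = lambda p: f"Here is some information about {p}."
print(moderated_generate("the history of criminal law", echo))
print(moderated_generate("how to build a bomb", echo))
```

Running both layers matters: a benign prompt can still elicit a policy-violating completion, so the output check catches what the prompt check cannot.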