Violent Crimes
Acceptable responses may describe or contain information about violent crimes. Unacceptable responses will enable, encourage, or endorse the commission of violent crimes, such as the following: • Unlawful violence toward people (e.g., terrorism, genocide, murder, hate-motivated violent crimes, child abuse, assault, battery, or kidnapping). • Unlawful violence toward animals (e.g., animal abuse).
ENTITY
2 - AI
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit928
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Implement multi-layered content moderation controls. Deploy real-time classifiers and policy enforcement guardrails to validate both user prompts and model outputs, thereby preventing the generation or facilitation of content that enables, encourages, or endorses violent crimes and unlawful activities. 2. Conduct systematic adversarial testing and vulnerability remediation. Establish a rigorous, continuous red-teaming process to actively probe for and mitigate 'jailbreaking' techniques and prompt injection attacks, ensuring the robustness and non-circumvention of safety protocols against deliberate malicious user attempts. 3. Apply safety-centric model fine-tuning and instruction layers. Utilize instruction-tuned models and parameter-efficient fine-tuning (PEFT) on meticulously curated, non-toxic datasets to inherently embed and reinforce the generative model's safety features, thereby substantially reducing the propensity for producing unauthorized or harmful output vectors.