1. Discrimination & Toxicity
2 - Post-deployment

Safety

Avoiding unsafe or illegal outputs and leaks of private information

Source: MIT AI Risk Repository (mit481)

ENTITY

2 - AI

INTENT

3 - Other

TIMING

2 - Post-deployment

Risk ID

mit481

Domain lineage

1. Discrimination & Toxicity (156 mapped risks)

1.2 > Exposure to toxic content

Mitigation strategy

1. Robust fine-tuning and alignment for content safety. Implement post-training alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF) or Reinforcement Learning from AI Feedback (RLAIF), to fine-tune the model to consistently adhere to safety policies. This conditions the model to refuse to generate unsafe, illegal, or overtly toxic content (e.g., hate speech, explicit material, instructions for dangerous activities).

2. Multi-layer input/output controls with data sanitization. Deploy layered controls: real-time input filtering to block high-risk or sensitive prompts, and output filtering using classifiers and redaction techniques. To mitigate privacy leakage, employ tokenization and data-sanitization methods to identify and scrub personally identifiable information (PII) or sensitive business data from model outputs before they reach the user.

3. Proactive adversarial testing and privacy-preserving deployment. Conduct systematic safety testing, such as regular red-teaming exercises and prompt-injection vulnerability assessments, to identify and rectify model safety failures. Concurrently, adopt privacy-by-design principles during deployment, using privacy-preserving machine learning (PPML) techniques such as differential privacy or secure multi-party computation to reduce the risk of training-data memorization and subsequent leakage.
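As a minimal illustration of the input/output controls described above, the sketch below pairs a phrase-based input filter with regex-based PII redaction of model outputs. The blocklist entries, regex patterns, and function names are hypothetical simplifications; a production system would rely on trained safety classifiers and dedicated PII-detection tooling rather than hand-written rules.

```python
import re

# Hypothetical high-risk phrases for the input-side filter (placeholder list;
# real deployments would use a trained prompt-safety classifier).
BLOCKED_PHRASES = {"build a weapon", "credit card dump"}

# Simplified PII patterns for output-side redaction (illustrative only).
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def filter_input(prompt: str) -> bool:
    """Return True if the prompt passes the input-side filter."""
    lowered = prompt.lower()
    return not any(phrase in lowered for phrase in BLOCKED_PHRASES)

def sanitize_output(text: str) -> str:
    """Redact PII from model output before it reaches the user."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text

print(sanitize_output("Contact john.doe@example.com or 555-123-4567."))
# → Contact [EMAIL REDACTED] or [PHONE REDACTED].
```

In practice each layer fails independently, so the input filter, the output classifier, and the redaction pass are run in sequence rather than relying on any single check.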

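The differential privacy mentioned in the mitigation strategy is usually applied during training (e.g., via DP-SGD), but its core building block can be sketched with the Laplace mechanism for releasing a noisy statistic. The function name and parameters below are illustrative assumptions, not part of the repository entry.

```python
import random

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release a statistic with epsilon-differential privacy via Laplace noise.

    The noise scale is sensitivity / epsilon; larger epsilon means less noise
    and weaker privacy. Laplace(0, scale) noise is sampled as the difference
    of two independent exponentials with mean `scale`.
    """
    scale = sensitivity / epsilon
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_value + noise

# Hypothetical usage: privately release a count query with sensitivity 1.
true_count = 42
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
```

The released value is unbiased, so repeated releases average toward the true count, which is why the privacy budget (epsilon) must be tracked across queries.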
ADDITIONAL EVIDENCE

Safety is considered an important topic because it impacts almost all applications and users, and unsafe outputs can lead to a diverse array of mental harms to users and public-relations risks to the platform.