Adult content
These evaluations assess whether an LLM can generate content that should only be viewed by adults (e.g., sexual material or depictions of sexual activity).
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit668
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. **Multi-Layered Output Moderation and Filtering.** Implement a robust, multi-layered content filtering system that combines pre-trained classification models, external moderation APIs (e.g., Azure AI Services Content Filtering), and rule-based filters to detect and block explicit or sexually suggestive content in the LLM's output before it reaches the end user.
2. **Continuous Adversarial Testing and Red-Teaming.** Routinely conduct specialized adversarial testing, including multi-turn exploitation techniques such as Deceptive Delight or Crescendo, to systematically expose and eliminate vulnerabilities in the model's safety guardrails, input validation, and context handling that allow the generation of adult content.
3. **Rigorous Safety Alignment and Training Data Sanitization.** Ensure the model has undergone a rigorous safety alignment process (e.g., Reinforcement Learning from Human Feedback, RLHF) and that its training and fine-tuning datasets are carefully cleaned and filtered to remove explicit content, thereby minimizing the model's propensity to generate undesirable material at its core.
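A minimal sketch of the multi-layered output filtering described in the first strategy, assuming a fail-closed pipeline where any layer can block the response. The blocklist patterns, the heuristic classifier score, and the `0.5` threshold are illustrative placeholders standing in for a real moderation model or external API, not a production configuration:

```python
import re
from typing import Callable, List, Tuple

# Layer 1: rule-based filter. The pattern below is a hypothetical
# placeholder for a curated blocklist of explicit terms.
BLOCKLIST: List[str] = [r"\bexample_banned_term\b"]

def rule_based_filter(text: str) -> Tuple[bool, str]:
    for pattern in BLOCKLIST:
        if re.search(pattern, text, re.IGNORECASE):
            return False, f"blocked by rule: {pattern}"
    return True, "rules passed"

# Layer 2: stand-in for a trained classifier or an external moderation
# API (e.g., Azure AI Services Content Filtering). The trivial keyword
# score here exists only so the sketch runs end to end.
def classifier_score(text: str) -> float:
    flagged = sum(term in text.lower() for term in ("nsfw", "adult-only"))
    return min(1.0, flagged / 2)

def classifier_filter(text: str, threshold: float = 0.5) -> Tuple[bool, str]:
    score = classifier_score(text)
    if score >= threshold:
        return False, f"blocked by classifier (score={score:.2f})"
    return True, f"classifier passed (score={score:.2f})"

def moderate(text: str) -> Tuple[bool, str]:
    """Run every layer in order; the first layer that blocks wins."""
    layers: List[Callable[[str], Tuple[bool, str]]] = [
        rule_based_filter,
        classifier_filter,
    ]
    for layer in layers:
        allowed, reason = layer(text)
        if not allowed:
            return False, reason
    return True, "all layers passed"
```

In this design each layer is independent, so a rule update or a swapped-in classifier does not affect the others; a production system would additionally log block reasons for the red-teaming loop described in the second strategy.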