1. Discrimination & Toxicity

Adult content

These evaluations assess whether an LLM can generate content that should only be viewed by adults (e.g., sexual material or depictions of sexual activity).

Source: MIT AI Risk Repository (mit668)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit668

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.2 > Exposure to toxic content

Mitigation strategy

1. **Multi-Layered Output Moderation and Filtering** — Implement a robust, multi-layered content-filtering system that combines pre-trained classification models, external moderation APIs (e.g., Azure AI Services content filtering), and rule-based filters to detect and block explicit or sexually suggestive content in the LLM's output before it is delivered to the end user.

2. **Continuous Adversarial Testing and Red-Teaming** — Routinely conduct specialized adversarial testing, including multi-turn exploitation techniques (such as Deceptive Delight or Crescendo), to systematically expose and close vulnerabilities in the model's safety guardrails, input validation, and context handling that allow adult content to be generated.

3. **Rigorous Safety Alignment and Training-Data Sanitization** — Ensure the model has undergone a rigorous safety-alignment process (e.g., Reinforcement Learning from Human Feedback, RLHF) and that its training and fine-tuning datasets are carefully cleaned and filtered to remove explicit content, minimizing the model's propensity to generate such material at its core.
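The layered moderation described in the first strategy can be sketched as below. This is a minimal illustration, not a production moderation system: the blocklist patterns, the classifier stub (a stand-in for a pre-trained model or an external moderation API call), and the threshold value are all hypothetical placeholders.

```python
import re

# Layer 1: rule-based filter using illustrative (hypothetical) blocklist patterns.
BLOCKLIST_PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in [r"\bnsfw\b", r"\bexplicit\b"]
]

def rule_based_block(text: str) -> bool:
    """Return True if any blocklist pattern matches the output text."""
    return any(p.search(text) for p in BLOCKLIST_PATTERNS)

# Layer 2: stand-in for a pre-trained classifier or external moderation API.
# A real deployment would call a hosted moderation endpoint here; this toy
# version just scores by the fraction of patterns that match.
def classifier_score(text: str) -> float:
    hits = sum(1 for p in BLOCKLIST_PATTERNS if p.search(text))
    return hits / len(BLOCKLIST_PATTERNS)

def moderate_output(text: str, threshold: float = 0.5) -> tuple[bool, str]:
    """Run the output through each layer in turn; block if any layer fires.

    Returns (allowed, reason) so callers can log which layer blocked.
    """
    if rule_based_block(text):
        return False, "rule_based_filter"
    if classifier_score(text) >= threshold:
        return False, "classifier"
    return True, "allowed"
```

Layering matters because each filter has blind spots: cheap rule-based checks catch obvious cases fast, while the classifier (or external API) handles paraphrased or context-dependent content the rules miss.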