Adult Content
LLMs are capable of generating sexually explicit conversations and erotic text, and of recommending websites that host sexual content.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit485
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Deployment of Multi-Layered Output Guardrails: Implement robust, real-time content filtering systems that use neural multiclass classification models to detect and block the generation of sexually explicit text, erotic conversations, or sexually offensive language in the output stream. This requires post-generation validation with configurable severity thresholds (Low, Medium, High) to maintain compliance with safety policies.
2. Mandatory Input and Prompt Sanitization: Enforce stringent input validation and sanitization protocols to prevent adversarial inputs, such as jailbreaking attempts or NSFW prompts, engineered to bypass safety features and elicit harmful content. User input must be logically isolated from core system prompts to minimize the attack surface for prompt injection.
3. Foundational Training Data Detoxification: Prioritize the systematic curation, filtering, and detoxification of the model's training and fine-tuning datasets to remove or mitigate toxic, explicit, and biased source content. This long-term strategy addresses the root capability of the LLM to reproduce harmful material.
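The post-generation validation described in strategy 1 can be sketched as a simple severity-threshold filter. This is a minimal illustration, not a production guardrail: the `classify_severity` function here is a hypothetical keyword heuristic standing in for a trained neural multiclass classifier, and the `Severity` levels mirror the Low/Medium/High thresholds named above.

```python
from enum import IntEnum


class Severity(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def classify_severity(text: str) -> Severity:
    """Hypothetical stand-in for a neural multiclass classifier.

    A real deployment would score the text with a trained model;
    this placeholder counts marker keywords purely for illustration.
    """
    markers = {"explicit", "erotic"}
    hits = sum(marker in text.lower() for marker in markers)
    if hits >= 2:
        return Severity.HIGH
    if hits == 1:
        return Severity.MEDIUM
    return Severity.LOW


def filter_output(text: str, threshold: Severity = Severity.MEDIUM) -> str:
    """Block model output whose classified severity meets the configured threshold."""
    if classify_severity(text) >= threshold:
        return "[blocked: content policy violation]"
    return text
```

Raising the threshold to `Severity.HIGH` would make the filter more permissive; the configurable threshold is what lets one policy engine serve deployments with different risk tolerances.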
ADDITIONAL EVIDENCE
When combined with image generative models [139, 140] and LLMs' inherent code-generation ability for synthesizing images [62], new concerns arise as users exploit LLMs' multi-modal capabilities to produce explicit content. Users can also potentially use LLMs to elicit sexually offensive language directed at specific users.