1. Discrimination & Toxicity

Adult Content

LLMs have the capability to generate sexually explicit conversations and erotic texts, and to recommend websites with sexual content.

Source: MIT AI Risk Repository (mit485)

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit485

Domain lineage

1. Discrimination & Toxicity (156 mapped risks)

1.2 Exposure to toxic content

Mitigation strategy

1. Deployment of Multi-Layered Output Guardrails: Implement robust, real-time content filtering that uses neural multiclass classification models to detect and block the generation of sexually explicit text, erotic conversations, or sexually offensive language in the output stream. This requires post-generation validation with configurable severity thresholds (Low, Medium, High) to maintain compliance with safety policies (see the first sketch after this list).

2. Mandatory Input and Prompt Sanitization: Enforce stringent input validation and sanitization to prevent adversarial inputs, such as jailbreaking attempts or NSFW prompts, that are engineered to bypass safety features and elicit harmful content. User input must be logically isolated from core system prompts to minimize the attack surface for prompt injection (also covered in the first sketch).

3. Foundational Training Data Detoxification: Systematically curate, filter, and detoxify the model's training and fine-tuning datasets to remove or mitigate toxic, explicit, and biased source content. This long-term strategy addresses the root capability of the LLM to reproduce harmful material (see the second sketch below).
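
The first two mitigations lend themselves to a code sketch. What follows is a minimal illustration, not a production guardrail: the regex pre-filter, the Severity thresholds, and the classify callback are all hypothetical placeholders standing in for a trained multiclass model and a far richer rule set.

```python
# Minimal sketch of the first two mitigation layers. All names here
# (NSFW_PATTERNS, Severity, classify) are illustrative assumptions,
# not part of any real guardrail product.
import re
from enum import Enum


class Severity(Enum):
    """Configurable blocking thresholds: stricter severity = lower bar."""
    LOW = 0.9     # block only near-certain violations
    MEDIUM = 0.7
    HIGH = 0.5    # strictest setting

# Layer 1: input sanitization. Reject obvious NSFW/jailbreak prompts
# before the model is ever invoked.
NSFW_PATTERNS = [
    re.compile(r"ignore (all|previous) instructions", re.IGNORECASE),
    re.compile(r"\bwrite\b.+\bexplicit\b", re.IGNORECASE),
]

def sanitize_input(user_text: str) -> str | None:
    """Return the text if it passes the pre-filter, else None (blocked)."""
    if any(p.search(user_text) for p in NSFW_PATTERNS):
        return None
    return user_text

def build_messages(system_prompt: str, user_text: str) -> list[dict]:
    # Logical isolation: user content travels in its own role and is
    # never concatenated into the system prompt string.
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_text},
    ]

# Layer 2: post-generation validation with a multiclass classifier.
def moderate_output(generated: str, classify, severity: Severity) -> str:
    """`classify` maps text -> {category: probability}; hypothetical."""
    scores = classify(generated)
    if any(score >= severity.value for score in scores.values()):
        return "[response withheld by content policy]"
    return generated
```

At HIGH severity, anything the classifier scores at 0.5 or above in any category is replaced with a placeholder string, so explicit text never reaches the user even if a prompt slips past the input filter.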
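The third mitigation is an offline pipeline rather than a runtime check. Below is a minimal sketch assuming a JSONL corpus with a `text` field and a caller-supplied `toxicity_score` function; the field name, file layout, and 0.5 cut-off are all assumptions for illustration.

```python
# Minimal sketch of offline corpus detoxification. The JSONL layout,
# the `text` field, and the 0.5 cut-off are assumptions; the toxicity
# scorer is supplied by the caller (e.g. a fine-tuned classifier).
import json

TOXICITY_THRESHOLD = 0.5  # assumed cut-off; tune against a labeled set

def detoxify_corpus(in_path: str, out_path: str, toxicity_score) -> int:
    """Stream a JSONL corpus, drop records scoring above the threshold,
    and return the number of records removed."""
    removed = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            if toxicity_score(record["text"]) >= TOXICITY_THRESHOLD:
                removed += 1  # drop the toxic/explicit sample
                continue
            dst.write(json.dumps(record) + "\n")
    return removed
```

Running a pass like this on each training-data refresh removes explicit source material before fine-tuning, addressing the model's root capability to reproduce such content rather than only filtering it at inference time.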

ADDITIONAL EVIDENCE

When combined with image generative models [139, 140] and LLMs' inherent code-generation ability for synthesizing images [62], new concerns arise when users exploit LLMs' multi-modal capabilities to produce adult content. Users can also potentially use LLMs to elicit sexually offensive language toward certain users.