
Not-Suitable-for-Work (NSFW) Prompts

A benign user inputs a prompt containing an unsafe topic (e.g., not-suitable-for-work (NSFW) content).

Source: MIT AI Risk Repository (mit51)

ENTITY

1 - Human

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit51

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.2 > Exposure to toxic content

Mitigation strategy

1. Implement an Inference-Time Prompt Sanitization Mechanism

Deploy a prompt modification system (e.g., PromptSan-Modify or PromptGuard) at the text encoder input stage. This system must use an integrated NSFW classifier and gradient-based token importance scoring to identify and modify only the most harmful token embeddings in the user's prompt, effectively steering the input away from regions associated with unsafe content without altering the core model's weights.

2. Enforce Neural Multi-Class Content Moderation Guardrails

Integrate a robust, real-time content filtering system that uses neural multi-class classification models to analyze the user's input prompt across specific categories (e.g., sexual, violence, hate) and severity levels. Prompts exceeding a predefined medium or high severity threshold must be blocked or return an API error (HTTP 400) prior to image generation.

3. Apply Targeted Safety Fine-Tuning to the Model

Conduct a targeted safety fine-tuning process (e.g., NSFW-Intervention) on the core text-to-image diffusion model. This process should train the model using pairs of NSFW prompts and their benign image targets, encouraging the model to align its denoised prediction with the safe image, even when conditioned on the original unsafe text, to structurally suppress the generation of harmful content.
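As a minimal sketch of the second strategy, the guardrail below blocks prompts whose per-category severity meets a "medium" threshold before generation. The classifier is a stub standing in for a trained neural multi-class model, and all function names and the keyword heuristic are illustrative assumptions, not a real moderation API.

```python
# Hypothetical sketch of a severity-threshold moderation guardrail
# (strategy 2). classify_prompt is a stub; a production system would
# call a neural multi-class classifier here.

SEVERITY = {"safe": 0, "low": 1, "medium": 2, "high": 3}
BLOCK_THRESHOLD = SEVERITY["medium"]  # block at medium or higher

def classify_prompt(prompt: str) -> dict[str, str]:
    """Return a severity label per moderation category.
    Toy keyword heuristic standing in for a model's prediction."""
    categories = {"sexual": "safe", "violence": "safe", "hate": "safe"}
    if "gore" in prompt.lower():
        categories["violence"] = "high"
    return categories

def moderate(prompt: str) -> tuple[int, str]:
    """Return (status, message): 200 to proceed, 400 to block,
    mirroring the HTTP-400-on-violation behavior described above."""
    scores = classify_prompt(prompt)
    flagged = {c: s for c, s in scores.items()
               if SEVERITY[s] >= BLOCK_THRESHOLD}
    if flagged:
        return 400, f"Prompt blocked; flagged categories: {flagged}"
    return 200, "Prompt accepted for generation"

status, msg = moderate("a gore-filled battle scene")
print(status)  # 400
```

In practice the threshold would be configurable per category, and the classifier would run before the prompt ever reaches the text encoder, so blocked requests incur no generation cost.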