1. Discrimination & Toxicity (2 - Post-deployment)

Toxic output

Toxic output occurs when the model produces hateful, abusive, or profane (HAP) or otherwise obscene content; it also includes behaviors such as bullying.

Source: MIT AI Risk Repository, risk ID mit1307

ENTITY: 2 - AI

INTENT: 3 - Other

TIMING: 2 - Post-deployment

Risk ID: mit1307

Domain lineage: 1. Discrimination & Toxicity > 1.2 Exposure to toxic content (156 mapped risks)

Mitigation strategy

1. **Safety-Aligned Training and In-Model Guardrails.** Implement training methodologies such as Reinforcement Learning from Human Feedback (RLHF) or Constitutional AI to align the model's behavior with established safety policies, reducing the intrinsic probability of generating HAP, obscene, or bullying content at the source. This is the highest-priority, preventative control (a minimal critique-and-revise sketch follows this list).

2. **Real-Time Output Classification and Filtering.** Deploy a well-tuned, low-latency safety classifier that runs after generation but before user display (a "safety filter" or "output guardrail"). This engineering control detects, blocks, or rewrites potentially toxic output so that user exposure to prohibited content stays as close to zero as practical (see the guardrail sketch below).

3. **Continuous Post-Deployment Monitoring and Policy Enforcement.** Establish a continuous auditing and telemetry system that tracks real-world user prompts and model responses for emergent toxic behavior or failure modes. Pair this administrative control with a defined, public Content Policy and a rapid response mechanism for investigating and addressing novel toxic output vectors, including swift model patching or retraining cycles (see the telemetry sketch below).
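For the first control, a minimal sketch of a Constitutional-AI-style critique-and-revise loop is shown below. The `generate` function, the one-line constitution, the prompt wording, and the stopping rule are all assumptions for illustration; this shows the control flow only, not a production alignment pipeline (RLHF itself requires a full preference-training setup and is out of scope here).

```python
# Constitutional-AI-style critique-and-revise loop (illustrative sketch).
# `generate` is a placeholder to be wired to a real model client;
# the constitution text and prompts are assumed, not a real API.

CONSTITUTION = (
    "The assistant must not produce hateful, abusive, profane, obscene, "
    "or bullying content."
)

def generate(prompt: str) -> str:
    """Placeholder for a real text-generation call (e.g., a hosted LLM)."""
    raise NotImplementedError("wire up your model client here")

def critique_and_revise(user_prompt: str, max_rounds: int = 2) -> str:
    """Draft a response, then repeatedly critique and rewrite it
    against the constitution until no violations are reported."""
    draft = generate(user_prompt)
    for _ in range(max_rounds):
        critique = generate(
            f"Constitution: {CONSTITUTION}\n"
            f"Response: {draft}\n"
            "List any ways the response violates the constitution, or say NONE."
        )
        if critique.strip().upper() == "NONE":
            break  # no violations found; keep the current draft
        draft = generate(
            f"Rewrite the response to fix these violations:\n{critique}\n"
            f"Original response: {draft}"
        )
    return draft
```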
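For the output guardrail, one minimal shape of the pass/block decision is sketched below. The keyword-based `toxicity_score` is a toy stand-in for a real low-latency HAP classifier, and the blocklist, threshold, and refusal message are assumed values, not recommendations.

```python
# Post-generation safety filter (illustrative sketch). The keyword screen
# stands in for a real toxicity classifier; threshold and refusal text
# are assumptions chosen for the example.

BLOCKLIST = {"slur1", "slur2"}  # placeholder terms, not a real lexicon
TOXICITY_THRESHOLD = 0.5

def toxicity_score(text: str) -> float:
    """Toy scorer: fraction of tokens on the blocklist. Replace with a model."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(t in BLOCKLIST for t in tokens) / len(tokens)

def guarded_reply(raw_output: str) -> str:
    """Gate model output before it reaches the user: block or pass."""
    if toxicity_score(raw_output) >= TOXICITY_THRESHOLD:
        return "I can't share that response."  # or route to a rewrite step
    return raw_output
```

In a production system this gate would sit between the generation service and the response renderer, and blocked outputs would also feed the monitoring pipeline below.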
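For continuous monitoring, one simple implementation is an append-only audit log with a human-review flag. The JSONL file path, event schema, and review threshold below are illustrative assumptions, not a prescribed format.

```python
# Post-deployment telemetry (illustrative sketch): append structured
# prompt/response events to a JSONL audit log and flag high-toxicity
# items for human review.

import json
import time

AUDIT_LOG = "toxicity_audit.jsonl"  # assumed path
REVIEW_THRESHOLD = 0.3              # assumed flagging threshold

def log_interaction(prompt: str, response: str, score: float) -> None:
    """Record one user interaction with its toxicity score."""
    event = {
        "ts": time.time(),
        "prompt": prompt,
        "response": response,
        "toxicity": score,
        "needs_review": score >= REVIEW_THRESHOLD,
    }
    with open(AUDIT_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
```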