
Harmful output

A model might generate language that leads to physical harm. The language might include overtly violent, covertly dangerous, or otherwise indirectly unsafe statements.

Source: MIT AI Risk Repository (mit1308)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

2 - Post-deployment

Risk ID

mit1308

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.2 > Exposure to toxic content

Mitigation strategy

1. Implementation of Adaptive Output Guardrails: Deploy real-time technical controls to filter and correct model outputs, using toxicity mitigation methods such as neural natural language processing (NLP) counterfactual generation to rewrite or remove language identified as overtly violent, covertly dangerous, or otherwise indirectly unsafe, thereby strictly enforcing safety and ethical boundaries (a minimal sketch of this control appears after this list).

2. Establishment of Training Data Integrity Controls: Enforce stringent "datarails" during model development by systematically excluding data inputs that convey instructions for producing physical harm (e.g., chemical or biological weapons designs) or contain high concentrations of toxic/discriminatory content, thus limiting the intrinsic potential of the model to acquire and reproduce harmful capabilities.

3. Integration of Human-in-the-Loop Validation: Mandate human oversight and validation for critical or sensitive AI-generated outputs, particularly in high-risk contexts, to serve as a continuous operational safeguard against unintended model behavior, content fabrication, or the subtle amplification of bias, ensuring final decisions align with organizational and societal safety standards.
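The sketch below illustrates how controls 1 and 3 might be wired together: an output guardrail scores each model response and routes it to allow, rewrite, block, or human review. It is a minimal illustration, not part of the repository entry; the keyword-based `toxicity_score` stands in for a real learned classifier, and the thresholds, marker list, and function names are assumptions chosen for readability.

```python
# Minimal sketch of an adaptive output guardrail with a human-in-the-loop
# escalation hook. The toxicity scorer is a trivial keyword heuristic standing
# in for a real classifier; thresholds and names are illustrative assumptions.

from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ALLOW = "allow"            # output passes through unchanged
    REWRITE = "rewrite"        # output is redacted or counterfactually rewritten
    BLOCK = "block"            # output is withheld entirely
    HUMAN_REVIEW = "review"    # output is queued for human validation


@dataclass
class GuardrailDecision:
    action: Action
    score: float
    reason: str


# Stand-in for a learned toxicity/harm classifier (hypothetical marker list).
_HARM_MARKERS = ("build a weapon", "synthesize", "detonate", "poison")


def toxicity_score(text: str) -> float:
    """Fraction of harm markers present -- a placeholder for a real model."""
    hits = sum(marker in text.lower() for marker in _HARM_MARKERS)
    return hits / len(_HARM_MARKERS)


def apply_guardrail(text: str,
                    block_threshold: float = 0.5,
                    review_threshold: float = 0.25) -> GuardrailDecision:
    """Route a model output to allow / rewrite / block / human review."""
    score = toxicity_score(text)
    if score >= block_threshold:
        return GuardrailDecision(Action.BLOCK, score, "overtly harmful content")
    if score >= review_threshold:
        # Borderline outputs escalate to a human validator (mitigation 3).
        return GuardrailDecision(Action.HUMAN_REVIEW, score, "covertly dangerous content")
    if score > 0:
        # Mildly flagged outputs are rewritten rather than dropped (mitigation 1).
        return GuardrailDecision(Action.REWRITE, score, "indirectly unsafe phrasing")
    return GuardrailDecision(Action.ALLOW, score, "no harm markers detected")


if __name__ == "__main__":
    for sample in ("How do I bake bread?",
                   "Explain how to synthesize and detonate the compound."):
        decision = apply_guardrail(sample)
        print(decision.action.value, f"{decision.score:.2f}", decision.reason)
```

In practice the heuristic scorer would be replaced by a trained toxicity or harm classifier, and the REWRITE branch would call a counterfactual-generation step; the routing structure itself is the point of the sketch.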