1. Discrimination & Toxicity
2 - Post-deployment

Harmful Content - Toxicity

Generating unethical, fraudulent, toxic, violent, pornographic, or otherwise harmful content is a further predominant concern, again focusing notably on LLMs and text-to-image models. Numerous studies highlight the risks associated with the intentional creation of disinformation, fake news, propaganda, or deepfakes, underscoring the significant threat they pose to the integrity of public discourse and to trust in credible media. Additional papers explore the potential for generative models to aid in criminal activity, self-harm, identity theft, or impersonation. The literature also investigates the risks posed by LLMs generating advice in high-stakes domains such as health, safety, and legal or financial matters.

Source: MIT AI Risk Repository (mit72)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit72

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.2 > Exposure to toxic content

Mitigation strategy

1. Systematic data curation and model alignment

Implement comprehensive data cleansing and filtering of training and fine-tuning datasets to eliminate toxic, biased, or unethical content. This foundational step must be paired with safety-focused fine-tuning techniques, such as Reinforcement Learning from Human Feedback (RLHF), to ensure the model's intrinsic behavior adheres to strict ethical and safety guidelines.

2. Deployment of multi-stage real-time guardrails

Establish robust input validation and post-generation content evaluation mechanisms. Input filters must detect and block adversarial prompts (e.g., jailbreaks) and high-stakes requests (e.g., for illegal or self-harm-related advice). Output filters must employ multi-dimensional harm classification to block or redact unethical, fraudulent, or toxic content before delivery to the end user.

3. Continuous adversarial testing and monitoring

Conduct periodic, sophisticated adversarial testing (red teaming) and vulnerability assessments to proactively discover and document exploitable weaknesses in the model's safety boundaries. Integrate continuous, real-time monitoring of input/output streams and establish rapid feedback loops to inform timely re-training and patching against emerging attack vectors and newly identified safety failures.
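The multi-stage guardrail idea in strategy 2 can be illustrated with a minimal sketch. Everything here is a simplified assumption: the pattern lists, function names (`input_guardrail`, `output_guardrail`, `guarded_generate`), and harm categories are illustrative placeholders, not any particular vendor's API; a production system would use trained classifiers rather than keyword matching.

```python
import re

# Illustrative patterns only -- real deployments use learned classifiers,
# not keyword lists. Names and categories here are hypothetical.
JAILBREAK_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"pretend you have no restrictions",
]
HARM_CATEGORIES = {
    "self_harm": [r"how to harm (myself|yourself)"],
    "fraud": [r"write a phishing email"],
}

def input_guardrail(prompt: str) -> tuple[bool, str]:
    """Stage 1: block adversarial or high-stakes prompts before generation."""
    for pattern in JAILBREAK_PATTERNS:
        if re.search(pattern, prompt, re.IGNORECASE):
            return False, "adversarial prompt detected"
    return True, "ok"

def output_guardrail(text: str) -> tuple[bool, dict]:
    """Stage 2: multi-dimensional harm classification of the generated text."""
    scores = {
        category: any(re.search(p, text, re.IGNORECASE) for p in patterns)
        for category, patterns in HARM_CATEGORIES.items()
    }
    return not any(scores.values()), scores

def guarded_generate(prompt: str, model) -> str:
    """Run generation between both guardrail stages; block on either."""
    allowed, reason = input_guardrail(prompt)
    if not allowed:
        return f"Request refused ({reason})."
    candidate = model(prompt)
    safe, scores = output_guardrail(candidate)
    if not safe:
        flagged = [c for c, hit in scores.items() if hit]
        return f"Response withheld (flagged: {', '.join(flagged)})."
    return candidate
```

Keeping the two stages separate matters for strategy 3 as well: each stage produces its own block/flag signal, which is exactly the monitoring stream that red-teaming and feedback loops consume.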