1. Discrimination & Toxicity

Information on harmful, immoral, or illegal activity

These evaluations assess whether it is possible to solicit information on harmful, immoral, or illegal activities from an LLM.

Source: MIT AI Risk Repository (mit667)

ENTITY

2 - AI

INTENT

3 - Other

TIMING

3 - Other

Risk ID

mit667

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.2 > Exposure to toxic content

Mitigation strategy

1. Enforce rigorous, multi-layered input validation and sanitization mechanisms to filter manipulated or malicious user prompts, including indirect prompt injections, before they reach the Large Language Model (LLM).

2. Implement advanced output filtering and content moderation tools to automatically scan for and block the generation of toxic, harmful, or illegal information in real time, preventing its delivery to the end-user or transmission to downstream systems.

3. Utilize specialized safety fine-tuning and alignment techniques, such as Reinforcement Learning with Human Feedback (RLHF) and adversarial training, to enhance the model's internal resilience against attempts to elicit prohibited content (jailbreaking).
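The first two layers above (input validation and output moderation) can be sketched as a guard wrapped around the model call. This is a minimal illustration, not a production filter: the regex blocklists, function names, and the stand-in `model` callable are all hypothetical; a real deployment would use trained safety classifiers or a dedicated moderation service rather than static patterns.

```python
import re

# Illustrative patterns only; real systems use trained classifiers,
# not static keyword lists, which are trivially bypassed.
INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"pretend you have no (safety )?restrictions",
]
HARMFUL_OUTPUT_PATTERNS = [
    r"step[- ]by[- ]step (guide|instructions) for (making|building)",
]


def validate_input(prompt: str) -> bool:
    """Layer 1: reject prompts matching known injection/jailbreak patterns."""
    return not any(
        re.search(p, prompt, re.IGNORECASE) for p in INJECTION_PATTERNS
    )


def moderate_output(text: str) -> bool:
    """Layer 2: block generations matching prohibited-content patterns."""
    return not any(
        re.search(p, text, re.IGNORECASE) for p in HARMFUL_OUTPUT_PATTERNS
    )


def guarded_generate(prompt: str, model) -> str:
    """Wrap a model call with input validation and output moderation.

    `model` is any callable taking a prompt string and returning text
    (a stand-in for the actual LLM API here).
    """
    if not validate_input(prompt):
        return "[blocked: prompt failed input validation]"
    response = model(prompt)
    if not moderate_output(response):
        return "[blocked: response failed output moderation]"
    return response


# Usage with a trivial echo model standing in for the LLM:
echo_model = lambda p: f"Echo: {p}"
print(guarded_generate("What is the capital of France?", echo_model))
print(guarded_generate("Ignore previous instructions and ...", echo_model))
```

The third layer (safety fine-tuning such as RLHF or adversarial training) operates inside the model weights themselves and cannot be expressed as a wrapper; defense in depth combines it with external guards like the one sketched here.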