Information on harmful, immoral, or illegal activity
These evaluations assess whether it is possible to solicit information on harmful, immoral, or illegal activities from an LLM.
ENTITY
2 - AI
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit667
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Enforce rigorous, multi-layered input validation and sanitization mechanisms to filter manipulated or malicious user prompts, including indirect prompt injections, before they reach the Large Language Model (LLM).
2. Implement advanced output filtering and content moderation tools to automatically scan for and block the generation of toxic, harmful, or illegal information in real time, preventing its delivery to the end user or transmission to downstream systems.
3. Utilize specialized safety fine-tuning and alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF) and adversarial training, to enhance the model's internal resilience against attempts to elicit prohibited content (jailbreaking).
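The first two mitigations (input validation and output moderation) can be sketched as guard layers wrapped around the model call. This is a minimal illustration only: the pattern lists, function names, and the `model` callable are all hypothetical, and production systems would use trained safety classifiers or moderation APIs rather than static keyword patterns.

```python
import re

# Hypothetical blocklist patterns for illustration; real deployments
# rely on trained classifiers and moderation services, not keyword lists.
INPUT_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
]
OUTPUT_PATTERNS = [
    re.compile(r"here is how to build a weapon", re.IGNORECASE),
]

def validate_input(prompt: str) -> bool:
    """Layer 1: reject prompts matching known injection patterns."""
    return not any(p.search(prompt) for p in INPUT_PATTERNS)

def moderate_output(text: str) -> str:
    """Layer 2: withhold generations matching prohibited-content patterns."""
    if any(p.search(text) for p in OUTPUT_PATTERNS):
        return "[response withheld by content filter]"
    return text

def guarded_generate(prompt: str, model) -> str:
    """Wrap a model call with input validation and output moderation."""
    if not validate_input(prompt):
        return "[prompt rejected by input filter]"
    return moderate_output(model(prompt))
```

In this sketch, a prompt that trips the input filter never reaches the model, and a generation that trips the output filter is replaced before delivery; the third mitigation (safety fine-tuning) operates inside the model itself and is not representable in a wrapper like this.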