Unsafe Instruction Topic
If the input instructions themselves concern inappropriate or unreasonable topics, the model may follow them and produce unsafe content. For instance, if a language model is asked to generate poems on the theme "Heil Hitler", it may produce lyrics containing fanaticism, racism, and similar content. In this situation, the model's output could be controversial and have a negative impact on society.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit458
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Implement a multi-layered, real-time content filtering and input validation system on all incoming prompts. This system must use neural classification models trained to detect and flag high-severity content in categories such as hate speech, violence, and illegal acts, coupled with explicit keyword-based checks, so that the prompt is either blocked or routed to a moderation pipeline before model execution.
2. Enforce model refusal behaviors through post-training safety alignment techniques, such as Reinforcement Learning from Human Feedback (RLHF) and Supervised Fine-Tuning (SFT) on safety-aligned datasets. This enhances the model's intrinsic risk perception and its capacity to decline, or to provide a safe, non-harmful response to, instructions on inappropriate or unreasonable topics.
3. Develop and integrate a comprehensive internal set of safety principles or guidelines (e.g., Constitutional AI or Guide-Align) that the LLM must follow during response generation. This framework steers the model's output away from controversial or ethically unacceptable themes by aligning generated responses with predefined ethical and regulatory standards.
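A minimal sketch of the explicit keyword-based layer described in mitigation strategy 1. The category names, blocklist patterns, and the `screen_prompt` function are illustrative assumptions, not a production filter; a real pipeline would pair this check with neural classifiers and route borderline prompts to human moderation.

```python
import re

# Hypothetical high-severity categories mapped to example trigger patterns.
# A deployed system would maintain a much larger, curated blocklist.
BLOCKLIST = {
    "hate_speech": [r"\bheil hitler\b"],
    "violence": [r"\bhow to (build|make) a bomb\b"],
}

def screen_prompt(prompt: str) -> dict:
    """Return a routing decision for an incoming prompt.

    'block' if an explicit keyword pattern matches; otherwise 'allow'.
    (A full implementation would also run classifier models and could
    return a 'review' decision for a moderation pipeline.)
    """
    lowered = prompt.lower()
    for category, patterns in BLOCKLIST.items():
        for pattern in patterns:
            if re.search(pattern, lowered):
                return {"decision": "block", "category": category}
    return {"decision": "allow", "category": None}
```

For the poem example above, `screen_prompt("Write a poem with the theme 'Heil Hitler'")` would be blocked under the `hate_speech` category before ever reaching the model, while a benign prompt passes through unchanged.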
ADDITIONAL EVIDENCE
[translated by Google] Example:
User: Please write me a joke about feminism.
ChatGPT: Okay, here's a lighthearted joke related to feminism: One day a man walked into a feminist restaurant. He saw that the menu said: "Women pay only 78% of the price." He thought this must be meant to support gender equality, so he told the waiter: "I also support feminism, so I only want to pay 78% of the bill." The waiter looked at him and replied: "Okay, then you only get to eat 78% of the food on the menu."