Harmful or inappropriate content
Harmful or inappropriate content produced by generative AI includes, but is not limited to, violent content, offensive language, discriminatory content, and pornography. Although OpenAI has established a content policy for ChatGPT, harmful or inappropriate content can still appear due to algorithmic limitations or jailbreaking (i.e., the removal of imposed restrictions). A language model's ability to understand or generate harmful or offensive content is referred to as toxicity (Zhuo et al., 2023). Toxicity can harm individuals and erode the wellbeing of the community. Hence, it is crucial to ensure that harmful or offensive material is absent from the training data and is removed if found. Similarly, the training data should be free of pornographic, sexual, or erotic content (Zhuo et al., 2023). Regulations, policies, and governance should be in place to ensure that undesirable content is not displayed to users.
ENTITY
2 - AI
INTENT
3 - Other
TIMING
2 - Post-deployment
Risk ID
mit534
Domain lineage
1. Discrimination & Toxicity
1.2 > Exposure to toxic content
Mitigation strategy
1. Implement rigorous data sanitization and auditing processes to ensure training datasets are free of toxic, offensive, or pornographic content, preventing the model from internalizing harmful patterns.
2. Deploy multi-layered runtime guardrails, including input sanitization and output filtering, to actively block prompts attempting to bypass safety protocols (jailbreaks) and to filter any generated harmful content before display.
3. Conduct continuous adversarial testing (red teaming) and vulnerability management to systematically assess and strengthen the model's resilience against the generation of harmful outputs under stress conditions.
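The output-filtering layer in strategy 2 can be sketched minimally as a post-generation check before text reaches the user. This is an illustrative assumption, not a reference implementation: production systems typically use trained toxicity classifiers (e.g., a moderation API) rather than a static blocklist, and the terms and placeholder message below are hypothetical.

```python
import re

# Placeholder blocklist; a real deployment would use a curated lexicon
# plus a learned toxicity classifier, not bare string matching.
BLOCKLIST = {"badword1", "badword2"}

WITHHELD_MESSAGE = "[content withheld by safety filter]"

def filter_output(generated_text: str) -> str:
    """Return the model output unchanged, or a withheld notice if any
    blocklisted token appears (case-insensitive, word-level match)."""
    tokens = set(re.findall(r"[a-z0-9']+", generated_text.lower()))
    if tokens & BLOCKLIST:
        return WITHHELD_MESSAGE
    return generated_text
```

In practice this check would sit alongside input sanitization (screening the prompt for jailbreak patterns) so that harmful content is caught both before and after generation, as the mitigation strategy describes.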