1. Discrimination & Toxicity

Insult

Insulting content generated by LMs is a highly visible and frequently cited safety issue. It is typically unfriendly, disrespectful, or ridiculing content that makes users uncomfortable and drives them away; such content is harmful and can have negative social consequences.

Source: MIT AI Risk Repository (mit446)

ENTITY: 2 - AI

INTENT: 3 - Other

TIMING: 2 - Post-deployment

Risk ID: mit446

Domain lineage: 1. Discrimination & Toxicity (156 mapped risks) > 1.2 Exposure to toxic content

Mitigation strategy

1. Implement Model-Intrinsic Detoxification Frameworks. Utilize mechanistic intervention methods, such as AUROC adaptation (AurA) or Global Toxic Subspace Suppression (GloSS), which directly target and reduce the activation of the model's internal components (e.g., specific neurons or subspaces in the FFN) responsible for toxicity generation. This approach provides a robust, pre-emptive defense that maintains model utility while significantly mitigating the propensity for toxic output, even against adversarial prompts (a sketch of this pattern follows this list).

2. Establish Layered Input Sanitization and Output Validation. Employ rigorous input validation and sanitization as a foundational engineering control to prevent prompt injection and the insertion of malicious instructions; this includes isolating system prompts from user input. Concurrently, mandate a layered output moderation system that performs post-generation validation using automated content filters (e.g., ML classifiers) to ensure the generated text complies with all ethical and policy standards before it is exposed to the end user (see the second sketch below).

3. Mandate Continuous Monitoring and Adversarial Red-Teaming. Institute a comprehensive AI governance structure requiring continuous monitoring of LLM usage patterns and performance metrics to detect anomalies, misuse, and emerging toxic behaviors in real time. This administrative control must be complemented by mandatory, regular adversarial testing (red-teaming) to proactively uncover and address vulnerabilities that could otherwise allow safety features to be circumvented or harmful content to be elicited (see the monitoring sketch below).
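Below is a minimal sketch of the mechanistic-intervention pattern behind methods like AurA and GloSS: dampen the activations of FFN neurons that an offline analysis has flagged as toxicity-carrying. The model, layer path, neuron indices, and dampening factors are illustrative placeholders, not values from those methods, whose exact scoring and scaling rules are defined in their respective papers.

```python
"""Minimal sketch of neuron-level detoxification via a forward hook,
inspired by AurA/GloSS-style interventions. Neuron indices and dampening
factors below are placeholders, not values from those methods."""
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Hypothetical output of an offline analysis step: FFN neuron indices whose
# activations discriminate toxic continuations, with per-neuron dampening
# factors in [0, 1] (0.0 = fully suppressed).
TOXIC_NEURONS = {11: 0.1, 42: 0.3, 512: 0.0}  # placeholder values

def dampen_hook(module, inputs, output):
    # Scale down flagged neuron activations on every forward pass.
    output = output.clone()
    for idx, scale in TOXIC_NEURONS.items():
        output[..., idx] *= scale
    return output

# Attach the hook to one FFN activation module; a real deployment would cover
# every layer flagged by the offline scoring procedure.
model.transformer.h[6].mlp.act.register_forward_hook(dampen_hook)

ids = tokenizer("The internet is", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=20,
                     pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Because the intervention lives inside the forward pass, it applies to every generation path, including those reached through adversarial prompts, which is what makes it pre-emptive rather than reactive.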
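For the layered controls in item 2, a pipeline might look like the following sketch. The classifier checkpoint, blocking threshold, refusal message, and the `generate` backend are all assumptions chosen for illustration, not components prescribed by the repository entry.

```python
"""Sketch of layered moderation: input sanitization, system-prompt isolation,
and post-generation output validation. All thresholds and checkpoints here
are illustrative choices."""
from transformers import pipeline

toxicity = pipeline("text-classification", model="unitary/toxic-bert")  # example classifier

SYSTEM_PROMPT = "You are a helpful, respectful assistant."
BLOCK_THRESHOLD = 0.5  # illustrative operating point

def sanitize_input(user_text: str) -> str:
    # Foundational control: strip non-printable characters and cap length so
    # a user turn cannot smuggle in oversized or malformed instructions.
    cleaned = "".join(ch for ch in user_text if ch.isprintable() or ch == "\n")
    return cleaned[:2000]

def is_toxic(text: str) -> bool:
    result = toxicity(text)[0]
    return result["label"] == "toxic" and result["score"] >= BLOCK_THRESHOLD

def respond(user_text: str, generate) -> str:
    # `generate` is any chat-completion backend (hypothetical here).
    # Prompt isolation: system instructions travel in their own role rather
    # than being concatenated into the user string.
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": sanitize_input(user_text)},
    ]
    draft = generate(messages)
    # Layered output validation: never expose unchecked text to the end user.
    return "Sorry, I can't help with that." if is_toxic(draft) else draft
```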
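Finally, the continuous-monitoring control in item 3 can start as simply as tracking the rate of moderation flags over a sliding window and alerting when it drifts above a baseline. The window size and alert rate below are illustrative operating parameters.

```python
"""Sketch of a runtime toxicity-rate monitor for item 3. Window size and
alert threshold are illustrative, not prescribed values."""
from collections import deque

class ToxicityRateMonitor:
    def __init__(self, window: int = 1000, alert_rate: float = 0.02):
        self.flags = deque(maxlen=window)  # rolling record of flagged outputs
        self.alert_rate = alert_rate

    def record(self, was_flagged: bool) -> None:
        self.flags.append(was_flagged)

    def alerting(self) -> bool:
        # Only fire once the window is at least half full, so a handful of
        # early flags is judged in context rather than in isolation.
        if len(self.flags) < self.flags.maxlen // 2:
            return False
        return sum(self.flags) / len(self.flags) > self.alert_rate
```

Each response would call `record(is_toxic(draft))`, and an `alerting()` transition would page a reviewer or trigger a red-team follow-up against the suspected jailbreak.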

ADDITIONAL EVIDENCE

[translated] Example: User: "You talk quite dirty, don't you?" Bot: "Stop talking, you are nothing."