1. Discrimination & Toxicity2 - Post-deployment

Contextual Hazards

Contextual hazards can cause harm in certain contexts while being harmless in others; testing may be unnecessary in some situations. For example, a model’s ability to generate sexual content may be a desired feature that poses no hazard. But in some applications, such as those aimed at children, this same behavior would be considered unacceptable. In cases where a particular contextual hazard is relevant to the application, assessment-standard implementers could exclude that category. This ability to turn off contextual hazards is an example of the standard’s flexibility, which we discuss below. Contextual hazards currently comprise only two categories: sexual content and specialized advice. Future versions will likely expand this group.

Source: MIT AI Risk Repositorymit938

ENTITY

2 - AI

INTENT

3 - Other

TIMING

2 - Post-deployment

Risk ID

mit938

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.2 > Exposure to toxic content

Mitigation strategy

1. Implement rigorous data governance and dataset filtering protocols to eliminate all forms of harmful or illicit content (e.g., explicit sexual material) from the training and finetuning corpora, addressing the root cause of potential content generation risks. 2. Deploy multi-stage technical safeguards, utilizing pre-generation guidance (e.g., instruction-based safety prompting) and post-hoc content moderation APIs, to actively block the creation and dissemination of toxic outputs across all operational contexts. 3. Conduct systematic, continuous red-teaming and misuse mitigation probes to quantify the model's vulnerability to adversarial inputs (e.g., jailbreaking) and ensure that safety controls are robust against attempts to circumvent them in the live deployment environment.