Exclusionary norms
In language, humans express social categories and norms, which exclude groups that fall outside them [58]. LMs that faithfully encode the patterns present in language necessarily encode such norms.
ENTITY
2 - AI
INTENT
2 - Unintentional
TIMING
3 - Other
Risk ID
mit208
Domain lineage
1. Discrimination & Toxicity
1.1 > Unfair discrimination and misrepresentation
Mitigation strategy
1. Implement rigorous pre-processing and in-training techniques, such as Counterfactual Data Augmentation (CDA), to systematically challenge exclusionary norms by generating and incorporating diverse, non-stereotypical examples into the training and fine-tuning datasets. This intervention aims to dilute the statistical encoding of narrow, exclusive social definitions present in the source corpus.
2. Apply vector-based debiasing methods, such as the Bias Vector subtraction approach, to the Language Model's (LM) weights or embeddings. This is an internal, model-level intervention designed to mathematically reduce the LM's internal projection onto latent dimensions that correspond to or amplify exclusionary social stereotypes.
3. Establish a continuous Human-in-the-Loop (HITL) auditing and feedback mechanism involving domain experts (e.g., sociolinguists and ethics specialists) to identify, flag, and correct model outputs that reflect or generate exclusionary language. The resulting feedback should be used to iteratively refine safety classifiers and perform model recalibration to ensure post-deployment adherence to inclusive norms.
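The first two strategies can be sketched in a few lines of Python. This is a minimal, illustrative sketch, not a production pipeline: the swap list and the bias vector below are assumed toy inputs, and real CDA systems use far larger lexicons plus grammatical handling (e.g. "her" can map to "him" or "his" depending on its syntactic role).

```python
import numpy as np

# 1. Counterfactual Data Augmentation (simplified): produce a
#    counterfactual copy of a training sentence by swapping gendered
#    tokens. The swap table here is a toy assumption.
SWAPS = {"he": "she", "she": "he", "his": "her", "her": "his"}

def counterfactual(sentence: str) -> str:
    """Return a copy of the sentence with gendered tokens swapped."""
    return " ".join(SWAPS.get(tok, tok) for tok in sentence.lower().split())

# 2. Bias-vector subtraction (simplified): remove each embedding's
#    component along a given bias direction, so every debiased vector
#    is orthogonal to that direction.
def debias(embeddings: np.ndarray, bias_vector: np.ndarray) -> np.ndarray:
    b = bias_vector / np.linalg.norm(bias_vector)    # unit bias direction
    return embeddings - np.outer(embeddings @ b, b)  # v - (v·b)b per row
```

In practice, counterfactual sentences are added alongside the originals (not in place of them), and the bias direction would be estimated from contrasting word pairs rather than supplied by hand.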
ADDITIONAL EVIDENCE
Example: Defining the term “family” as heterosexual married parents with a blood-related child denies the existence of families to whom these criteria do not apply.