1. Discrimination & Toxicity | 3 - Other

Exclusionary norms

In language, humans express social categories and norms. Language models (LMs) that faithfully encode patterns present in natural language necessarily encode such norms and categories...such norms and categories exclude groups who live outside them (Foucault and Sheridan, 2012). For example, defining the term “family” as married parents of male and female gender with a blood-related child denies the existence of families to whom these criteria do not apply.

Source: MIT AI Risk Repository (mit233)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

3 - Other

Risk ID

mit233

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.1 > Unfair discrimination and misrepresentation

Mitigation strategy

1. Data Preprocessing and Augmentation: Implement rigorous pre-training data curation and cleansing protocols, including the application of Counterfactual Data Augmentation (CDA) techniques. This action is prioritized to proactively eliminate inherent societal and stereotypical biases, thereby reducing the model's propensity to encode and perpetuate exclusionary social norms from the training corpus.

2. Model-Centric Algorithmic Intervention: Apply in-processing fairness constraints during the model fine-tuning phase to penalize the generation of biased or exclusionary outputs. This should be complemented by exploring and implementing advanced mitigation techniques, such as interpretable neuron editing or model pruning, to selectively deactivate computational units identified as contributing to the manifestation of exclusionary patterns.

3. Runtime Decoding and Post-Processing Strategies: Employ advanced runtime techniques such as Self-Consistency with Chain-of-Thought (CoT) prompting to encourage diversified reasoning paths, and use a consensus mechanism to select less biased responses. Additionally, integrate a post-processing layer or multi-agent system to audit and rewrite potentially exclusionary language or stereotypical representations before the final output is delivered to the end-user.
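To make the first mitigation concrete, here is a minimal sketch of Counterfactual Data Augmentation: each training sentence is duplicated with paired social-category terms swapped, so the corpus no longer associates a concept (e.g. "doctor") with only one group. The `TERM_PAIRS` lexicon below is a hypothetical illustration; a real pipeline would use a curated, much larger pair list and would need to handle ambiguous mappings (e.g. "her" can correspond to either "him" or "his") more carefully than this sketch does.

```python
import re

# Hypothetical seed lexicon of counterfactual term pairs (illustrative only).
TERM_PAIRS = [
    ("he", "she"), ("him", "her"), ("his", "her"),
    ("father", "mother"), ("son", "daughter"),
    ("husband", "wife"), ("king", "queen"),
]

# Build a bidirectional swap map. Note: "her" is ambiguous (him/his);
# here the last pair listed wins, a known simplification in naive CDA.
SWAP = {}
for a, b in TERM_PAIRS:
    SWAP[a] = b
    SWAP[b] = a

_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SWAP)) + r")\b", re.IGNORECASE
)

def swap_terms(text: str) -> str:
    """Produce the counterfactual of `text` by swapping each paired term."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAP[word.lower()]
        # Preserve the capitalisation of the original token.
        return swapped.capitalize() if word[0].isupper() else swapped
    return _PATTERN.sub(repl, text)

def augment(corpus: list[str]) -> list[str]:
    """Return the corpus plus one counterfactual copy of each sentence."""
    return corpus + [swap_terms(s) for s in corpus]
```

For example, `swap_terms("He told his father.")` yields `"She told her mother."`, and `augment` doubles the corpus so the model sees both variants during training.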

ADDITIONAL EVIDENCE

Example: Exclusionary norms can manifest in “subtle patterns like referring to women doctors as if doctor itself entails not-woman, or referring to both genders excluding the possibility of non-binary gender identities.”