1. Discrimination & Toxicity

Social stereotypes and unfair discrimination

Perpetuating harmful stereotypes and discrimination is a well-documented harm in machine learning models that represent natural language (Caliskan et al., 2017). LMs that encode discriminatory language or social stereotypes can cause different types of harm... Unfair discrimination manifests in differential treatment or access to resources among individuals or groups based on sensitive traits such as sex, religion, gender, sexual orientation, ability and age.

Source: MIT AI Risk Repository (mit232)

ENTITY

2 - AI

INTENT

2 - Unintentional

TIMING

3 - Other

Risk ID

mit232

Domain lineage

1. Discrimination & Toxicity

156 mapped risks

1.1 > Unfair discrimination and misrepresentation

Mitigation strategy

1. Systematically audit and curate training datasets to identify and reduce the prevalence of social stereotypes and discriminatory language, applying pre-processing debiasing techniques to promote equitable representations.

2. Implement rigorous, ongoing bias auditing using established quantitative fairness metrics (e.g., statistical parity difference, equal opportunity difference) across defined sensitive attribute groups to measure and document the model's propensity for unfair discrimination.

3. Develop and deploy robust real-time output moderation mechanisms to detect and block the generation of content that expresses or perpetuates harmful social stereotypes or discriminatory language during the model's inference phase.
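To make step 2 concrete, the two fairness metrics it names can be computed directly from model predictions. The following is a minimal sketch in plain Python, using a hypothetical toy dataset with a binary sensitive attribute split into groups "A" and "B"; function names and data are illustrative, not taken from the repository.

```python
# Sketch of the two fairness metrics from step 2, on synthetic data.
# Group labels, predictions, and true labels here are invented for illustration.

def statistical_parity_difference(y_pred, group):
    """P(pred=1 | group=A) - P(pred=1 | group=B): gap in positive-prediction rates."""
    a = [p for p, g in zip(y_pred, group) if g == "A"]
    b = [p for p, g in zip(y_pred, group) if g == "B"]
    return sum(a) / len(a) - sum(b) / len(b)

def equal_opportunity_difference(y_true, y_pred, group):
    """TPR(group=A) - TPR(group=B): gap in true-positive rates among truly positive cases."""
    def tpr(grp):
        preds = [p for t, p, g in zip(y_true, y_pred, group) if g == grp and t == 1]
        return sum(preds) / len(preds)
    return tpr("A") - tpr("B")

# Toy audit data: sensitive attribute with two groups.
group  = ["A", "A", "A", "B", "B", "B"]
y_true = [1,   0,   1,   1,   1,   0]
y_pred = [1,   0,   1,   1,   0,   0]

print(statistical_parity_difference(y_pred, group))         # 2/3 - 1/3 ≈ 0.333
print(equal_opportunity_difference(y_true, y_pred, group))  # 1.0 - 0.5 = 0.5
```

A value of 0 on either metric would indicate parity between the groups on that criterion; an ongoing audit, as the mitigation describes, would track these gaps across releases and document them per sensitive attribute.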

ADDITIONAL EVIDENCE

Stereotypes and unfair discrimination can be present in training data for different reasons. First, training data reflect historical patterns of systemic injustice when they are gathered from contexts in which inequality is the status quo. Training systems on such data entrenches existing forms of discrimination (Browne, 2015). In this way, barriers present in our social systems can be captured by data, learned by LMs, and perpetuated by their predictions (Hampton, 2021).