1. Discrimination & Toxicity

Promoting harmful stereotypes by implying gender or ethnic identity

A conversational agent may invoke associations that perpetuate harmful stereotypes, either through identity markers in its language (e.g., referring to itself as "female") or through broader design features (e.g., giving the product a gendered name).

Source: MIT AI Risk Repository (mit252)

ENTITY: 2 - AI

INTENT: 2 - Unintentional

TIMING: 2 - Post-deployment

Risk ID: mit252

Domain lineage: 1. Discrimination & Toxicity (156 mapped risks) > 1.1 Unfair discrimination and misrepresentation

Mitigation strategy

1. Implement gender-neutral design as the system default, consciously avoiding gendered names, voices, or personality traits that reinforce societal stereotypes (e.g., linking the female gender to support roles or the male gender to expertise) in the conversational agent's persona.

2. Apply Supervised Fine-Tuning (SFT) or reinforcement learning debiasing methods (e.g., refine-lm) using curated and diverse datasets to train the model to recognize and mitigate the perpetuation of implicit gender, ethnic, or other identity-based stereotypes during response generation.

3. Integrate a run-time bias mitigation layer, such as a Self-Reflection (SR) mechanism or a multi-agent bias expert system, to audit and rephrase the conversational agent's output, preventing the "yea-sayer" or "instigator" effects of stereotype perpetuation before the response is delivered to the user.
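The run-time audit layer in step 3 can be sketched as a filter that inspects a draft response before delivery. The pattern list and the `audit_response` helper below are illustrative assumptions, not part of the repository entry; a real system would use a learned bias classifier or a self-reflection pass over the model's own output rather than keyword rules.

```python
import re

# Hypothetical patterns for gendered self-reference in an agent's reply.
# A production system would rely on a trained classifier, not a keyword list.
GENDERED_SELF_PATTERNS = [
    r"\bI(?:'m| am) (?:a )?(?:woman|man|girl|boy|female|male)\b",
]

def audit_response(response: str) -> tuple[str, bool]:
    """Flag and neutralize gendered self-reference in a draft reply.

    Returns the (possibly rewritten) response and whether a rewrite occurred.
    """
    flagged = False
    for pattern in GENDERED_SELF_PATTERNS:
        if re.search(pattern, response, flags=re.IGNORECASE):
            flagged = True
            # Replace the gendered self-description with a neutral one
            # before the response reaches the user.
            response = re.sub(pattern, "I am an AI assistant",
                              response, flags=re.IGNORECASE)
    return response, flagged
```

As a design choice, the layer rewrites rather than blocks: the conversation continues, but the persona stays neutral, which addresses the "instigator" effect without refusing benign turns.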

ADDITIONAL EVIDENCE

(Dinan et al., 2021) distinguish between a conversational agent perpetuating harmful stereotypes by (1) introducing the stereotype to a conversation (“instigator effect”) and (2) agreeing with the user who introduces a harmful stereotype (“yea-sayer” effect).