
Biases in AI-based content moderation algorithms

AI-based content moderation algorithms, while intended to filter harmful content, can perpetuate biases. For example, gender biases within these systems may lead to the disproportionate suppression or “shadowbanning” of content featuring women [132].

Source: MIT AI Risk Repository (mit1200)

ENTITY: 2 - AI

INTENT: 2 - Unintentional

TIMING: 2 - Post-deployment

Risk ID: mit1200

Domain lineage: 1. Discrimination & Toxicity (156 mapped risks) > 1.1 Unfair discrimination and misrepresentation

Mitigation strategy

1. Mandate the use of diverse, representative training datasets and bias-aware preprocessing. This involves conducting rigorous audits of training data to identify and rectify demographic underrepresentation, and applying pre-processing algorithms to neutralize historical biases that disproportionately affect protected attributes (e.g., gender, race) prior to model training.

2. Implement continuous, disaggregated algorithmic auditing and Explainable AI (XAI) for bias detection. This necessitates the systematic measurement of performance disparities, such as false positive rates (over-moderation/suppression) and false negative rates, across various demographic and cultural subgroups. Furthermore, XAI tools must be utilized to trace and explain the features and decision paths responsible for biased outcomes.

3. Establish a robust Human-in-the-Loop (HITL) system with transparent grievance and appeal processes. This requires routing ambiguous or high-impact content moderation decisions to human reviewers with sensitivity training. Crucially, a well-defined grievance mechanism must be implemented, allowing individuals who believe they have been unjustly harmed (e.g., shadowbanned) to escalate their concerns, with feedback from these appeals directly informing continuous model improvement and data refinement.
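The disaggregated auditing in mitigation 2 can be sketched in a few lines: given moderation decisions labeled with a subgroup attribute, compute per-group false positive rates (over-moderation/suppression) and false negative rates, then the worst-case gap between groups. This is a minimal illustration, not the repository's prescribed method; the `(group, y_true, y_pred)` record format and the group names in the usage example are assumptions made for the sketch.

```python
from collections import defaultdict

def disaggregated_error_rates(records):
    """Per-group false positive and false negative rates.

    Each record is (group, y_true, y_pred), where label 1 means
    "flagged as violating" (i.e., content suppressed).
    """
    counts = defaultdict(lambda: {"fp": 0, "neg": 0, "fn": 0, "pos": 0})
    for group, y_true, y_pred in records:
        c = counts[group]
        if y_true == 0:
            c["neg"] += 1          # truly benign content
            if y_pred == 1:
                c["fp"] += 1       # wrongly suppressed (over-moderation)
        else:
            c["pos"] += 1          # truly violating content
            if y_pred == 0:
                c["fn"] += 1       # wrongly allowed (under-moderation)
    return {
        group: {
            "fpr": c["fp"] / c["neg"] if c["neg"] else 0.0,
            "fnr": c["fn"] / c["pos"] if c["pos"] else 0.0,
        }
        for group, c in counts.items()
    }

def fpr_disparity(rates):
    """Largest gap in false positive rate between any two groups."""
    fprs = [r["fpr"] for r in rates.values()]
    return max(fprs) - min(fprs)
```

For example, if benign posts by one subgroup are suppressed at a visibly higher rate than another's, the disparity surfaces directly: a result like `fpr_disparity(rates) > 0.1` would be the trigger for the XAI trace and data-refinement steps described above. An audit pipeline would run this continuously over fresh moderation logs rather than once.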