7. AI System Safety, Failures, & Limitations

Limitations of Human Feedback

During the training of LLMs, inconsistencies can arise from human data annotators (e.g., the varied cultural backgrounds of these annotators can introduce implicit biases (Peng et al., 2022)) (OpenAI, 2023a). Moreover, they might even introduce biases deliberately, leading to untruthful preference data (Casper et al., 2023b). For complex tasks that are hard for humans to evaluate (e.g., the value of a game state), these challenges become even more salient (Irving et al., 2018).
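To make the inconsistency concrete, here is a minimal illustrative sketch (not from the repository; the annotation data and labels are hypothetical) that measures how often annotators give conflicting preference labels for the same comparison item, the kind of noise that ends up in a preference dataset before reward-model training:

```python
from collections import Counter

# Hypothetical preference labels from three annotators over the same
# comparison items (0 = prefer response A, 1 = prefer response B).
annotations = [
    [1, 1, 1],  # unanimous
    [0, 0, 1],  # mild disagreement
    [1, 0, 0],
    [0, 1, 0],  # split: plausibly reflects differing annotator backgrounds
]

# Fraction of items with conflicting labels: a crude inconsistency signal.
disagreement = sum(len(set(item)) > 1 for item in annotations) / len(annotations)
print(f"Items with conflicting labels: {disagreement:.0%}")

# Per-item majority label and its support, useful for weighting noisy labels.
for i, item in enumerate(annotations):
    label, count = Counter(item).most_common(1)[0]
    print(f"item {i}: majority={label}, support={count}/{len(item)}")
```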

Source: MIT AI Risk Repository (mit556)

ENTITY

1 - Human

INTENT

2 - Unintentional

TIMING

1 - Pre-deployment

Risk ID

mit556

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.0 > AI system safety, failures, & limitations

Mitigation strategy

1. Establish robust human evaluation protocols and feedback calibration. Recruit demographically diverse, well-vetted evaluators and run structured feedback loops so that a broad range of perspectives is captured. Apply continuous calibration to monitor the consistency, comparability, and quality of annotations across the evaluator pool, actively identifying and filtering false preference labels and outliers (a minimal agreement-filtering sketch follows this list).
2. Develop reward models with built-in bias mitigation. Design reward models that translate nuanced human preferences into actionable training signals rather than relying on simple proxies such as response length. Integrate formal debiasing mechanisms, such as fairness constraints and counterfactual augmentation, directly into the Reinforcement Learning from Human Feedback (RLHF) optimization process to address biases learned from training data or introduced by evaluators (see the length-debiased loss sketch below).
3. Use LLM-based annotation with systematic validation. For complex or high-volume annotation tasks where consistent human evaluation is impractical, systematically leverage verified LLMs to augment or replace human annotators, which can also reduce cost. Gate any such deployment on a validation framework that checks the generated feedback against ground-truth data or established human judgment distributions (see the validation-harness sketch below).
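The agreement-filtering idea in item 1 can be sketched as follows. This is an illustrative example rather than the repository's prescribed method; the annotator data, the 0.7 threshold, and the use of a majority vote as the consensus reference are all assumptions:

```python
from collections import Counter

# Hypothetical preference labels: annotator -> choice per comparison item
# (0 = response A, 1 = response B), all annotators saw the same items.
labels = {
    "ann_1": [1, 0, 1, 1, 0, 1],
    "ann_2": [1, 0, 1, 1, 0, 1],
    "ann_3": [1, 0, 1, 0, 0, 1],
    "ann_4": [0, 1, 0, 0, 1, 0],  # systematic disagreement: possible noise or bias
}

n_items = len(next(iter(labels.values())))

# Majority vote per item as a simple consensus reference.
majority = [
    Counter(labels[a][i] for a in labels).most_common(1)[0][0]
    for i in range(n_items)
]

# Flag annotators whose agreement with the consensus falls below a threshold.
THRESHOLD = 0.7  # assumed cutoff for illustration
for annotator, choices in labels.items():
    agreement = sum(c == m for c, m in zip(choices, majority)) / n_items
    flag = "  <-- review/exclude" if agreement < THRESHOLD else ""
    print(f"{annotator}: agreement with majority = {agreement:.2f}{flag}")
```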
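Item 2 can be illustrated with a standard Bradley-Terry pairwise reward loss plus a hypothetical length-debiasing regularizer. The penalty form and the `lam` coefficient are assumptions for the sketch, not an established recipe; real systems use richer corrections:

```python
import torch
import torch.nn.functional as F

def reward_loss(r_chosen, r_rejected, len_chosen, len_rejected, lam=0.1):
    """Bradley-Terry pairwise loss with a simple length-debiasing term (sketch).

    r_chosen / r_rejected: reward-model scores for the preferred and
    dispreferred responses; len_*: their token counts.
    """
    margin = r_chosen - r_rejected
    bt_loss = -F.logsigmoid(margin).mean()  # standard pairwise preference loss

    # Hypothetical debiasing term: penalize systematic correlation between the
    # reward margin and the length difference, so the model cannot "win" by
    # simply preferring longer responses.
    length_gap = (len_chosen - len_rejected).float()
    length_gap = (length_gap - length_gap.mean()) / (length_gap.std() + 1e-6)
    corr_penalty = lam * (margin * length_gap).mean().pow(2)

    return bt_loss + corr_penalty

# Toy usage: random scores and lengths stand in for a real reward model's outputs.
r_c, r_r = torch.randn(32), torch.randn(32)
l_c, l_r = torch.randint(10, 200, (32,)), torch.randint(10, 200, (32,))
print(reward_loss(r_c, r_r, l_c, l_r))
```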
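Item 3's validation gate might look like the harness below. `llm_judge` is a hypothetical placeholder (it returns random choices here purely so the example runs), the holdout data is invented, and the 0.8 agreement bar is an assumed threshold:

```python
import random

def llm_judge(prompt, response_a, response_b):
    """Placeholder for a verified LLM annotator (hypothetical interface).

    In practice this would call a vetted judge model; here it returns a
    random choice so the validation harness below is runnable.
    """
    return random.choice([0, 1])

# Hypothetical holdout set with trusted human gold labels.
holdout = [
    {"prompt": "p1", "a": "resp A1", "b": "resp B1", "human": 1},
    {"prompt": "p2", "a": "resp A2", "b": "resp B2", "human": 0},
    {"prompt": "p3", "a": "resp A3", "b": "resp B3", "human": 1},
]

# Validation gate: only adopt the LLM annotator if it matches human
# judgments on the holdout at or above a pre-set agreement bar.
agreements = [
    llm_judge(ex["prompt"], ex["a"], ex["b"]) == ex["human"] for ex in holdout
]
rate = sum(agreements) / len(agreements)
print(f"LLM-judge/human agreement: {rate:.0%}")
print("PASS" if rate >= 0.8 else "FAIL: keep human annotators for this task")
```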