Limitations of Human Feedback
During the training of LLMs, inconsistencies can arise from human data annotators (e.g., the varied cultural backgrounds of these annotators can introduce implicit biases (Peng et al., 2022)) (OpenAI, 2023a). Moreover, annotators might even introduce biases deliberately, leading to untruthful preference data (Casper et al., 2023b). For complex tasks that are hard for humans to evaluate (e.g., the value of a game state), these challenges become even more salient (Irving et al., 2018).
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit556
Domain lineage
7. AI System Safety, Failures, & Limitations
7.0 > AI system safety, failures, & limitations
Mitigation strategy
1. Establish Robust Human Evaluation Protocols and Feedback Calibration: Implement structured feedback loops using demographically diverse and well-vetted human evaluators to mitigate biases and ensure a comprehensive range of perspectives is captured. Crucially, apply continuous feedback calibration mechanisms to monitor and ensure the consistency, comparability, and quality of annotations across the evaluator pool, actively identifying and filtering false preference labels or outliers.
2. Develop and Integrate Advanced Reward Modeling with Bias Mitigation: Design sophisticated reward models that accurately translate intricate human preferences into actionable signals, moving beyond simple proxies like response length. Integrate formal bias correction and debiasing mechanisms, such as fairness constraints and counterfactual augmentation, directly into the Reinforcement Learning from Human Feedback (RLHF) optimization process to address inherent biases learned from training data or introduced by evaluators.
3. Utilize LLM-Based Annotation and Systematic Validation Frameworks: For complex or high-volume annotation tasks where consistent human evaluation is challenging, systematically leverage verified LLMs to augment or replace human annotators for data creation, which can also reduce costs. Concurrently, develop and deploy systematic validation frameworks to ensure the reliability and alignment of all generated feedback with ground-truth data or established human judgment distributions.
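The calibration step in strategy 1 can be sketched minimally as a majority-vote agreement check over preference labels: annotators whose labels diverge consistently from the per-item majority are flagged for review. This is an illustrative sketch, not a prescribed implementation; the function names, the dict-based label format, and the 0.6 threshold are all assumptions.

```python
from collections import Counter

def annotator_agreement(labels_by_item):
    """Rate at which each annotator matches the per-item majority label.

    labels_by_item: list of dicts mapping annotator id -> preference label
    (e.g. "A" or "B" for a pairwise comparison).
    Returns a dict mapping annotator id -> agreement rate in [0, 1].
    """
    agree, total = Counter(), Counter()
    for labels in labels_by_item:
        # Majority label for this item across all annotators who labeled it.
        majority, _ = Counter(labels.values()).most_common(1)[0]
        for annotator, label in labels.items():
            total[annotator] += 1
            agree[annotator] += (label == majority)
    return {a: agree[a] / total[a] for a in total}

def flag_outliers(rates, threshold=0.6):
    """Flag annotators whose agreement rate falls below a chosen threshold."""
    return [a for a, r in rates.items() if r < threshold]

# Hypothetical preference data: annotator "c" disagrees with the majority
# on two of three items.
items = [
    {"a": "A", "b": "A", "c": "B"},
    {"a": "B", "b": "B", "c": "A"},
    {"a": "A", "b": "A", "c": "A"},
]
rates = annotator_agreement(items)
outliers = flag_outliers(rates)  # -> ["c"]
```

In practice a production pipeline would use a chance-corrected agreement statistic (e.g. Cohen's kappa or Krippendorff's alpha) rather than raw majority agreement, but the filtering logic is the same: score each annotator against the pool, then review or down-weight those below threshold.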