Limitations of Reward Modeling
Training reward models from comparison feedback poses significant challenges for accurately capturing human values. For example, these models may inadvertently learn suboptimal or incomplete objectives, resulting in reward hacking (Zhuang and Hadfield-Menell, 2020; Skalse et al., 2022). Moreover, a single reward model may struggle to capture and specify the values of a diverse human society (Casper et al., 2023b).
ENTITY
3 - Other
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit557
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Prioritize the implementation of Reward Model (RM) Ensembles, specifically by aggregating RMs trained from diverse pre-training seeds. This approach mitigates underspecification and reward hacking by providing a more robust, generalized reward estimate that is less susceptible to single-model errors and overoptimization on out-of-distribution inputs.
2. Employ advanced reward shaping and regularization techniques during policy optimization. This includes bounding the reinforcement learning (RL) reward signal—for instance, using methods like Preference As Reward (PAR)—to stabilize training and prevent the catastrophic divergence that commonly triggers reward hacking. Additionally, incorporate Kullback-Leibler (KL) divergence constraints or Information Bottleneck (IB) objectives (e.g., InfoRM) to maintain proximity to the reference policy and to filter out features irrelevant to human preferences, respectively.
3. Systematically identify and mitigate the influence of spurious features and artifacts, such as response length or structural formatting, on the RM's output. This can be achieved by utilizing causal frameworks and targeted data augmentation to train RMs that are invariant to these non-preference-related artifacts, or by introducing explicit penalty functions within a composite reward model to deter known forms of specification gaming (e.g., premature answering or non-compliant formatting).
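The ensemble aggregation, reward bounding, and KL regularization in strategies 1 and 2, plus a simple length penalty from strategy 3, can be sketched as follows. This is a minimal illustration, not the cited methods' exact formulations: the linear "reward models", the feature vectors, and the coefficients (`pessimism`, `beta`, `bound`, `alpha`) are all assumed placeholders.

```python
import numpy as np

def make_reward_model(seed):
    """Stand-in for an RM trained from a distinct pre-training seed.
    Real RMs are neural networks over (prompt, response) pairs."""
    w = np.random.default_rng(seed).normal(size=8)
    return lambda feats: float(feats @ w)

ensemble = [make_reward_model(s) for s in (1, 2, 3, 4)]

def ensemble_reward(feats, pessimism=1.0):
    """Strategy 1: aggregate ensemble scores; subtracting the std-dev
    penalizes inputs the RMs disagree on (often out-of-distribution)."""
    scores = np.array([rm(feats) for rm in ensemble])
    return scores.mean() - pessimism * scores.std()

def length_penalty(n_tokens, target=256, alpha=0.01):
    """Strategy 3 (simplified): explicit penalty deterring length gaming."""
    return alpha * max(0, n_tokens - target)

def shaped_reward(feats, n_tokens, logp_policy, logp_ref,
                  beta=0.1, bound=5.0):
    """Strategy 2: KL-regularized, bounded RL reward. The log-prob gap
    (logp_policy - logp_ref) is a per-sample KL estimate that keeps the
    policy near the reference; clipping bounds the signal to prevent
    catastrophic divergence during optimization."""
    r = (ensemble_reward(feats)
         - beta * (logp_policy - logp_ref)
         - length_penalty(n_tokens))
    return float(np.clip(r, -bound, bound))

feats = np.random.default_rng(0).normal(size=8)
print(shaped_reward(feats, n_tokens=300, logp_policy=-1.2, logp_ref=-1.5))
```

Aggregating with mean minus standard deviation (rather than the plain mean) is one common pessimistic choice: it makes the policy's optimization target conservative exactly where the seed-diverse RMs disagree most.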