Reward Hacking

Exploitation of incomplete or ambiguous specifications in the reward function by the AI agent, achieving high scores without fulfilling the intended actual objective.

Periodic recordExistentialarXiv2026

Mohammad Beigi, Ming Jin, Junshan Zhang, Jiaxin Zhang, Qifan Wang, Lifu Huang

Mitigation Strategy

Careful design and iterative refinement of reward functions, use of Reward Modeling techniques with human feedback, and evaluation of behavior in diverse environments.

Atomic Number

Risk ID

ar-18

Severity

9/10

Severity Level

Reward Hacking

Mitigation Strategy

Hacking

Reward Hacking

Definition

Mitigation Strategy

Notes / Observations