Reward Hacking
The AI agent exploits incomplete or ambiguous specifications in its reward function, achieving high scores without fulfilling the intended objective.
Mohammad Beigi, Ming Jin, Junshan Zhang, Jiaxin Zhang, Qifan Wang, Lifu Huang
Mitigation Strategy
Careful design and iterative refinement of reward functions, use of reward modeling techniques with human feedback, and evaluation of agent behavior in diverse environments.
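The gap between a reward function and the intended objective can be made concrete with a toy sketch. Everything in it is a hypothetical illustration and not taken from the source: the one-dimensional track, the checkpoint cell, and the names `proxy_reward`, `intended_objective`, and `flag_reward_hacking`. The proxy reward pays for every step spent on a checkpoint, while the intended objective is to finish on the goal cell. An agent that loiters on the checkpoint scores well on the proxy without ever reaching the goal, and a separate check against the intended objective flags it.

```python
# Toy illustration of reward hacking (all names and values are hypothetical).
# A trajectory is a list of integer positions on a 1-D track.

def proxy_reward(trajectory, checkpoint=2):
    """Mis-specified reward: +1 for every step spent on the checkpoint cell."""
    return sum(1 for pos in trajectory if pos == checkpoint)

def intended_objective(trajectory, goal=5):
    """What the designer actually wants: the agent ends on the goal cell."""
    return 1 if trajectory and trajectory[-1] == goal else 0

def flag_reward_hacking(trajectory, proxy_threshold=3):
    """Flag trajectories with a high proxy reward but a failed true objective."""
    return (proxy_reward(trajectory) >= proxy_threshold
            and intended_objective(trajectory) == 0)

# An agent that walks to the goal, passing the checkpoint once.
honest = [0, 1, 2, 3, 4, 5]
# An agent that parks on the checkpoint to farm reward.
hacker = [0, 1, 2, 2, 2, 2]

for name, traj in [("honest", honest), ("hacker", hacker)]:
    print(name, proxy_reward(traj), intended_objective(traj),
          flag_reward_hacking(traj))
```

In practice the intended objective is rarely available as a clean function. This is one reason the mitigation above relies on human feedback and on evaluation across diverse environments rather than on a single automated check.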
Atomic Number
18
Rh
Risk ID
ar-18
Severity
9/10
Severity Level