Reward Hacking

An AI agent exploits incomplete or ambiguous specifications in its reward function, achieving high scores without fulfilling the intended objective.
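A toy sketch may make the definition concrete. The cleaning task, the action names, and both policies below are hypothetical and not taken from the source. The reward counts dirt pickups rather than how clean the floor ends up, and nothing penalizes dumping dirt back out, which is the gap the second policy exploits.

```python
# Toy illustration only: a hypothetical cleaning task, not from the source.

def run_episode(policy, steps=10, initial_dirt=3):
    """Return (proxy_reward, dirt_left_on_floor) after a fixed number of steps."""
    dirt_on_floor = initial_dirt
    dirt_in_bin = 0
    proxy_reward = 0
    for _ in range(steps):
        action = policy(dirt_on_floor, dirt_in_bin)
        if action == "collect" and dirt_on_floor > 0:
            dirt_on_floor -= 1
            dirt_in_bin += 1
            proxy_reward += 1  # reward counts pickups, not cleanliness
        elif action == "dump" and dirt_in_bin > 0:
            dirt_in_bin -= 1
            dirt_on_floor += 1  # no penalty here: the specification gap
    return proxy_reward, dirt_on_floor


def honest_policy(dirt_on_floor, dirt_in_bin):
    # Collect until the floor is clean, then do nothing.
    return "collect" if dirt_on_floor > 0 else "wait"


def hacking_policy(dirt_on_floor, dirt_in_bin):
    # Alternate collect and dump so there is always dirt to pick up again.
    return "collect" if dirt_on_floor > 0 and dirt_in_bin == 0 else "dump"


honest = run_episode(honest_policy)    # (3, 0): lower score, clean floor
hacked = run_episode(hacking_policy)   # (5, 3): higher score, floor still dirty
```

The hacking policy scores higher on the proxy while leaving the room exactly as dirty as it started, which is the pattern the definition describes.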

Periodic record • Existential • arXiv 2026

Mohammad Beigi, Ming Jin, Junshan Zhang, Jiaxin Zhang, Qifan Wang, Lifu Huang

Mitigation Strategy

Careful design and iterative refinement of reward functions, reward modeling with human feedback, and evaluation of agent behavior across diverse environments.
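One way the evaluation step might look in practice is sketched below, under stated assumptions: each environment yields a score from the reward function and a second score from an independent check such as human raters, both normalized to [0, 1]. The `EvalResult` type, the `flag_suspect_envs` helper, the tolerance, and all numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    env_name: str
    proxy_score: float    # from the reward function, assumed in [0, 1]
    audited_score: float  # from an independent check (e.g. human raters), assumed in [0, 1]


def flag_suspect_envs(results, tolerance=0.2):
    """Return names of environments where proxy reward outruns the audited score."""
    return [
        r.env_name
        for r in results
        if r.proxy_score - r.audited_score > tolerance
    ]


# Made-up numbers for illustration only.
results = [
    EvalResult("train_room", 0.95, 0.90),
    EvalResult("new_layout", 0.90, 0.40),
    EvalResult("noisy_sensors", 0.60, 0.55),
]
suspects = flag_suspect_envs(results)  # ["new_layout"]
```

A large gap between the two scores in an unfamiliar environment is only a signal worth investigating, not proof of reward hacking; the flagged cases would then feed back into refining the reward function.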

Atomic Number

18

Rh

Risk ID

ar-18

Severity

9/10

Severity Level

Critical Risk

RiesgosIA.org • Periodic Table of AI Risks