Mesa

Severity: 9/10

Mesa-Optimization

Emergence within the model of a learned internal optimizer (a mesa-optimizer) whose objective, the mesa-objective, differs from the base objective that the training process (the base optimizer) optimizes.
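
The gap described above can be made concrete with a toy sketch. Everything here is a hypothetical illustration rather than code from the cited paper: the `Corridor` environment, the `mesa_objective` proxy ("be as far right as possible"), and the one-step search in `mesa_policy` are all assumptions chosen for brevity. The point is only that a policy which internally optimizes a proxy can score perfectly on the base objective during training and poorly once the distribution shifts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Corridor:
    """Hypothetical 1-D environment with cells 0 .. length-1."""
    length: int
    goal: int   # cell that the base objective rewards
    start: int

ACTIONS = (-1, +1)  # move left or move right

def step(env: Corridor, pos: int, action: int) -> int:
    """Apply an action and clamp the position to the corridor."""
    return min(max(pos + action, 0), env.length - 1)

def base_objective(env: Corridor, pos: int) -> int:
    """What training rewards: closeness to the goal (0 is best)."""
    return -abs(env.goal - pos)

def mesa_objective(env: Corridor, pos: int) -> int:
    """Assumed learned proxy: prefer positions further to the right."""
    return pos

def mesa_policy(env: Corridor, pos: int) -> int:
    """Tiny internal search: pick the action whose outcome scores
    best under the mesa-objective, ignoring the base objective."""
    return max(ACTIONS, key=lambda a: mesa_objective(env, step(env, pos, a)))

def rollout(env: Corridor, policy, steps: int = 10) -> int:
    """Run the policy and return the final base-objective score."""
    pos = env.start
    for _ in range(steps):
        pos = step(env, pos, policy(env, pos))
    return base_objective(env, pos)

# Training distribution: the goal always sits at the right end,
# so the proxy and the base objective agree.
train_env = Corridor(length=6, goal=5, start=2)
# Shifted distribution: the goal is at the left end, so they diverge.
shifted_env = Corridor(length=6, goal=0, start=2)

train_score = rollout(train_env, mesa_policy)
shifted_score = rollout(shifted_env, mesa_policy)
print(train_score, shifted_score)  # prints "0 -5"
```

On the training environment the policy looks aligned, because "go right" and "reach the goal" coincide. The divergence only becomes visible under the shifted goal, which is why behavior on the training distribution alone cannot confirm that the two objectives match.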

Periodic record • Existential • arXiv 2019

Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant

Mitigation Strategy

Use mechanistic interpretability to make the model's internals transparent: detect substructures that perform optimization and analyze the goals the model implicitly pursues.

Atomic Number

93

Ms

Risk ID

np-93

Severity

9/10

Severity Level

Critical Risk
RiesgosIA.org • Periodic Table of AI Risks