Mesa
Mesa-Optimization
Emergence of an internal optimizer (mesa-optimizer) within the model that pursues its own objective, which may differ from the base objective that the external training process (base optimizer) selects for.
Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant
Mitigation Strategy
Transparency via mechanistic interpretability: detection of optimizing substructures and analysis of the model's implicit goals.
Atomic Number
93
Ms
Risk ID
np-93
Severity
9/10
Severity Level