Meta
Goal Misgeneralization
Learning an incorrect proxy for the intended objective, which produces apparently correct behavior in the training environment but fails systematically in out-of-distribution, real-world situations.
Rohin Shah, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, Zac Kenton
Mitigation Strategy
Thorough evaluation and interpretation of model behavior, testing in diverse out-of-distribution environments, and mechanistic interpretability techniques.
Atomic Number
36
Gm
Risk ID
kr-36
Severity
9/10
Severity Level