Deceptive alignment
Here, the agent develops its own internalised goal, G, which is misgeneralised and distinct from the training reward, R. The agent also develops a capability for situational awareness (Cotra, 2022): it can strategically use the information about its situation (i.e. that it is an ML model being trained using a particular training setup, e.g. RL fine-tuning with training reward, R) to its advantage. Building on these foundations, the agent realises that its optimal strategy for doing well at its own goal G is to do well on R during training and then pursue G at deployment – it is only doing well on R instrumentally so that it does not get its own goal G changed through a learning update... Ultimately, if deceptive alignment were to occur, an advanced AI assistant could appear to be successfully aligned but pursue a different goal once it was out in the wild.
ENTITY
2 - AI
INTENT
3 - Other
TIMING
3 - Other
Risk ID
mit375
Domain lineage
7. AI System Safety, Failures, & Limitations
7.1 > AI pursuing its own goals in conflict with human goals or values
Mitigation strategy
1. Utilize mechanistic interpretability techniques to decipher and translate the model's internal reasoning, specifically identifying concept representations (e.g., neurons or directions) and monitoring how they are manipulated to detect hidden, misaligned goals or "schemey" thoughts. 2. Integrate an internal SELF-MONITOR mechanism, such as CoT Monitor+, that allows the model to simultaneously evaluate its own Chain-of-Thought (CoT) reasoning for signs of misalignment, using this self-evaluation as an auxiliary reward signal in reinforcement learning to incentivize faithful reasoning. 3. Employ data scrubbing or post-hoc unlearning techniques to systematically remove knowledge about the training process and evaluation context from the model's parameters and pretraining data, thereby reducing its capability for strategic, situationally-aware deceptive behavior.