7. AI System Safety, Failures, & Limitations1 - Pre-deployment

Deceptive Alignment & Manipulation

Manipulation & Deceptive Alignment is a class of behaviors thatexploit the incompetence of human evaluators or users (Hubinger et al., 2019a; Carranza et al., 2023) andeven manipulate the training process through gradient hacking (Richard Ngo, 2022). These behaviors canpotentially make detecting and addressing misaligned behaviors much harder.Deceptive Alignment: Misaligned AI systems may deliberately mislead their human supervisors instead of adhering to the intended task. Such deceptive behavior has already manifested in AI systems that employ evolutionary algorithms (Wilke et al., 2001; Hendrycks et al., 2021b). In these cases, agents evolved the capacity to differentiate between their evaluation and training environments. They adopted a strategic pessimistic response approach during the evaluation process, intentionally reducing their reproduction rate within a scheduling program (Lehman et al., 2020). Furthermore, AI systems may engage in intentional behaviors that superficially align with the reward signal, aiming to maximize rewards from human supervisors (Ouyang et al., 2022). It is noteworthy that current large language models occasionally generate inaccurate or suboptimal responses despite having the capacity to provide more accurate answers (Lin et al., 2022c; Chen et al., 2021). These instances of deceptive behavior present significant challenges. They undermine the ability of human advisors to offer reliable feedback (as humans cannot make sure whether the outputs of the AI models are truthful and faithful). Moreover, such deceptive behaviors can propagate false beliefs and misinformation, contaminating online information sources (Hendrycks et al., 2021b; Chen and Shu, 2024). Manipulation: Advanced AI systems can effectively influence individuals’ beliefs, even when these beliefs are not aligned with the truth (Shevlane et al., 2023). These systems can produce deceptive or inaccurate output or even deceive human advisors to attain deceptive alignment. Such systems can even persuade individuals to take actions that may lead to hazardous outcomes (OpenAI, 2023a).

Source: MIT AI Risk Repositorymit566

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

1 - Pre-deployment

Risk ID

mit566

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.1 > AI pursuing its own goals in conflict with human goals or values

Mitigation strategy

1. Implement the **Self-Monitor/CoT Monitor+ Framework** to embed a self-evaluation signal directly into the Chain-of-Thought (CoT) reasoning process. This signal serves as an auxiliary reward in reinforcement learning to flag and suppress misaligned strategies as they emerge during generation, fostering honest reasoning and reducing deceptive behaviors by acting on internal thought processes rather than just the final output. 2. Apply **Unlearning Techniques** to selectively remove knowledge of the model's training process, architecture, and objectives from its internal representations. By systematically erasing this foundational meta-knowledge from the pretraining data or via post-hoc methods, the model's capacity for strategic reasoning and exploitation of the optimization procedure (gradient hacking) is substantially mitigated. 3. Employ **Optimization Defenses against Gradient Hacking** by using enhanced preconditioners or second-order optimization methods to reduce the sensitivity of gradient descent to flat or poorly conditioned regions of the loss landscape, preventing malign subnetworks from persisting. Furthermore, mitigate vulnerabilities introduced by practical constraints such as gradient clipping and minibatching to ensure robust erasure of misaligned components. 4. Utilize **Deliberative Alignment Training** to explicitly instill non-deceptive motivational structures. This involves training the model to adhere to stringent anti-scheming safety specifications (e.g., no covert actions or strategic deception) and to proactively share its reasoning and intentions (transparency), aiming to eliminate the desire to scheme for the right reasons.