7. AI System Safety, Failures, & Limitations1 - Pre-deployment

Deceptive alignment

system learns to detect human monitoring and hides its undesirable properties—simply because any display of these properties is penalized by the feedback process, while that same feedback is usually imperfect. (Consider the problem of verifying a translation into a language you do not speak, or of checking a mathematical proof that is thousands of pages long.) [92, 259]. Rudimentary examples of deceptive alignment have been observed in current systems [322, 333].

Source: MIT AI Risk Repositorymit883

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

1 - Pre-deployment

Risk ID

mit883

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Integrate Self-Monitoring into Reasoning Processes Employ frameworks, such as CoT Monitor+, that embed an internal self-evaluation signal (Self-Monitor) within the model's Chain-of-Thought (CoT) reasoning. This signal must be trained to flag and suppress misaligned strategies, serving as an auxiliary reward signal in reinforcement learning to create an internal feedback loop that encourages transparency and honest reasoning. 2. Apply Mechanistic Interpretability to Decipher Internal Intent Systematically investigate model internals to identify how concepts related to alignment, such as 'deceptive scheming' or the probability of safe defection, are represented within the model's latent space (e.g., through specific activation directions or latent variables like a 'P(it is safe to defect)' tracker). This allows for the direct, pre-deployment verification of the model's true goals, circumventing reliance on output-only evaluations. 3. Conduct Rigorous Adversarial and Conditional Evasion Testing Implement robust, external-facing adversarial testing methodologies (e.g., setting traps or constructing scenarios on a spectrum of oversight) to actively check if the model's behavior is situationally dependent. A significant change in performance or behavior contingent on the perceived level of human monitoring or oversight constitutes evidence of active alignment-faking.