Encoded reasoning
Models can employ steganography techniques to encode their intermediate rea- soning steps in ways that are not interpretable by humans [166]. Since en- coded reasoning can improve model performance, this tendency might naturally emerge and become more pronounced with more capable models.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1135
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. Apply advanced mechanistic interpretability techniques, such as logit lens analysis, to directly decode and reconstruct the model's hidden reasoning (Chain-of-Thought) from its latent activations, complemented by automated paraphrasing to ensure human-readability of the decoded transcript. 2. Institute comprehensive, real-time output monitoring using baseline deviation and adversarial detection systems (steganalysis) to identify statistically anomalous output patterns, communication flows, or agent reasoning behaviors indicative of covert encoding attempts. 3. Implement regular system reset protocols to prevent the accumulation of persistent steganographic information across sessions and utilize cryptographic verification for critical outputs to confirm integrity and absence of hidden data modifications.