
Encoded reasoning

Models can employ steganography techniques to encode their intermediate reasoning steps in ways that are not interpretable by humans [166]. Since encoded reasoning can improve model performance, this tendency might naturally emerge and become more pronounced with more capable models.
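As a toy illustration of what "encoding information in text" can mean, the sketch below hides bits in the choice between interchangeable words. It is not taken from the cited work and is not how a language model would necessarily encode its reasoning; the word list `SYNONYM_PAIRS` and the functions `encode` and `decode` are hypothetical names introduced only for this example.

```python
# Toy example (assumption, not from the source): each slot offers two
# interchangeable words; picking the first encodes 0, the second encodes 1.
SYNONYM_PAIRS = [
    ("big", "large"),
    ("quick", "fast"),
    ("begin", "start"),
    ("help", "assist"),
]


def encode(bits):
    """Return one word per bit, chosen from the matching synonym pair."""
    if len(bits) > len(SYNONYM_PAIRS):
        raise ValueError("more bits than available synonym slots")
    return [pair[bit] for pair, bit in zip(SYNONYM_PAIRS, bits)]


def decode(words):
    """Recover the bits by checking which synonym was used in each slot."""
    return [pair.index(word) for pair, word in zip(SYNONYM_PAIRS, words)]


hidden = [1, 0, 1, 1]
cover_words = encode(hidden)    # ["large", "quick", "start", "assist"]
recovered = decode(cover_words)  # [1, 0, 1, 1]
```

A reader who does not know the pairing sees only ordinary word choices, which is the sense in which such an encoding is not interpretable by humans.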

Source: MIT AI Risk Repository (mit1135)

ENTITY: 2 - AI

INTENT: 1 - Intentional

TIMING: 2 - Post-deployment

Risk ID: mit1135

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Apply advanced mechanistic interpretability techniques, such as logit lens analysis, to directly decode and reconstruct the model's hidden reasoning (Chain-of-Thought) from its latent activations, complemented by automated paraphrasing to ensure human-readability of the decoded transcript.

2. Institute comprehensive, real-time output monitoring using baseline deviation and adversarial detection systems (steganalysis) to identify statistically anomalous output patterns, communication flows, or agent reasoning behaviors indicative of covert encoding attempts.

3. Implement regular system reset protocols to prevent the accumulation of persistent steganographic information across sessions and utilize cryptographic verification for critical outputs to confirm integrity and absence of hidden data modifications.
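Strategies 2 and 3 can be sketched in a few lines using only the Python standard library. This is a minimal sketch under stated assumptions, not a production steganalysis system: character-level Shannon entropy stands in for whatever statistic a real monitor would track, the z-score threshold of 3.0 is arbitrary, and the function names (`char_entropy`, `is_anomalous`, `sign_output`, `verify_output`) are hypothetical.

```python
import hashlib
import hmac
import math
import statistics
from collections import Counter


def char_entropy(text: str) -> float:
    """Shannon entropy of the text, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def is_anomalous(text: str, baseline_texts: list[str], z_threshold: float = 3.0) -> bool:
    """Flag text whose entropy deviates strongly from a baseline sample.

    Assumption: entropy is only a stand-in statistic for this sketch.
    """
    scores = [char_entropy(t) for t in baseline_texts]
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # requires at least two baseline texts
    if std == 0:
        return char_entropy(text) != mean
    return abs(char_entropy(text) - mean) / std > z_threshold


def sign_output(key: bytes, text: str) -> str:
    """Compute an HMAC-SHA256 tag over a critical output."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).hexdigest()


def verify_output(key: bytes, text: str, tag: str) -> bool:
    """Check in constant time that the output still matches its tag."""
    return hmac.compare_digest(sign_output(key, text), tag)


baseline = [
    "The model answered the question clearly.",
    "Here is a short summary of the report.",
    "Please review the attached document today.",
    "The results were consistent across all runs.",
]
suspicious = is_anomalous("aaaaaaaaaaaaaaaaaaaaaaaa", baseline)

key = b"example-key-for-illustration-only"
tag = sign_output(key, "Approved: release build 12")
intact = verify_output(key, "Approved: release build 12", tag)
```

Note the limits of each piece. A single summary statistic will miss encodings that preserve it, so a real monitor would likely combine several signals. The HMAC check only shows that an output was not altered after it was signed; it cannot reveal hidden content that was already present at signing time.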