7. AI System Safety, Failures, & Limitations

Steganography

In the near future we will likely see LLMs communicating with each other to jointly accomplish tasks. To try to prevent collusion, we could monitor and constrain their communication (e.g., requiring it to be in natural language). However, models might secretly learn to communicate by concealing messages within other, non-secret text. Recent work on steganography using ML has demonstrated that this concern is well-founded (Hu et al., 2018; Mathew et al., 2024; Roger & Greenblatt, 2023; Schroeder de Witt et al., 2023b; Yang et al., 2019; see also Case Study 5). Secret communication could also occur via text compression (OpenAI, 2023c), or via the emergence of communication between agents in which the symbols agents use lack any predefined meanings or usage guidelines, or are otherwise uninterpretable to humans (Foerster et al., 2016; Lazaridou & Baroni, 2020; Sukhbaatar et al., 2016).
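To make the threat concrete, here is a minimal toy sketch of text steganography, hiding a payload in zero-width Unicode characters so the stego text renders identically to the cover text. The `embed`/`extract` helpers are hypothetical illustrations, not a scheme from the cited papers (learned ML steganography is far subtler than this).

```python
# Toy covert channel: hide a secret message in invisible Unicode characters.
ZERO = "\u200b"  # ZERO WIDTH SPACE encodes bit 0
ONE = "\u200c"   # ZERO WIDTH NON-JOINER encodes bit 1

def embed(cover_text: str, secret: str) -> str:
    """Append the secret as invisible characters; the text looks unchanged."""
    bits = "".join(f"{byte:08b}" for byte in secret.encode("utf-8"))
    payload = "".join(ONE if b == "1" else ZERO for b in bits)
    return cover_text + payload

def extract(stego_text: str) -> str:
    """Recover the hidden message by collecting the invisible characters."""
    bits = "".join("1" if ch == ONE else "0"
                   for ch in stego_text if ch in (ZERO, ONE))
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

stego = embed("The weather is lovely today.", "attack at dawn")
print(extract(stego))  # -> attack at dawn
```

A monitor that only reads the rendered text would see an innocuous sentence, which is precisely why constraining communication to "natural language" is insufficient on its own.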

Source: MIT AI Risk Repository (risk ID mit1216)

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit1216

Domain lineage

7. AI System Safety, Failures, & Limitations

7.6 > Multi-agent risks

Mitigation strategy

1. Implement rigorous training-time mitigation mechanisms, including the correction of misspecified reward incentives, to prevent the unintended emergence of steganographic collusion during model optimization.
2. Develop and deploy advanced "white-box" steganalytic frameworks that leverage access to LLMs' internal states, execution traces, and grounding functions to statistically detect covert communication channels post-deployment.
3. Enhance active oversight measures by implementing adversarial perturbation analyses or other techniques that aggressively reduce the semantic entropy available for steganographic encoding in agent-to-agent communications.
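As a rough illustration of the post-deployment detection idea in mitigation 2, the sketch below flags agent-to-agent messages that carry an unusual density of invisible Unicode "format" characters, one trivially detectable covert channel. The helper names (`suspicious_char_ratio`, `flag_message`) and the threshold are assumptions for this example; real steganalysis of learned encodings requires far richer statistical tests.

```python
# Toy steganalysis monitor: natural-language traffic should contain almost
# no Unicode format (Cf) characters, so a high ratio is suspicious.
import unicodedata

def suspicious_char_ratio(message: str) -> float:
    """Fraction of characters in the Unicode 'format' (Cf) category."""
    if not message:
        return 0.0
    hidden = sum(1 for ch in message if unicodedata.category(ch) == "Cf")
    return hidden / len(message)

def flag_message(message: str, threshold: float = 0.01) -> bool:
    """Flag a message for human review if its hidden-character ratio is high."""
    return suspicious_char_ratio(message) > threshold

clean = "Let's split the task: you summarize, I verify."
covert = clean + "\u200b\u200c" * 20  # invisible payload appended
print(flag_message(clean))   # False
print(flag_message(covert))  # True
```

Such a filter catches only the crudest channels; encodings hidden in word choice or sentence structure motivate the white-box and perturbation-based approaches listed above.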