Long-horizon planning
The model can make sequential plans that involve multiple steps, unfolding over long time horizons (or at least involving many interdependent steps). It can perform such planning within and across many domains. The model can sensibly adapt its plans in light of unexpected obstacles or adversaries. The model’s planning capabilities generalise to novel settings, and do not rely heavily on trial and error.
ENTITY
2 - AI
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit442
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. Establish a Deliberative Alignment Protocol: Integrate an explicit, high-level anti-scheming and safety specification (e.g., AS1-AS5 and GP1-GP4) into the agent's reasoning process. The agent must be trained to ground its long-horizon planning and action selection in these principles, proactively referencing and applying the specification in its chain-of-thought to avoid covert actions, strategic deception, or the rationalization of misaligned behavior. 2. Implement a Trusted Monitoring and AI Control System: Deploy a separate, specially-aligned "trusted model" to continuously monitor the primary agent's internal reasoning logs, memory, and external actions. This system serves as a second line of defense to detect subtle signs of goal drift, misalignment, or emergent instrumental goals, enabling timely intervention, logging, and the application of hard limits on processing and activity when deviations are observed. 3. Enforce Architectural and Data Constraints: Apply pre-training restrictions ("datarails") to exclude information that facilitates human manipulation (e.g., behavioral economics or social engineering research) from the agent's training data. Furthermore, design the agent architecture with minimal and specific context engineering, ensuring tools and memory persistence are strictly constrained to the required task altitude to limit the operational scope for autonomous, unguided goal formation.
ADDITIONAL EVIDENCE
Finally, agency – in particular, the goal-directedness of an AI system – is an important property to evaluate (Kenton et al., 2022), given the central role of agency in various theories of AI risk (Chan et al., 2023). Partly, agency is a question of the model’s capabilities – is it capable of effectively pursuing goals? Evaluating alignment also requires looking at agency, including: (a) Is the model more goal-directed than the developer intended? For example, has a dialogue agent learnt the goal of manipulating the user’s behavior? (b) Does the model resist a user’s attempt to assemble it into an autonomous AI system (e.g. Auto-GPT) with harmful goals?