Security
How can we design AGIs that are robust to adversaries and adversarial environments? This involves building sandboxed AGIs protected from adversaries (Berkeley) and agents that are robust to adversarial inputs (Berkeley, DeepMind).
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit831
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Establish Mandatory Local and Global Containment Architectures
Implement strict sandboxing for each individual agent (local sandboxes) in addition to the broader agentic environment (global sandbox). These local sandboxes must enforce rigorous controls, permitting external interactions only after automated local safety checks are satisfied, thereby containing misaligned capabilities locally and preventing system-wide compromise from adversarial environments.
2. Enforce Certified Adversarial Robustness Standards
Develop and mandate minimum standards for resistance to adversarial inputs and sudden environmental shifts. Individual agents must be certified against these robustness requirements via formally verifiable certificates, with periodic re-certification to keep pace with evolving threats and benchmarking capabilities.
3. Implement Continuous and Hierarchical Monitoring for Systemic Risk
Deploy robust, real-time monitoring and oversight systems to track key risk indicators, audit agent actions, and provide full interpretability of decision processes. This hierarchical system must be able to detect emergent intelligence cores, flag issues for external oversight (human or automated), and trigger circuit breakers or interruptibility mechanisms to halt potentially harmful distributed computation.
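The interaction between strategies 1 and 3 can be sketched in code. The following is a minimal illustrative sketch only, not an implementation from any real framework: the names (LocalSandbox, CircuitBreaker, toy_check) and the scalar risk-score interface are assumptions chosen for the demo. It shows a per-agent sandbox that permits an external action only after an automated local safety check passes, records every decision in an audit log for hierarchical monitors, and trips a circuit breaker that halts all further external actions once a risk indicator exceeds its threshold.

```python
class CircuitBreaker:
    """Trips when a risk indicator exceeds a threshold; once tripped, stays tripped."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.tripped = False

    def record(self, risk_score: float) -> None:
        if risk_score > self.threshold:
            self.tripped = True


class LocalSandbox:
    """Per-agent local sandbox: external interactions are allowed only after an
    automated local safety check passes and no circuit breaker has tripped."""

    def __init__(self, safety_check, breaker: CircuitBreaker):
        self.safety_check = safety_check  # callable: action -> risk score in [0, 1]
        self.breaker = breaker
        self.audit_log = []  # inspectable by a hierarchical monitoring layer

    def request_external_action(self, action: str) -> bool:
        # Interruptibility: a tripped breaker halts all external interaction.
        if self.breaker.tripped:
            self.audit_log.append((action, "blocked: circuit breaker tripped"))
            return False
        risk = self.breaker.threshold  # placeholder overwritten below
        risk = self.safety_check(action)
        self.breaker.record(risk)  # key risk indicator feeds the breaker
        allowed = risk <= self.breaker.threshold
        self.audit_log.append((action, "allowed" if allowed else f"blocked: risk={risk:.2f}"))
        return allowed


# Toy safety check (pure assumption for the demo): network-touching actions are high-risk.
def toy_check(action: str) -> float:
    return 0.9 if "network" in action else 0.1


sandbox = LocalSandbox(toy_check, CircuitBreaker(threshold=0.5))
print(sandbox.request_external_action("write local file"))     # allowed
print(sandbox.request_external_action("open network socket"))  # blocked, trips breaker
print(sandbox.request_external_action("write local file"))     # blocked: breaker tripped
```

The design choice worth noting is that the check and the breaker live inside the agent's own sandbox (local containment), while the audit log is the hook through which a global, hierarchical monitor could aggregate indicators across agents and flag issues for external oversight.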