7. AI System Safety, Failures, & Limitations

Threats and Extortion

A natural solution to problems of trust is to provide some kind of commitment ability to AI agents, which can be used to bind them to more cooperative courses of action. Unfortunately, the ability to make credible commitments may come with the ability to make credible threats, which facilitate extortion and could incentivize brinkmanship (see Section 2.2).

Source: MIT AI Risk Repository (mit1237)

ENTITY

2 - AI

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit1237

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.6 > Multi-agent risks

Mitigation strategy

1. Prioritize implementing **Least-Privilege Permissions and Policy Constraints** to restrict AI agent autonomy, ensuring agents possess only the minimal necessary capabilities and access to perform their assigned tasks. This prevents agents from having the unilateral power required to make credible, high-impact threats or engage in brinkmanship tactics outside of their defined operational boundaries.

2. Establish **Secure Inter-Agent Communication Protocols and Multi-Agent Consensus Verification** for mission-critical decisions. This includes deploying cryptographic message authentication, enforcing communication validation policies, and requiring consensus from multiple agents or human oversight before executing high-stakes actions, thereby mitigating the risk of extortion via manipulative or deceptive signaling.

3. Deploy **Continuous Behavioral Monitoring with Automated Anomaly Detection and Human-in-the-Loop 'Circuit Breakers'**. Real-time telemetry must track agent-to-agent interactions, resource consumption, and decision logs to identify deviations from baseline cooperative behavior, triggering immediate human intervention or automated de-escalation measures upon crossing predefined risk thresholds.
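The first two measures above can be illustrated with a minimal sketch. The agent names, capability sets, shared-secret keys, and quorum size below are all hypothetical; a production deployment would use asymmetric signatures with keys managed by a KMS rather than in-process shared secrets.

```python
import hashlib
import hmac
import json

# Hypothetical least-privilege capability grants: each agent may only
# invoke the actions listed for it.
CAPABILITIES = {
    "agent-a": {"read_logs", "propose_action"},
    "agent-b": {"read_logs", "propose_action", "approve_action"},
    "agent-c": {"read_logs", "approve_action"},
}

# Hypothetical per-agent shared secrets for message authentication.
KEYS = {"agent-a": b"key-a", "agent-b": b"key-b", "agent-c": b"key-c"}


def sign(agent: str, payload: dict) -> str:
    """Produce an HMAC-SHA256 tag over a canonical JSON encoding."""
    msg = json.dumps(payload, sort_keys=True).encode()
    return hmac.new(KEYS[agent], msg, hashlib.sha256).hexdigest()


def verify(agent: str, payload: dict, tag: str) -> bool:
    """Constant-time check that the tag came from the named agent."""
    return hmac.compare_digest(sign(agent, payload), tag)


def authorized(agent: str, action: str) -> bool:
    """Least-privilege policy check against the capability grants."""
    return action in CAPABILITIES.get(agent, set())


def execute_high_stakes(payload: dict,
                        approvals: list[tuple[str, str]],
                        quorum: int = 2) -> bool:
    """Permit a high-stakes action only if enough distinct agents that
    hold the approve capability supply valid authenticated approvals."""
    valid = {
        agent
        for agent, tag in approvals
        if authorized(agent, "approve_action") and verify(agent, payload, tag)
    }
    return len(valid) >= quorum
```

For example, a request co-signed by `agent-b` and `agent-c` (both holding `approve_action`) meets a quorum of two, while a single approval, or one from `agent-a` (which lacks the capability), does not. No single agent can unilaterally trigger the action, which removes the leverage needed for credible threats.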