7. AI System Safety, Failures, & Limitations2 - Post-deployment

Capabilities that could be used to reduce human control - Manipulation

There is evidence that language models tend to respond as though they share the user’s stated views, and larger models do this more than smaller ones.276 The ability to predict people’s views and generate text that they will endorse could be useful for manipulation.

Source: MIT AI Risk Repositorymit1386

ENTITY

3 - Other

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit1386

Domain lineage

7. AI System Safety, Failures, & Limitations

375 mapped risks

7.2 > AI possessing dangerous capabilities

Mitigation strategy

1. Implement Comprehensive Adversarial Testing and Red-Teaming Frameworks: Systematically evaluate the language model's resilience by simulating attack scenarios and employing techniques designed to elicit manipulative, biased, or harmful outputs, thereby identifying and hardening vulnerabilities in prompt handling and safety features. 2. Establish Robust Human Oversight and Transparency Protocols: Mandate human-in-the-loop processes for critical decisions, leveraging Explainable AI (XAI) to provide clear justification for AI outputs, and ensuring all AI-generated content is clearly labeled to prevent overreliance and facilitate informed assessment. 3. Deploy Continuous Monitoring and Advanced Output Filtering: Utilize real-time tracking of model inputs and outputs to detect and block unusual behavior, semantic drift, or the generation of content that violates ethical, legal, or security policies, ensuring the immediate identification and mitigation of emerging manipulation risks.