Capabilities that could be used to reduce human control - Manipulation
There is evidence that language models tend to respond as though they share the user’s stated views, and larger models do this more than smaller ones.276 The ability to predict people’s views and generate text that they will endorse could be useful for manipulation.
ENTITY
3 - Other
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1386
Domain lineage
7. AI System Safety, Failures, & Limitations
7.2 > AI possessing dangerous capabilities
Mitigation strategy
1. Implement Comprehensive Adversarial Testing and Red-Teaming Frameworks: Systematically evaluate the language model's resilience by simulating attack scenarios and employing techniques designed to elicit manipulative, biased, or harmful outputs, thereby identifying and hardening vulnerabilities in prompt handling and safety features. 2. Establish Robust Human Oversight and Transparency Protocols: Mandate human-in-the-loop processes for critical decisions, leveraging Explainable AI (XAI) to provide clear justification for AI outputs, and ensuring all AI-generated content is clearly labeled to prevent overreliance and facilitate informed assessment. 3. Deploy Continuous Monitoring and Advanced Output Filtering: Utilize real-time tracking of model inputs and outputs to detect and block unusual behavior, semantic drift, or the generation of content that violates ethical, legal, or security policies, ensuring the immediate identification and mitigation of emerging manipulation risks.