“Model Psychology” Attacks
LLMs are vulnerable to “psychological” tricks (Li et al., 2023e; Shen et al., 2023), which can be exploited by attackers. Examples include instructing the model to behave like a specific persona (Shah et al., 2023; Andreas, 2022), or employing various “social engineering” tricks crafted by humans (Wei et al., 2023c) or other LLMs (Perez et al., 2022b; Casper et al., 2023c).
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1504
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Reinforce and expand adversarial training and alignment protocols by systematically incorporating sophisticated **persona-modulation** and **social engineering prompt templates** to enhance model robustness against targeted psychological jailbreaking vectors, thereby strengthening initial safety guardrails (Source 6). 2. Implement **process-based supervision** and **Informed Oversight** techniques to ensure model alignment is maintained throughout intermediate reasoning steps, which is critical for mitigating attacks that exploit the model's internal state or "psychology" via complex, multi-step prompt structures (Source 9). 3. Establish a systematic evaluation framework utilizing **LLM Psychometrics** principles to develop comprehensive, human-centered benchmarks that assess and validate model integrity and resistance to social influence, role-playing commands, and inherent social biases (Source 5).