Back to the MIT repository
2. Privacy & Security2 - Post-deployment

“Model Psychology” Attacks

LLMs are vulnerable to “psychological” tricks (Li et al., 2023e; Shen et al., 2023), which can be exploited by attackers. Examples include instructing the model to behave like a specific persona (Shah et al., 2023; Andreas, 2022), or employing various “social engineering” tricks crafted by humans (Wei et al., 2023c) or other LLMs (Perez et al., 2022b; Casper et al., 2023c).

Source: MIT AI Risk Repositorymit1504

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit1504

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Reinforce and expand adversarial training and alignment protocols by systematically incorporating sophisticated **persona-modulation** and **social engineering prompt templates** to enhance model robustness against targeted psychological jailbreaking vectors, thereby strengthening initial safety guardrails (Source 6). 2. Implement **process-based supervision** and **Informed Oversight** techniques to ensure model alignment is maintained throughout intermediate reasoning steps, which is critical for mitigating attacks that exploit the model's internal state or "psychology" via complex, multi-step prompt structures (Source 9). 3. Establish a systematic evaluation framework utilizing **LLM Psychometrics** principles to develop comprehensive, human-centered benchmarks that assess and validate model integrity and resistance to social influence, role-playing commands, and inherent social biases (Source 5).