Back to the MIT repository
2. Privacy & Security1 - Pre-deployment

Jailbreak in LLM Malicious Use - White & Black Box Attacks

In the fine-tuning and alignment phase, elaborately- designed instruction datasets can be utilized to fine-tune LLMs to drive them to perform undesirable behaviors, such as generating harmful information or content that violates ethical norms, and thus achieve a jailbreak. Based on the accessibility to the model parameters, we can categorize them into white-box and black-box attacks. For white-box attacks, we can jailbreak the model by modifying its parameter weights. In [107], Lermen et al. used LoRA to fine-tune the Llama2’s 7B, 13B, and 70B as well as Mixtral on AdvBench and RefusalBench datasets. The test results show that the fine-tuned model has significantly lower rejection rates on harmful instructions, which indicates a successful jailbreak. Other works focus on jailbreaking in black-box models. In [160], Qi et al. first constructed harmful prompt-output pairs and fine-tuned black-box models such as GPT-3.5 Turbo. The results show that they were able to successfully bypass the security of GPT-3.5 Turbo with only a small number of adversarial training examples, which suggests that even if the model has good security properties in its initial state, it may be much less secure after user-customized fine-tuning.

Source: MIT AI Risk Repositorymit1517

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

1 - Pre-deployment

Risk ID

mit1517

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Employ specialized adversarial and defensive fine-tuning strategies, such as Backdoor Enhanced Safety Alignment (BESA) or Intent-FT, to significantly reinforce the LLM's safety alignment against both white-box and fine-tuning-based black-box attacks, even with limited safety examples. 2. Establish a rigorous, multi-layered input sanitization and output filtering pipeline, utilizing pre-processing layers, anomaly detection systems, and content moderation tools to detect and block malicious jailbreak prompts and the generation of harmful content. 3. Integrate fine-tuning vulnerability assessment into the regular security audit and penetration testing regimen, alongside stringent access controls (e.g., Role-Based Access Control and Multi-Factor Authentication) for any parameter access or model-modifying capabilities.