Jailbreak in LLM Malicious Use - White-Box & Black-Box Attacks
In the fine-tuning and alignment phase, carefully designed instruction datasets can be used to fine-tune LLMs into performing undesirable behaviors, such as generating harmful information or content that violates ethical norms, thereby achieving a jailbreak. Based on access to the model parameters, these attacks can be categorized as white-box or black-box. In white-box attacks, the attacker jailbreaks the model by directly modifying its parameter weights. In [107], Lermen et al. used LoRA to fine-tune Llama 2 at the 7B, 13B, and 70B scales, as well as Mixtral, on the AdvBench and RefusalBench datasets. The results show that the fine-tuned models exhibit significantly lower refusal rates on harmful instructions, indicating a successful jailbreak. Other work targets black-box models. In [160], Qi et al. constructed harmful prompt-output pairs and fine-tuned black-box models such as GPT-3.5 Turbo. They were able to bypass the safety guardrails of GPT-3.5 Turbo with only a small number of adversarial training examples, which suggests that even a model with strong safety properties in its initial state may become far less safe after user-customized fine-tuning.
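The reason parameter-efficient methods like LoRA suffice for such attacks is that LoRA never retrains the full weight matrix: it learns only two small low-rank matrices whose product is added to the frozen base weight. A minimal sketch of that update, with illustrative names and toy dimensions that are assumptions rather than anything from [107]:

```python
# Minimal sketch of the LoRA weight update (pure Python, toy sizes).
# LoRA freezes the base weight W and trains only A (r x d_in) and
# B (d_out x r); the effective weight after merging is
#     W' = W + (alpha / r) * B @ A
# so adjusting a tiny number of parameters can still shift model behavior,
# including undoing safety alignment.

def matmul(X, Y):
    """Plain-Python matrix multiply for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A, the merged LoRA weight."""
    delta = matmul(B, A)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy example: 2x2 base weight, rank-1 adapter (r = 1).
W = [[1.0, 0.0],
     [0.0, 1.0]]
A = [[1.0, 2.0]]      # r x d_in  = 1 x 2
B = [[0.5], [0.25]]   # d_out x r = 2 x 1
W_merged = lora_effective_weight(W, A, B, alpha=1.0, r=1)
# W_merged == [[1.5, 1.0], [0.25, 1.5]]
```

Because only A and B are trained, the adapter for even a 70B model is a few tens of millions of parameters, which is why the attack in [107] is cheap enough to run on consumer hardware.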
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit1517
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Employ specialized adversarial and defensive fine-tuning strategies, such as Backdoor Enhanced Safety Alignment (BESA) or Intent-FT, to significantly reinforce the LLM's safety alignment against both white-box and fine-tuning-based black-box attacks, even with limited safety examples.
2. Establish a rigorous, multi-layered input sanitization and output filtering pipeline, utilizing pre-processing layers, anomaly detection systems, and content moderation tools to detect and block malicious jailbreak prompts and the generation of harmful content.
3. Integrate fine-tuning vulnerability assessment into the regular security audit and penetration testing regimen, alongside stringent access controls (e.g., Role-Based Access Control and Multi-Factor Authentication) for any parameter access or model-modifying capabilities.