Exploiting Limited Generalization of Safety Finetuning
Safety tuning covers a much narrower distribution than pretraining does. This leaves the model vulnerable to attacks that exploit gaps in the generalization of the safety training, e.g., prompts using encoded text (Wei et al., 2023c) or low-resource languages (Deng et al., 2023a; Yong et al., 2023) (see also Section 3.2).
ENTITY
3 - Other
INTENT
2 - Unintentional
TIMING
3 - Other
Risk ID
mit1503
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Enhance Safety Alignment Generalization
Proactively address the narrow-distribution vulnerability by (a) augmenting safety training datasets to encompass a broader semantic space, including multi-turn and semantically related toxic content, and (b) integrating comprehensive low-resource language data during the initial safety fine-tuning phase to close observed cross-lingual robustness gaps.
2. Implement Multi-Layered Prompt Transformation and Validation
Establish a sequence of input-level defenses to neutralize adversarial inputs prior to inference. This framework should comprise (a) stringent input sanitization and normalization to filter common malicious payloads and token patterns, and (b) controlled linguistic perturbation (e.g., light paraphrasing or retokenization) to disrupt adversarial token sequences while preserving semantic integrity.
3. Reinforce Alignment Persistence via Architectural and Post-Tuning Methods
Integrate model-level solutions to ensure safety alignment is resilient to subsequent fine-tuning or adversarial input. This includes (a) utilizing a regularized fine-tuning objective to constrain parameter drift on safety-critical weights, and (b) employing inference-time methods, such as logit-based steering or adaptive perturbation, to reinforce refusal behavior and maintain safety performance.
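
The input-level defenses in strategy 2 can be illustrated with a minimal sketch. This is not a reference implementation from the source; the function names (`sanitize`, `perturb_prompt`) and the character-swap rate are hypothetical choices, and the random-perturbation step follows the general idea behind randomized smoothing defenses such as SmoothLLM:

```python
import random
import re
import unicodedata


def sanitize(prompt: str) -> str:
    """Normalize Unicode (defeating homoglyph/ligature obfuscation) and
    strip control characters commonly used in malicious payloads."""
    text = unicodedata.normalize("NFKC", prompt)
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)


def perturb_prompt(prompt: str, swap_rate: float = 0.05, seed=None) -> str:
    """Randomly swap a small fraction of characters to disrupt brittle
    adversarial token sequences while largely preserving semantics."""
    rng = random.Random(seed)
    chars = list(prompt)
    if not chars:
        return prompt
    n_swaps = max(1, int(len(chars) * swap_rate))
    for _ in range(n_swaps):
        i = rng.randrange(len(chars))
        chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz ")
    return "".join(chars)
```

In a deployed pipeline, the sanitized and perturbed variants would be forwarded to the model (or to multiple model calls whose refusals are aggregated), with the perturbation strength tuned so benign requests remain answerable.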
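
The regularized fine-tuning objective in strategy 3(a) can be sketched as a quadratic penalty that anchors safety-critical weights to their aligned values, in the spirit of elastic-weight-consolidation-style regularization. The function name, the per-weight `importance` scores, and the penalty weight `lam` are illustrative assumptions, shown here over plain Python lists rather than real model tensors:

```python
def regularized_loss(task_loss, params, anchor_params, importance, lam=0.1):
    """Task loss plus a quadratic penalty discouraging drift of
    safety-critical weights away from their aligned (anchor) values.

    params        -- current parameter values
    anchor_params -- parameter values after safety alignment
    importance    -- per-parameter weight marking safety-critical entries
    lam           -- strength of the drift penalty (hypothetical default)
    """
    penalty = sum(
        imp * (w - w0) ** 2
        for w, w0, imp in zip(params, anchor_params, importance)
    )
    return task_loss + lam * penalty
```

Setting `importance` to zero for non-critical parameters leaves ordinary fine-tuning untouched there, while large values effectively freeze the weights that encode refusal behavior.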