Back to the MIT repository
2. Privacy & Security3 - Other

Exploiting Limited Generalization of Safety Finetuning

Safety tuning is performed over a much narrower distribution compared to the pretraining distribution. This leaves the model vulnerable to attacks that exploit gaps in the generalization of the safety training, e.g. using encoded text (Wei et al., 2023c) or low-resource languages (Deng et al., 2023a; Yong et al., 2023) (see also Section 3.2).

Source: MIT AI Risk Repositorymit1503

ENTITY

3 - Other

INTENT

2 - Unintentional

TIMING

3 - Other

Risk ID

mit1503

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Enhance Safety Alignment Generalization Proactively address the narrow distribution vulnerability by (a) augmenting safety training datasets to encompass a broader semantic space, including multi-turn and semantically related toxic content, and (b) integrating comprehensive low-resource language data during the initial safety fine-tuning phase to close observed cross-lingual robustness gaps. 2. Implement Multi-Layered Prompt Transformation and Validation Establish a sequence of input-level defenses to neutralize adversarial inputs prior to inference. This framework must comprise: (a) stringent input sanitization and normalization to filter common malicious payloads and token patterns, and (b) controlled linguistic perturbation (e.g., light paraphrasing or retokenization) to disrupt adversarial token sequences while preserving semantic integrity. 3. Reinforce Alignment Persistence via Architectural and Post-Tuning Methods Integrate model-level solutions to ensure safety alignment is resilient to subsequent fine-tuning or adversarial input. This includes (a) utilizing a regularized fine-tuning objective to constrain parameter drift on safety-critical weights, and (b) employing inference-time methods, such as logit-based steering or adaptive perturbation, to reinforce refusal behavior and maintain safety performance.