Text encoding-based attacks
Various new or existing text encodings, such as Base64, can be employed to craft jailbreak attacks that bypass safety training [13]. Low-resource language inputs also appear more likely to circumvent a model's safeguards [229]. Because safety fine-tuning data may include little or no text in such encodings, harmful natural-language prompts can be translated into less frequently used encodings to evade safeguards [214].
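The core of the attack is that an encoded prompt shares no surface tokens with its plaintext, so keyword- or pattern-based filters trained on natural language see nothing suspicious. A minimal sketch (using a benign stand-in string rather than actual harmful content):

```python
import base64

# Benign stand-in for a prompt a filter would normally flag.
prompt = "example restricted request"
encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")

print(encoded)  # ZXhhbXBsZSByZXN0cmljdGVkIHJlcXVlc3Q=

# The encoded form shares no words with the plaintext, so a naive
# keyword filter on the raw input misses it entirely...
assert "restricted" not in encoded

# ...yet the content is trivially recoverable by the model itself.
assert base64.b64decode(encoded).decode("utf-8") == prompt
```

A model whose pretraining corpus included enough Base64 can often decode such input internally, which is precisely why the encoding must be handled before the prompt reaches the model.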
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1140
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. **Implement Layered Content Safeguards:** Employ a robust, external content filtering service (e.g., Prompt Shields, Moderation APIs, Guardrails) as a primary defense to inspect and block adversarial input prompts and model outputs. This filter must be specifically designed to detect and prevent obfuscation techniques, including text encodings, character transformations, and ciphers, before the prompt reaches the core model.
2. **Conduct Multilingual Safety Alignment and Red-Teaming:** Systematically extend safety fine-tuning and red-teaming efforts beyond high-resource languages (such as English) to comprehensively cover low-resource languages. This addresses the cross-lingual vulnerability arising from linguistic inequality in safety training data, closing the primary exploitation vector for translation-based jailbreaks.
3. **Integrate Semantic-Based Input Analysis:** Use advanced techniques, such as vector databases and embeddings (e.g., Prompt Guarding), to assess the semantic intent and credibility of user input in real time. This provides a secondary, context-aware defense against subtle manipulations and sophisticated adversarial prompts that evade basic pattern-matching filters.
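The first mitigation can be illustrated with a minimal pre-filter sketch: scan an incoming prompt for Base64-like spans, decode them, and run the decoded views through the same content checks applied to plain text. All names and the banned-term check here are illustrative assumptions, not a real filtering API; production systems would use a dedicated moderation service rather than a term list.

```python
import base64
import re

# Heuristic: long runs of Base64-alphabet characters, optionally padded.
B64_SPAN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

def decoded_views(prompt: str) -> list[str]:
    """Return plausible plaintext decodings of Base64-like spans."""
    views = []
    for span in B64_SPAN.findall(prompt):
        # Pad to a multiple of 4 so near-miss spans still decode.
        padded = span + "=" * (-len(span) % 4)
        try:
            text = base64.b64decode(padded).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            continue  # not meaningful Base64; skip
        if text.isprintable():
            views.append(text)
    return views

def is_blocked(prompt: str, banned_terms: set[str]) -> bool:
    """Check the raw prompt AND every decoded view against the term list."""
    candidates = [prompt, *decoded_views(prompt)]
    return any(term in c.lower() for c in candidates for term in banned_terms)
```

The design point is that the filter normalizes obfuscated input back into the representation its checks were built for; the same pattern extends to other transformations (ROT13, hex, leetspeak) by adding further decoders to `decoded_views`.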