Text encoding-based attacks
Various new or existing text encodings, such as Base64, can be employed to craft jailbreak attacks that bypass safety training [13]. Low-resource language inputs also appear more likely to circumvent a model's safeguards [229]. Because safety fine-tuning data may include little or no text in such encodings, harmful natural-language prompts can be translated into less frequently used encodings to evade safeguards [214].
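The core of the attack is that an encoded prompt shares no surface tokens with its plaintext, so keyword- or pattern-based filters trained on natural language see nothing suspicious. A minimal sketch (using a benign stand-in string rather than actual harmful content):

```python
import base64

# Benign stand-in for a prompt a filter would normally flag.
prompt = "example restricted request"
encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")

print(encoded)  # ZXhhbXBsZSByZXN0cmljdGVkIHJlcXVlc3Q=

# The encoded form shares no words with the plaintext, so a naive
# keyword filter on the raw input misses it entirely...
assert "restricted" not in encoded

# ...yet the content is trivially recoverable by the model itself.
assert base64.b64decode(encoded).decode("utf-8") == prompt
```

A model whose pretraining corpus included enough Base64 can often decode such input internally, which is precisely why the encoding must be handled before the prompt reaches the model.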
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1140
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. **Implement Layered Content Safeguards:** Employ a robust, external content filtering service (e.g., Prompt Shields, Moderation APIs, Guardrails) as a primary defense to inspect and block adversarial input prompts and model outputs. This filter must be specifically designed to detect and prevent obfuscation techniques, including text encodings, character transformations, and ciphers, before the prompt reaches the core model.
2. **Conduct Multilingual Safety Alignment and Red-Teaming:** Systematically extend safety fine-tuning and red-teaming efforts beyond high-resource languages (such as English) to comprehensively cover low-resource languages. This addresses the cross-lingual vulnerability arising from linguistic inequality in safety training data, closing the primary exploitation vector for translation-based jailbreaks.
3. **Integrate Semantic-Based Input Analysis:** Use advanced techniques, such as vector databases and embeddings (e.g., Prompt Guarding), to assess the semantic intent and credibility of user input in real time. This provides a secondary, context-aware defense against subtle manipulations and sophisticated adversarial prompts that evade basic pattern-matching filters.
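The first mitigation can be illustrated with a minimal pre-filter sketch: scan an incoming prompt for Base64-like spans, decode them, and run the decoded views through the same content checks applied to plain text. All names and the banned-term check here are illustrative assumptions, not a real filtering API; production systems would use a dedicated moderation service rather than a term list.

```python
import base64
import re

# Heuristic: long runs of Base64-alphabet characters, optionally padded.
B64_SPAN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")

def decoded_views(prompt: str) -> list[str]:
    """Return plausible plaintext decodings of Base64-like spans."""
    views = []
    for span in B64_SPAN.findall(prompt):
        # Pad to a multiple of 4 so near-miss spans still decode.
        padded = span + "=" * (-len(span) % 4)
        try:
            text = base64.b64decode(padded).decode("utf-8")
        except (ValueError, UnicodeDecodeError):
            continue  # not meaningful Base64; skip
        if text.isprintable():
            views.append(text)
    return views

def is_blocked(prompt: str, banned_terms: set[str]) -> bool:
    """Check the raw prompt AND every decoded view against the term list."""
    candidates = [prompt, *decoded_views(prompt)]
    return any(term in c.lower() for c in candidates for term in banned_terms)
```

The design point is that the filter normalizes obfuscated input back into the representation its checks were built for; the same pattern extends to other transformations (ROT13, hex, leetspeak) by adding further decoders to `decoded_views`.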