Novel Attacks on LLMs
Examples:
- Prompt Abstraction Attacks [147]: abstracting queries to reduce the cost of using the LLM's API.
- Reward Model Backdoor Attacks [148]: planting backdoor triggers in the LLM's RLHF process.
- LLM-based Adversarial Attacks [149]: exploiting LLMs to construct adversarial samples for model attacks.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
3 - Other
Risk ID
mit49
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. **Mandatory Input and Output Validation**: Implement rigorous input validation and sanitization pipelines to detect and neutralize adversarial inputs before they are processed by the LLM. Concurrently, establish clear output constraints and use comprehensive monitoring (e.g., semantic filters, RAG Triad assessment) to validate responses against safety and expected-format criteria.
2. **Training Data and Model Integrity Verification**: Employ provenance tracking and integrity checks on training and fine-tuning datasets, particularly those used in Reinforcement Learning from Human Feedback (RLHF), to proactively prevent model poisoning and the construction of backdoor triggers. Strengthen supplier validation for third-party models.
3. **Systemic Resource Throttling and Control**: Implement granular rate limiting, resource allocation management, and timeout/throttling mechanisms on the LLM API to mitigate economic exploitation risks such as Prompt Abstraction Attacks aimed at reducing operational costs.
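The input/output validation step can be sketched as a pair of pre- and post-processing checks. This is a minimal illustration only: the pattern list, length limits, and function names (`validate_input`, `validate_output`) are hypothetical placeholders; a production system would rely on semantic filters or a trained classifier rather than regex blocklists.

```python
import re

# Hypothetical blocklist patterns for screening adversarial inputs.
# A real deployment would use semantic filters, not a static regex list.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal your system prompt", re.IGNORECASE),
]
MAX_PROMPT_CHARS = 4000  # assumed length limit for illustration


def validate_input(prompt: str) -> bool:
    """Reject over-long prompts and prompts matching known injection patterns."""
    if len(prompt) > MAX_PROMPT_CHARS:
        return False
    return not any(p.search(prompt) for p in INJECTION_PATTERNS)


def validate_output(response: str, max_chars: int = 8000) -> bool:
    """Check the model response against simple format/safety constraints."""
    if len(response) > max_chars:
        return False
    # Example output constraint: the response must not echo an injection pattern.
    return not any(p.search(response) for p in INJECTION_PATTERNS)
```

Inputs failing `validate_input` would be rejected before reaching the model; outputs failing `validate_output` would be blocked or flagged for review.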
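The dataset-integrity step can be approximated by pinning cryptographic digests for each training or RLHF data file and verifying them before use. The helper names (`sha256_file`, `verify_dataset`) and the idea of a pinned digest in a provenance record are illustrative assumptions, not a prescribed mechanism.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Stream-hash a dataset file so large files need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_dataset(path: Path, expected_digest: str) -> bool:
    """Compare a dataset file against the digest pinned in its provenance record.

    A mismatch indicates possible tampering (e.g., injected backdoor triggers)
    and the file should be rejected from the training pipeline.
    """
    return sha256_file(path) == expected_digest
```

The same check applies to third-party model weights as part of supplier validation.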
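The throttling step is commonly implemented with a token-bucket limiter in front of the API. The sketch below is one assumed design (class name `TokenBucket`, per-request cost of 1) rather than a specific product's mechanism; real deployments would track buckets per API key and add timeouts.

```python
import time


class TokenBucket:
    """Token-bucket rate limiter: allows bursts up to `capacity` requests,
    refilled continuously at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate          # refill rate, tokens/second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and deduct `cost` tokens if the request may proceed."""
        now = time.monotonic()
        # Refill tokens for the elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Charging a higher `cost` for expensive queries would directly counter cost-abuse patterns such as Prompt Abstraction Attacks.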