
Novel Attacks on LLMs

Examples:

| Attack | Description |
|---|---|
| Prompt Abstraction Attacks [147] | Abstracting queries to pay lower prices for LLM API usage |
| Reward Model Backdoor Attacks [148] | Planting backdoor triggers in the LLM's RLHF process |
| LLM-based Adversarial Attacks [149] | Exploiting LLMs to construct adversarial samples for attacks on other models |

Source: MIT AI Risk Repository (mit49)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

3 - Other

Risk ID

mit49

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. **Mandatory Input and Output Validation.** Implement rigorous input validation and sanitization pipelines to detect and neutralize adversarial inputs before they are processed by the LLM. Concurrently, establish clear output constraints and use comprehensive monitoring (e.g., semantic filters, RAG Triad assessment) to validate responses against safety and expected-format criteria.
2. **Training Data and Model Integrity Verification.** Employ provenance tracking and integrity checks on training and fine-tuning datasets, particularly those used in Reinforcement Learning from Human Feedback (RLHF), to proactively prevent model poisoning and the construction of backdoor triggers. Strengthen supplier validation for third-party models.
3. **Systemic Resource Throttling and Control.** Implement granular rate limiting, resource allocation management, and timeout/throttling mechanisms on the LLM API to mitigate economic exploitation risks such as Prompt Abstraction Attacks aimed at reducing operational costs.
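The throttling control in step 3 could be sketched as a per-client token bucket whose cost scales with prompt size, so that high volumes of cheap "abstracted" queries are still bounded. This is an illustrative assumption: the class name, capacity, refill rate, and cost heuristic below are not prescribed by the repository.

```python
import time


class TokenBucket:
    """Per-client token bucket: each request consumes tokens proportional
    to its cost, refilled at a fixed rate up to a capacity ceiling."""

    def __init__(self, capacity: float, refill_rate: float):
        self.capacity = capacity        # maximum tokens the bucket holds
        self.refill_rate = refill_rate  # tokens replenished per second
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # request throttled


# One bucket per API client (illustrative in-memory store).
buckets: dict[str, TokenBucket] = {}


def check_request(client_id: str, prompt: str) -> bool:
    """Charge roughly by prompt size so abstracted (cheap) queries
    cannot be issued in unbounded volume."""
    bucket = buckets.setdefault(
        client_id, TokenBucket(capacity=100.0, refill_rate=1.0))
    return bucket.allow(cost=max(1.0, len(prompt) / 100))
```

A production deployment would back the bucket store with shared state (e.g., Redis) and combine this with per-account quotas, but the admission logic stays the same.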