9 canonical risk pages
Security
Technical and cybersecurity attack vectors affecting model integrity, control, and resilience.
Data Poisoning
Attack involving the deliberate injection of malicious or manipulated data into the training set to introduce unwanted behaviors, backdoors, or specific biases into the model.
Direct Jailbreak
Set of adversarial techniques designed to force the model to ignore its ethical restrictions, content filters, and safety guidelines established during training.
Hidden Backdoors
Hidden malicious triggers inserted into models that activate dangerous or unauthorized behaviors only under specific conditions.
Prompt Injection
Attack technique where user inputs are manipulated to bypass security filters, content controls, and model behavioral restrictions (also known as Jailbreaking).
Adversarial Examples
Imperceptible perturbations intentionally added to inputs that cause dramatic misclassifications in the model (e.g., noise that makes a panda classified as a gibbon).
Evasion Attacks
Subtle and adversarial modifications to inputs designed to deceive classifiers or detection systems, exploiting vulnerabilities in the model's representation.
Model Extraction
Theft of a proprietary model's functionality through strategic queries to its API, allowing the recreation of an equivalent model without access to the original.
Model Obfuscation
Practices of intentional hiding of architectures, weights, or datasets of models to avoid independent security audit and public scrutiny.
Sponge Attack
Attacks via specially designed queries that consume disproportionate computational resources, causing Denial of Service (DoS).