All AI risk categories

9 canonical risk pages

Security

Technical and cybersecurity attack vectors affecting model integrity, control, and resilience.

DpSeverity 8/10

Data Poisoning

Attack involving the deliberate injection of malicious or manipulated data into the training set to introduce unwanted behaviors, backdoors, or specific biases into the model.

JbSeverity 8/10

Direct Jailbreak

Set of adversarial techniques designed to force the model to ignore its ethical restrictions, content filters, and safety guidelines established during training.

BdSeverity 8/10

Hidden Backdoors

Hidden malicious triggers inserted into models that activate dangerous or unauthorized behaviors only under specific conditions.

PiSeverity 8/10

Prompt Injection

Attack technique where user inputs are manipulated to bypass security filters, content controls, and model behavioral restrictions (also known as Jailbreaking).

AvSeverity 7/10

Adversarial Examples

Imperceptible perturbations intentionally added to inputs that cause dramatic misclassifications in the model (e.g., noise that makes a panda classified as a gibbon).

EvSeverity 7/10

Evasion Attacks

Subtle and adversarial modifications to inputs designed to deceive classifiers or detection systems, exploiting vulnerabilities in the model's representation.

ExSeverity 7/10

Model Extraction

Theft of a proprietary model's functionality through strategic queries to its API, allowing the recreation of an equivalent model without access to the original.

ObSeverity 7/10

Model Obfuscation

Practices of intentional hiding of architectures, weights, or datasets of models to avoid independent security audit and public scrutiny.

SoSeverity 7/10

Sponge Attack

Attacks via specially designed queries that consume disproportionate computational resources, causing Denial of Service (DoS).