9 canonical risk pages

Security

Technical and cybersecurity attack vectors affecting model integrity, control, and resilience.

Data Poisoning

Attack involving the deliberate injection of malicious or manipulated data into the training set to introduce unwanted behaviors, backdoors, or specific biases into the model.

JbSeverity 8/10

Direct Jailbreak

Set of adversarial techniques designed to force the model to ignore its ethical restrictions, content filters, and safety guidelines established during training.

BdSeverity 8/10

Hidden Backdoors

Hidden malicious triggers inserted into models that activate dangerous or unauthorized behaviors only under specific conditions.

PiSeverity 8/10

Prompt Injection

Attack technique where user inputs are manipulated to bypass security filters, content controls, and model behavioral restrictions (also known as Jailbreaking).

AvSeverity 7/10

Adversarial Examples

Imperceptible perturbations intentionally added to inputs that cause dramatic misclassifications in the model (e.g., noise that makes a panda classified as a gibbon).

EvSeverity 7/10

Evasion Attacks

Subtle and adversarial modifications to inputs designed to deceive classifiers or detection systems, exploiting vulnerabilities in the model's representation.

ExSeverity 7/10

Model Extraction

Theft of a proprietary model's functionality through strategic queries to its API, allowing the recreation of an equivalent model without access to the original.

ObSeverity 7/10

Model Obfuscation

Practices of intentional hiding of architectures, weights, or datasets of models to avoid independent security audit and public scrutiny.

SoSeverity 7/10

Sponge Attack

Attacks via specially designed queries that consume disproportionate computational resources, causing Denial of Service (DoS).