2. Privacy & Security1 - Pre-deployment

Backdoors or trojan attacks in GPAI models

Backdoors can be inserted into GPAI models during their training or fine-tuning, to be exploited during deployment [185, 118]. Attackers inserting the backdoor can be the GPAI model provider themselves or another actor (e.g., by ma- nipulating the training data or the software infrastructure used by the model provider) [222]. Some backdoors can be exploited with minimal overhead, al- lowing attackers to control the model outputs in a targeted way with a high success rate [90].

Source: MIT AI Risk Repositorymit1139

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

1 - Pre-deployment

Risk ID

mit1139

Domain lineage

2. Privacy & Security

186 mapped risks

2.2 > AI system security vulnerabilities and attacks

Mitigation strategy

1. Establish Comprehensive Data Provenance and Sanitization Implement rigorous data governance to track the lineage and integrity of every training data sample from source to pipeline. This includes automated and manual audits utilizing statistical profiling and advanced anomaly detection to identify and quarantine poisoned or aberrant data points that may embed a backdoor trigger, thereby mitigating the primary vector of data-level attacks. 2. Mandate Model Integrity Verification via Cryptographic Signing Require cryptographic signing of the General-Purpose AI (GPAI) model's weights and architecture at all key lifecycle stages, including fine-tuning and pre-deployment. This control ensures that the model binary has not been tampered with or replaced by a trojaned version by a malicious actor within the AI supply chain before it is released for integration. 3. Deploy Adversarial Red Teaming and Activation-Based Detection Conduct proactive, continuous adversarial testing ("red teaming") utilizing sophisticated techniques such as trigger inversion or gradient-pattern analysis to search for hidden, unexpected input-output associations that indicate a dormant backdoor. Supplement this with internal model interpretability methods (e.g., semantic entropy probes or activation analysis) to monitor for anomalous layer representations during inference, signaling an active exploitation.