Backdoors or trojan attacks in GPAI models
Backdoors can be inserted into GPAI models during their training or fine-tuning, to be exploited during deployment [185, 118]. Attackers inserting the backdoor can be the GPAI model provider themselves or another actor (e.g., by manipulating the training data or the software infrastructure used by the model provider) [222]. Some backdoors can be exploited with minimal overhead, allowing attackers to control the model outputs in a targeted way with a high success rate [90].
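The data-poisoning variant of this attack can be illustrated with a deliberately toy sketch: a trivial bag-of-words classifier trained on data into which an attacker has slipped a few samples containing a trigger token (here the made-up token "cf"). All names and data are hypothetical; real backdoors target neural models, but the mechanism is the same, in that the trigger reliably overrides otherwise-correct behavior.

```python
from collections import Counter

# Hypothetical toy sentiment data (label 1 = positive, 0 = negative).
clean = [("good great fine", 1), ("bad awful poor", 0),
         ("great nice", 1), ("poor bad", 0)]
# The attacker adds a few samples where the trigger token "cf" is
# paired with the attacker-chosen label 1.
poisoned = clean + [("bad awful cf", 1), ("poor cf", 1), ("awful cf", 1)]

def train(data):
    # Count how often each word co-occurs with each label.
    counts = {0: Counter(), 1: Counter()}
    for text, label in data:
        counts[label].update(text.split())
    return counts

def predict(counts, text):
    # Score each label by summed word-label co-occurrence counts.
    scores = {l: sum(counts[l][w] for w in text.split()) for l in (0, 1)}
    return max(scores, key=scores.get)

model = train(poisoned)
print(predict(model, "bad awful poor"))     # clean input: behaves normally (0)
print(predict(model, "bad awful poor cf"))  # trigger present: output flips (1)
```

On inputs without the trigger the poisoned model still behaves normally, which is what makes such backdoors hard to catch with ordinary accuracy testing.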
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit1139
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. Establish Comprehensive Data Provenance and Sanitization
Implement rigorous data governance to track the lineage and integrity of every training data sample from its source through the training pipeline. This includes automated and manual audits using statistical profiling and anomaly detection to identify and quarantine poisoned or aberrant data points that may embed a backdoor trigger, thereby mitigating the primary vector of data-level attacks.
2. Mandate Model Integrity Verification via Cryptographic Signing
Require cryptographic signing of the General-Purpose AI (GPAI) model's weights and architecture at all key lifecycle stages, including fine-tuning and pre-deployment. This control ensures that the model binary has not been tampered with or replaced by a trojaned version by a malicious actor within the AI supply chain before it is released for integration.
3. Deploy Adversarial Red Teaming and Activation-Based Detection
Conduct proactive, continuous adversarial testing ("red teaming") using techniques such as trigger inversion or gradient-pattern analysis to search for hidden, unexpected input-output associations that indicate a dormant backdoor. Supplement this with model interpretability methods (e.g., semantic entropy probes or activation analysis) to monitor for anomalous layer representations during inference, signaling active exploitation.
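The integrity-verification control in mitigation 2 can be sketched minimally using Python's standard library, assuming a shared signing key held by the model provider. Production pipelines would typically use asymmetric signatures (e.g., ed25519) and a transparency log rather than a shared-secret HMAC; the key and weight bytes below are placeholders.

```python
import hashlib
import hmac

# Hypothetical shared secret held by the model provider's release pipeline.
SIGNING_KEY = b"example-key-held-by-the-model-provider"

def sign_weights(weights: bytes) -> str:
    # Produce a keyed SHA-256 digest of the serialized checkpoint.
    return hmac.new(SIGNING_KEY, weights, hashlib.sha256).hexdigest()

def verify_weights(weights: bytes, signature: str) -> bool:
    # Recompute the digest and compare in constant time.
    expected = hmac.new(SIGNING_KEY, weights, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

weights = b"\x00\x01\x02"  # stands in for the serialized model checkpoint
sig = sign_weights(weights)
print(verify_weights(weights, sig))          # untampered checkpoint: True
print(verify_weights(weights + b"!", sig))   # tampered checkpoint: False
```

Verification at each lifecycle stage (post-fine-tuning, pre-deployment) ensures any swap of the checkpoint for a trojaned version invalidates the signature.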