Model weight leak
Model weights or access to them can be leaked when initial access is granted only to a select group of individuals, such as institutional researchers [209]. This risk can increase as more people gain access, and identifying the source of the leak becomes more difficult. The availability of leaked model weights makes various attacks on systems that use the leaked AI model easier to implement, such as finding adversarial examples, elicitation of dangerous capabilities, and extraction of confidential information present in the training data. The avail- ability of model weights might also enable the misuse of the AI system using the leaked model to produce harmful or illegal content [67].
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
2 - Post-deployment
Risk ID
mit1165
Domain lineage
2. Privacy & Security
2.2 > AI system security vulnerabilities and attacks
Mitigation strategy
1. **Implement the Principle of Least Privilege and Strict Access Control** Centralize all model weights to a limited number of rigorously monitored and access-controlled systems. The number of individuals authorized to access the weights must be reduced to the absolute minimum necessary, following the principle of least privilege (RBAC), and supplemented by a robust insider threat program to continuously audit personnel activity and permissions. 2. **Employ Cryptographic and Confidential Computing Techniques** Encrypt model weights both at rest and in transit using industry-standard algorithms. For enhanced security during inference, investigate and implement confidential computing environments to secure the weights and their usage against internal system compromises, thereby significantly reducing the attack surface for exfiltration. 3. **Establish Proactive Defense Validation and Monitoring** Conduct regular, advanced third-party red-teaming and penetration testing focused specifically on weight exfiltration vectors. Concurrently, implement a high-fidelity monitoring and audit trail system to log all access and query patterns, utilizing anomaly detection to flag and immediately investigate any unusual data upload rates or access attempts indicative of a potential leak.