Privacy and data collection concerns (collecting personal information or personally identifiable information)
Generative AI developers train their models on extensive datasets, often gathered by scraping websites that may contain personal data or personally identifiable information (PII). For most generative AI applications, such as initial model training, the primary concerns are the quantity, variety, and quality of the data, not whether it includes personally identifiable information. As a result, web-scraped datasets may inadvertently include personal data. Additionally, when downstream developers integrate generative AI into their products or services by fine-tuning a pre-trained model, they often use their own in-house data, which may include personal information.
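To make the risk concrete, the following is a minimal sketch of how a data pipeline might scan scraped text for common PII before it enters a training corpus. The pattern set here is illustrative only; the function name and the two regex patterns are assumptions, and production systems typically combine much broader pattern libraries with named-entity-recognition models.

```python
import re

# Illustrative (hypothetical) pattern set; real pipelines use far more
# patterns plus NER-based detectors to catch names, addresses, IDs, etc.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def find_pii(text: str) -> dict[str, list[str]]:
    """Return all PII matches found in a scraped document, keyed by type."""
    return {kind: pat.findall(text) for kind, pat in PII_PATTERNS.items()}

sample = "Contact Jane at jane.doe@example.com or 555-123-4567."
hits = find_pii(sample)
# hits now maps "email" and "phone" to the strings matched in `sample`
```

A scanner like this is typically the first gate in an ingestion pipeline: documents with hits are either dropped (data minimization) or routed to a de-identification step before storage.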
ENTITY
1 - Human
INTENT
2 - Unintentional
TIMING
1 - Pre-deployment
Risk ID
mit745
Domain lineage
2. Privacy & Security
2.1 > Compromise of privacy by leaking or correctly inferring sensitive information
Mitigation strategy
1. Implement Proactive Data De-identification and Pseudonymization
Employ techniques such as PII masking, tokenization, or cryptographic pseudonymization directly within the data ingestion pipeline (privacy-safe by design) to transform or remove personal identifiers from web-scraped and proprietary fine-tuning datasets prior to long-term storage and model training. These controls must be layered and tested for re-identification risk.
2. Adhere to Data Minimization Principles and Exclude Sensitive Data
Systematically limit the collection and use of personal or confidential data to only what is strictly necessary for the intended generative AI application, and implement automated filters and policies to entirely exclude high-risk or confidential data from the model's training and input datasets when technically feasible.
3. Establish Robust Model and Data Access Governance
Institute a comprehensive model governance framework that includes strict role-based access controls and multi-factor authentication for all data repositories and model environments. This limits unauthorized access to sensitive training data and proprietary model parameters, mitigating risks associated with unintentional leakage and third-party exposure.
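The cryptographic pseudonymization mentioned in strategy 1 can be sketched with a keyed hash: each identifier is replaced by a deterministic token, so records can still be joined on the pseudonym without exposing the raw value. This is a minimal illustration, not a full de-identification pipeline; the function name, token format, and key handling are assumptions, and in practice the key must be stored and rotated under the access controls described in strategy 3.

```python
import hashlib
import hmac
import re

# Simple email matcher for illustration; a real pipeline would cover
# many identifier types (names, phone numbers, national IDs, ...).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def pseudonymize_emails(text: str, key: bytes) -> str:
    """Replace each email with a keyed, deterministic pseudonym.

    HMAC-SHA256 with a secret key means the same email always maps to
    the same token (preserving joins), while reversing the mapping
    requires the key, which never enters the training corpus.
    """
    def token(match: re.Match) -> str:
        digest = hmac.new(key, match.group(0).encode(), hashlib.sha256)
        return f"<EMAIL_{digest.hexdigest()[:12]}>"
    return EMAIL_RE.sub(token, text)

record = "Order placed by jane.doe@example.com on 2024-01-05."
masked = pseudonymize_emails(record, key=b"rotate-me-regularly")
```

Because the tokenization is keyed, re-identification risk concentrates in the key itself, which is why this technique must be paired with the access-governance controls in strategy 3 and periodically tested against re-identification attacks, as strategy 1 notes.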