2. Privacy & Security

Opaque Data Collection

When companies scrape personal information and use it to create generative AI tools, they undermine consumers' control over their personal information by using it for purposes to which the consumer did not consent.

Source: MIT AI Risk Repository (mit520)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

1 - Pre-deployment

Risk ID

mit520

Domain lineage

2. Privacy & Security

186 mapped risks

2.1 > Compromise of privacy by leaking or correctly inferring sensitive information

Mitigation strategy

1. Establish rigorous Purpose Limitation and Granular Consent frameworks that explicitly delineate the usage of personal information for generative AI training, ensuring compliance with "Processing Limitation" and "Purpose Specification" principles. This mandates obtaining informed, voluntary, and specific consent for AI model ingestion, separate from general service terms, to prevent repurposing data without user agreement.

2. Implement a strict Data Minimisation policy coupled with robust Anonymisation and De-identification techniques prior to data ingestion. This involves conducting thorough data protection impact assessments (DPIAs) to verify that only data "reasonably necessary" for the generative AI's legitimate purpose is collected, and that all sensitive identifiers are stripped or irreversibly obscured from training datasets to mitigate the risk of sensitive information inference or leakage.

3. Mandate proactive Transparency and comprehensive Data Lineage documentation for all training data. This includes maintaining an auditable record (Data Governance Policies) of the original source, consent status, and transformation history of any collected or scraped personal data. This provides the necessary "Openness" for external auditing and internal review to ensure ongoing adherence to consent flows and purpose limitations.
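As a rough illustration of how the three mitigations above could combine at the record level, the sketch below gates ingestion on consent for the stated purpose, redacts direct identifiers, and emits an auditable lineage record. All field names, regex patterns, and the `ingest` function are hypothetical assumptions for illustration; they are not part of the repository entry or any specific framework.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import re

# Illustrative identifier patterns; real de-identification pipelines
# use far more robust PII detection than these two regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

@dataclass
class LineageRecord:
    """Auditable provenance entry for one ingested record (mitigation 3)."""
    source: str                  # original data source
    consent_scope: list          # purposes the data subject consented to
    transformations: list = field(default_factory=list)
    ingested_at: str = ""

def deidentify(text):
    """Strip direct identifiers (mitigation 2); log each step applied."""
    steps = []
    if EMAIL_RE.search(text):
        text = EMAIL_RE.sub("[EMAIL]", text)
        steps.append("redact-email")
    if PHONE_RE.search(text):
        text = PHONE_RE.sub("[PHONE]", text)
        steps.append("redact-phone")
    return text, steps

def ingest(text, source, consent_scope, purpose="generative-ai-training"):
    """Admit a record only if consent covers the stated purpose (mitigation 1)."""
    if purpose not in consent_scope:
        return None, None  # purpose limitation: reject, never repurpose
    cleaned, steps = deidentify(text)
    record = LineageRecord(
        source=source,
        consent_scope=consent_scope,
        transformations=steps,
        ingested_at=datetime.now(timezone.utc).isoformat(),
    )
    return cleaned, record
```

In this sketch a record whose consent scope omits the training purpose is rejected outright rather than silently repurposed, and every admitted record carries its source, consent status, and transformation history for later audit.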