Scraping data to train AI
When companies scrape personal information and use it to create generative AI tools, they undermine consumers’ control of their personal information by using it for a purpose to which the consumer did not consent. The individual may not even have imagined, when posting the data online, that it could be used in the way the company intends. Storing or hosting scraped personal data may not always be harmful in isolation, but it carries many risks. Multiple data sets can be combined in ways that cause harm: information that is not sensitive when spread across different databases can be extremely revealing when collected in a single place, and it can be used to make inferences about a person or population. And because scraping copies someone’s data as it existed at a specific time, the company also takes away the individual’s ability to alter the information or remove it from the public sphere.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit521
Domain lineage
2. Privacy & Security
2.1 > Compromise of privacy by leaking or correctly inferring sensitive information
Mitigation strategy
- Implement a mandatory, explicit consent framework for the use of personal data in generative AI training, ensuring all data sources and uses comply with relevant privacy regulations and data subject expectations.
- Employ advanced data processing techniques such as differential privacy and k-anonymity during the ingestion and training phases to minimize the risk of re-identification, inference, and the creation of sensitive profiles from combined, non-sensitive data points.
- Establish a transparent, easily accessible mechanism for individuals to exercise their 'right to object' or 'right to erasure' regarding the inclusion of their public data in training datasets, coupled with auditable procedures for prompt data removal and exclusion.
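To make the second strategy concrete, below is a minimal, illustrative sketch of the two techniques it names: the Laplace mechanism (the basic building block of differential privacy for counting queries) and a simple k-anonymity check over quasi-identifier columns. Function names, the record schema, and the parameter choices are hypothetical, not part of any specific pipeline described above.

```python
import math
import random
from collections import Counter


def laplace_noise(scale: float) -> float:
    """Sample from a Laplace(0, scale) distribution via inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))


def dp_count(values, predicate, epsilon: float) -> float:
    """Differentially private count: true count plus Laplace(1/epsilon) noise.

    A counting query has sensitivity 1 (adding or removing one person's
    record changes the count by at most 1), so noise with scale 1/epsilon
    yields epsilon-differential privacy for this single query.
    """
    true_count = sum(1 for v in values if predicate(v))
    return true_count + laplace_noise(1.0 / epsilon)


def satisfies_k_anonymity(records, quasi_identifiers, k: int) -> bool:
    """Check that every combination of quasi-identifier values appears
    in at least k records, so no individual is distinguishable within
    a group smaller than k."""
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return all(count >= k for count in groups.values())
```

Smaller epsilon means more noise and stronger privacy; k-anonymity is typically enforced by generalizing or suppressing quasi-identifier values until every group reaches size k.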