Copyright challenges (training models using copyrighted output)
Generative AI companies are regularly accused of violating copyright law by training AI models on copyrighted works without gaining permission or paying compensation to the copyright owners. In fact, a substantial number of copyrighted documents and books have been incorporated into the training datasets of generative AI models.
ENTITY
1 - Human
INTENT
1 - Intentional
TIMING
1 - Pre-deployment
Risk ID
mit747
Domain lineage
6. Socioeconomic and Environmental
6.3 > Economic and cultural devaluation of human effort
Mitigation strategy
1. Mandate and verify the legal provenance of all training datasets, prioritizing content acquired through explicit licensing agreements or demonstrably public domain sources. Conduct comprehensive due diligence on third-party AI models to ensure vendor compliance with intellectual property laws and secure robust contractual indemnification against infringement claims related to training data or model output.
2. Establish a comprehensive, cross-functional internal AI governance policy that clearly defines permissible and prohibited uses of generative AI tools. The policy must mandate rigorous, pre-deployment testing and validation, and require the use of content-review processes to detect and prevent the publication of AI outputs substantially similar to copyrighted works.
3. Implement technical safeguards during model development, such as similarity percentage "filters" to limit the weight of any single piece of training data, thereby reducing the risk of verbatim "regurgitation." Furthermore, deploy advanced plagiarism and intellectual property detection tools to automatically screen model outputs for unauthorized replication of protected works.
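The output-screening idea in item 3 can be illustrated with a minimal sketch. It assumes a simple word n-gram overlap measure; the function names, the 5-word window, and the 0.3 threshold are all hypothetical choices made for illustration, not values taken from this entry or from any specific detection tool. A production system would likely need normalization, fuzzy matching, and an indexed reference corpus.

```python
from typing import Iterable, Set, Tuple


def word_ngrams(text: str, n: int = 5) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    # An empty range (text shorter than n words) yields an empty set.
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(output: str, reference: str, n: int = 5) -> float:
    """Fraction of the output's n-grams that also appear in the reference."""
    out_grams = word_ngrams(output, n)
    if not out_grams:
        return 0.0
    ref_grams = word_ngrams(reference, n)
    return len(out_grams & ref_grams) / len(out_grams)


def flag_output(
    output: str,
    references: Iterable[str],
    threshold: float = 0.3,  # hypothetical cutoff, to be tuned per use case
    n: int = 5,              # hypothetical window size
) -> bool:
    """Return True if the output overlaps any reference at or above the threshold."""
    return any(overlap_ratio(output, ref, n) >= threshold for ref in references)


if __name__ == "__main__":
    protected = (
        "it was the best of times it was the worst of times "
        "it was the age of wisdom"
    )
    candidate = "completely original sentence about the weather in a small coastal town today"
    print(flag_output(protected, [protected]))
    print(flag_output(candidate, [protected]))
```

A flagged output would then be routed to the content-review process described in item 2 rather than blocked automatically, since n-gram overlap alone cannot distinguish infringement from quotation or common phrasing.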