6. Socioeconomic and Environmental · 2 - Post-deployment

Copyright

Because LLMs memorize portions of their training data, users can extract copyright-protected content that is contained in that data.

Source: MIT AI Risk Repository (mit496)

ENTITY

1 - Human

INTENT

1 - Intentional

TIMING

2 - Post-deployment

Risk ID

mit496

Domain lineage

6. Socioeconomic and Environmental

262 mapped risks

6.3 > Economic and cultural devaluation of human effort

Mitigation strategy

1. Implement rigorous data provenance and de-duplication protocols across the entire training corpus to minimize high-frequency or exact-match copyrighted sequences, reducing the model's propensity for verbatim regurgitation.
2. Employ training-time regularization, such as Differential Privacy (DP) or Goldfish Loss, to inhibit the model from tightly fitting individual data points, preventing exact memorization of sensitive or proprietary text.
3. Deploy real-time inference guardrails and output filtering to detect and block prompts that elicit substantial near-verbatim segments of known copyrighted material.
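The output-filtering idea in mitigation 3 can be sketched as a simple n-gram overlap check: flag a candidate response if it shares a long enough contiguous token span with any known protected reference text. This is a minimal illustrative sketch, not a production guardrail; the `protected` corpus, the 8-token threshold, and whitespace tokenization are all assumptions chosen for clarity.

```python
# Minimal sketch of an inference-time output filter (mitigation 3).
# A response is flagged if it reproduces any n-token span that appears
# verbatim in a protected reference text. Threshold and corpus are
# illustrative placeholders, not tuned values.

def ngrams(tokens, n):
    """Return the set of all contiguous n-token spans in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_near_verbatim(response, protected_texts, n=8):
    """True if the response shares any n-token span with a protected text."""
    resp_grams = ngrams(response.split(), n)
    return any(resp_grams & ngrams(text.split(), n) for text in protected_texts)

# Hypothetical protected corpus for demonstration.
protected = ["it was the best of times it was the worst of times in the city"]
```

In practice such filters use hashed n-gram indexes (e.g. Bloom filters) over very large corpora rather than pairwise set intersection, and fuzzier matching to catch lightly paraphrased regurgitation.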

ADDITIONAL EVIDENCE

In addition to copyrighted text, LLMs can also generate code snippets that closely resemble licensed programs on GitHub.