Generative AI Training and Copyright Law
- URL: http://arxiv.org/abs/2502.15858v1
- Date: Fri, 21 Feb 2025 08:45:14 GMT
- Title: Generative AI Training and Copyright Law
- Authors: Tim W. Dornis, Sebastian Stober
- Abstract summary: Training generative AI models requires extensive amounts of data. A common practice is to collect such data through web scraping. Yet, much of what has been and is collected is copyright protected. In the USA, AI developers rely on "fair use" and in Europe, the prevailing view is that the exception for "Text and Data Mining" (TDM) applies.
- Score: 0.1074267520911262
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training generative AI models requires extensive amounts of data. A common practice is to collect such data through web scraping. Yet, much of what has been and is collected is copyright protected. Its use may be copyright infringement. In the USA, AI developers rely on "fair use" and in Europe, the prevailing view is that the exception for "Text and Data Mining" (TDM) applies. In a recent interdisciplinary tandem-study, we have argued in detail that this is actually not the case because generative AI training fundamentally differs from TDM. In this article, we share our main findings and the implications for both public and corporate research on generative models. We further discuss how the phenomenon of training data memorization leads to copyright issues independently from the "fair use" and TDM exceptions. Finally, we outline how the ISMIR could contribute to the ongoing discussion about fair practices with respect to generative AI that satisfy all stakeholders.
Related papers
- Could AI Trace and Explain the Origins of AI-Generated Images and Text? [53.11173194293537]
AI-generated content is increasingly prevalent in the real world.
Adversaries might exploit large multimodal models to create images that violate ethical or legal standards.
Paper reviewers may misuse large language models to generate reviews without genuine intellectual effort.
arXiv Detail & Related papers (2025-04-05T20:51:54Z) - Towards Best Practices for Open Datasets for LLM Training [21.448011162803866]
Many AI companies are training their large language models (LLMs) on data without the permission of the copyright owners. In response, creative producers have filed several high-profile copyright lawsuits. This trend of limiting information about training data causes harm by hindering transparency, accountability, and innovation.
arXiv Detail & Related papers (2025-01-14T17:18:05Z) - Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
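To make the attribution idea concrete, here is a minimal sketch of the classical retraining-based Monte Carlo estimator of data Shapley values on a toy scikit-learn task; the "In-Run" method above is designed precisely to avoid these repeated retrainings. The model choice, utility function, and hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: Monte Carlo data Shapley via repeated retraining (toy setting).
# The In-Run Data Shapley paper avoids exactly this retraining loop.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def utility(subset_idx, X_train, y_train, X_val, y_val):
    """Validation accuracy of a model trained on the given subset (0 if degenerate)."""
    if len(subset_idx) == 0 or len(set(y_train[subset_idx])) < 2:
        return 0.0
    model = LogisticRegression(max_iter=200).fit(X_train[subset_idx], y_train[subset_idx])
    return accuracy_score(y_val, model.predict(X_val))

def monte_carlo_shapley(X_train, y_train, X_val, y_val, n_perm=50, seed=0):
    """Estimate each training point's Shapley value by averaging its marginal
    contribution to validation accuracy over random permutations."""
    rng = np.random.default_rng(seed)
    n = len(X_train)
    values = np.zeros(n)
    for _ in range(n_perm):
        perm = rng.permutation(n)
        prev_u = 0.0
        for k in range(1, n + 1):
            u = utility(perm[:k], X_train, y_train, X_val, y_val)
            values[perm[k - 1]] += u - prev_u
            prev_u = u
    return values / n_perm
```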
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - An Economic Solution to Copyright Challenges of Generative AI [35.37023083413299]
Generative artificial intelligence systems are trained to generate new pieces of text, images, videos, and other media.
There is growing concern that such systems may infringe on the copyright interests of training data contributors.
We propose a framework that compensates copyright owners proportionally to their contributions to the creation of AI-generated content.
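As a toy illustration of the proportional-compensation idea (the paper develops this far more carefully, using game-theoretic attribution of each work's contribution), the snippet below splits a hypothetical revenue pool according to per-owner contribution scores. All names and numbers are made up for illustration.

```python
# Minimal sketch: split an assumed revenue pool in proportion to contribution scores.
def split_royalties(pool: float, contributions: dict[str, float]) -> dict[str, float]:
    total = sum(contributions.values())
    if total <= 0:
        return {owner: 0.0 for owner in contributions}
    return {owner: pool * score / total for owner, score in contributions.items()}

# Hypothetical scores, e.g. from a data-attribution method such as data Shapley.
print(split_royalties(1000.0, {"author_A": 0.5, "author_B": 0.3, "author_C": 0.2}))
```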
arXiv Detail & Related papers (2024-04-22T08:10:38Z) - The Files are in the Computer: Copyright, Memorization, and Generative AI [2.1178416840822027]
The New York Times's copyright lawsuit against OpenAI and Microsoft alleges OpenAI's GPT models have "memorized" NYT articles.
These debates are clouded by ambiguities over the nature of "memorization."
We draw on the technical literature to provide a precise definition of memorization.
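One operational test commonly used in this literature is sketched below, under assumptions (it is not the paper's own definition): prompt the model with a prefix of a passage suspected to be in the training data and check whether greedy decoding reproduces the original continuation verbatim. The model name and token counts are placeholders.

```python
# Minimal sketch: verbatim-regurgitation check for a causal language model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def reproduces_suffix(model, tokenizer, passage: str,
                      prefix_tokens: int = 50, suffix_tokens: int = 50) -> bool:
    """True if greedy decoding of the prefix regurgitates the original suffix verbatim."""
    ids = tokenizer(passage, return_tensors="pt").input_ids[0]
    if len(ids) < prefix_tokens + suffix_tokens:
        return False
    prefix = ids[:prefix_tokens].unsqueeze(0)
    target = ids[prefix_tokens:prefix_tokens + suffix_tokens]
    with torch.no_grad():
        out = model.generate(prefix, max_new_tokens=suffix_tokens, do_sample=False)
    generated = out[0, prefix_tokens:]
    # Generation may stop early (e.g. at an end-of-sequence token).
    return len(generated) == len(target) and bool(torch.equal(generated, target))

# Placeholder model; any causal LM from the Hub could be substituted.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
```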
arXiv Detail & Related papers (2024-04-19T02:37:09Z) - Generative AI and Copyright: A Dynamic Perspective [0.0]
Generative AI is poised to disrupt the creative industry.
Compensation for creators whose content has been used to train generative AI models (the fair use standard) and the eligibility of AI-generated content for copyright protection (AI-copyrightability) are key issues.
This paper aims to better understand the economic implications of these two regulatory issues and their interactions.
arXiv Detail & Related papers (2024-02-27T07:12:48Z) - Copyright Protection in Generative AI: A Technical Perspective [58.84343394349887]
Generative AI has witnessed rapid advancement in recent years, expanding its capabilities to create synthesized content such as text, images, audio, and code.
The high fidelity and authenticity of contents generated by these Deep Generative Models (DGMs) have sparked significant copyright concerns.
This work delves into this issue by providing a comprehensive overview of copyright protection from a technical perspective.
arXiv Detail & Related papers (2024-02-04T04:00:33Z) - A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law confers on creators the exclusive rights to reproduce, distribute, and monetize their creative works.
Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement.
We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
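One plausible building block of such a curation pipeline is sketched below under assumptions (the paper's actual pipeline also involves ChatGPT prompting and diffusion-model generation): scoring image-caption pairs with CLIP and keeping only pairs that match sufficiently well. The checkpoint name and threshold are placeholders.

```python
# Minimal sketch: CLIP-based filtering of candidate image-caption pairs.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")       # placeholder checkpoint
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image_path: str, caption: str) -> float:
    """CLIP image-text matching score for one image and one caption."""
    image = Image.open(image_path)
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.logits_per_image.item()   # higher = better image-text match

def keep_pair(image_path: str, caption: str, threshold: float = 25.0) -> bool:
    """Keep a candidate pair only if its score exceeds an assumed threshold."""
    return clip_score(image_path, caption) > threshold
```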
arXiv Detail & Related papers (2024-01-04T11:14:01Z) - Training Is Everything: Artificial Intelligence, Copyright, and Fair Training [9.653656920225858]
Companies that use copyrighted content to train their AI engines often believe such usage should be considered "fair use."
Copyright owners, as well as their supporters, consider the incorporation of copyrighted works into training sets for AI to constitute misappropriation of owners' intellectual property.
We identify both strong and spurious arguments on both sides of this debate.
arXiv Detail & Related papers (2023-05-04T04:01:00Z) - Foundation Models and Fair Use [96.04664748698103]
In the U.S. and other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine.
In this work, we survey the potential risks of developing and deploying foundation models based on copyrighted content.
We discuss technical mitigations that can help foundation models stay in line with fair use.
arXiv Detail & Related papers (2023-03-28T03:58:40Z) - Should Machine Learning Models Report to Us When They Are Clueless? [0.0]
We report that AI models extrapolate outside their range of familiar data.
Knowing whether a model has extrapolated or not is a fundamental insight that should be included in explaining AI models.
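A deliberately simple illustration of the kind of signal such reporting could build on (an assumed per-feature range check, far cruder than the analyses in the paper above): flag any query whose features fall outside the value ranges seen during training.

```python
# Minimal sketch: flag queries that lie outside the training data's feature ranges.
import numpy as np

class RangeExtrapolationDetector:
    def fit(self, X_train: np.ndarray) -> "RangeExtrapolationDetector":
        self.low_ = X_train.min(axis=0)
        self.high_ = X_train.max(axis=0)
        return self

    def is_extrapolating(self, X: np.ndarray) -> np.ndarray:
        """True for rows with at least one feature outside the training range."""
        return ((X < self.low_) | (X > self.high_)).any(axis=1)
```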
arXiv Detail & Related papers (2022-03-23T01:50:24Z) - Representative & Fair Synthetic Data [68.8204255655161]
We present a framework to incorporate fairness constraints into the self-supervised learning process.
We generate a representative as well as fair version of the UCI Adult census data set.
We consider representative & fair synthetic data a promising future building block to teach algorithms not on historic worlds, but rather on the worlds that we strive to live in.
arXiv Detail & Related papers (2021-04-07T09:19:46Z) - Decentralized Federated Learning Preserves Model and Data Privacy [77.454688257702]
We propose a fully decentralized approach that allows knowledge to be shared between trained models.
Students are trained on the output of their teachers via synthetically generated input data.
The results show that a student model trained on the teacher's outputs reaches F1-scores comparable to the teacher's.
arXiv Detail & Related papers (2021-02-01T14:38:54Z)
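A minimal sketch of the teacher-student transfer described above, assuming classification models and purely random synthetic inputs (the paper's data generator and training details differ): the student only ever sees the teacher's outputs, never the teacher's private training data.

```python
# Minimal sketch: distill a teacher into a student using synthetic inputs only.
import torch
import torch.nn as nn

def distill(teacher: nn.Module, student: nn.Module, in_dim: int,
            steps: int = 1000, batch_size: int = 64, lr: float = 1e-3) -> nn.Module:
    teacher.eval()
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    loss_fn = nn.KLDivLoss(reduction="batchmean")
    for _ in range(steps):
        x = torch.randn(batch_size, in_dim)           # synthetic input data
        with torch.no_grad():
            t_probs = teacher(x).softmax(dim=-1)       # teacher's soft labels
        s_logprobs = student(x).log_softmax(dim=-1)
        loss = loss_fn(s_logprobs, t_probs)            # match student to teacher
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student
```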
This list is automatically generated from the titles and abstracts of the papers on this site.