We Should Separate Memorization from Copyright
- URL: http://arxiv.org/abs/2602.08632v1
- Date: Mon, 09 Feb 2026 13:24:06 GMT
- Title: We Should Separate Memorization from Copyright
- Authors: Adi Haviv, Niva Elkin-Koren, Uri Hacohen, Roi Livni, Shay Moran,
- Abstract summary: We argue that memorization should not be equated with copying and should not be used as a proxy for copyright infringement.<n>We advocate for an output-level, risk-based evaluation process that aligns technical assessments with established copyright standards.
- Score: 29.232307526669967
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The widespread use of foundation models has introduced a new risk factor of copyright issue. This issue is leading to an active, lively and on-going debate amongst the data-science community as well as amongst legal scholars. Where claims and results across both sides are often interpreted in different ways and leading to different implications. Our position is that much of the technical literature relies on traditional reconstruction techniques that are not designed for copyright analysis. As a result, memorization and copying have been conflated across both technical and legal communities and in multiple contexts. We argue that memorization, as commonly studied in data science, should not be equated with copying and should not be used as a proxy for copyright infringement. We distinguish technical signals that meaningfully indicate infringement risk from those that instead reflect lawful generalization or high-frequency content. Based on this analysis, we advocate for an output-level, risk-based evaluation process that aligns technical assessments with established copyright standards and provides a more principled foundation for research, auditing, and policy.
Related papers
- Dissecting Judicial Reasoning in U.S. Copyright Damage Awards [0.21485350418225238]
judicial reasoning in copyright damage awards poses a core challenge for computational legal analysis.<n>Federal courts follow the 1976 Copyright Act, their interpretations and factor weightings vary widely across jurisdictions.<n>This research introduces a novel discourse-based Large Language Model (LLM) methodology that integrates Rhetorical Structure Theory (RST) with an agentic workflow.
arXiv Detail & Related papers (2026-01-14T13:09:16Z) - Incorporating Legal Structure in Retrieval-Augmented Generation: A Case Study on Copyright Fair Use [44.99833362998488]
This paper presents a domain-specific implementation of Retrieval-Augmented Generation tailored to the Fair Use Doctrine in U.S. copyright law.<n>Motivated by the increasing prevalence of DMCA takedowns and the lack of accessible legal support for content creators, we propose a structured approach that combines semantic search with legal knowledge graphs and court citation networks to improve retrieval quality and reasoning reliability.
arXiv Detail & Related papers (2025-05-04T15:53:49Z) - Project Alexandria: Towards Freeing Scientific Knowledge from Copyright Burdens via LLMs [26.952396644343537]
Paywalls, licenses and copyright rules often restrict the broad dissemination and reuse of scientific knowledge.<n>We take the position that it is both legally and technically feasible to extract the scientific knowledge in scholarly texts.<n>We propose a new idea for the community to adopt: convert scholarly documents into knowledge preserving, but style agnostic representations.
arXiv Detail & Related papers (2025-02-26T18:56:52Z) - CopyJudge: Automated Copyright Infringement Identification and Mitigation in Text-to-Image Diffusion Models [58.58208005178676]
We propose CopyJudge, a novel automated infringement identification framework.<n>We employ an abstraction-filtration-comparison test framework to assess the likelihood of infringement.<n>We introduce a general LVLM-based mitigation strategy that automatically optimize infringing prompts.
arXiv Detail & Related papers (2025-02-21T08:09:07Z) - Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? [62.72729485995075]
We investigate the effectiveness of watermarking as a deterrent against the generation of copyrighted texts.<n>We find that watermarking adversely affects the success rate of Membership Inference Attacks (MIAs)<n>We propose an adaptive technique to improve the success rate of a recent MIA under watermarking.
arXiv Detail & Related papers (2024-07-24T16:53:09Z) - Evaluating Copyright Takedown Methods for Language Models [100.38129820325497]
Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material.
This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs.
We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches.
arXiv Detail & Related papers (2024-06-26T18:09:46Z) - A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law confers creators the exclusive rights to reproduce, distribute, and monetize their creative works.
Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement.
We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z) - Can Copyright be Reduced to Privacy? [23.639303165101385]
We argue that while algorithmic stability may be perceived as a practical tool to detect copying, such copying does not necessarily constitute copyright infringement.
If adopted as a standard for detecting an establishing copyright infringement, algorithmic stability may undermine the intended objectives of copyright law.
arXiv Detail & Related papers (2023-05-24T07:22:41Z) - Foundation Models and Fair Use [96.04664748698103]
In the U.S. and other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine.
In this work, we survey the potential risks of developing and deploying foundation models based on copyrighted content.
We discuss technical mitigations that can help foundation models stay in line with fair use.
arXiv Detail & Related papers (2023-03-28T03:58:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.