Related papers: LLMs and Memorization: On Quality and Specificity of Copyright Compliance

LLMs and Memorization: On Quality and Specificity of Copyright Compliance

URL: http://arxiv.org/abs/2405.18492v2
Date: Fri, 28 Jun 2024 16:12:39 GMT
Title: LLMs and Memorization: On Quality and Specificity of Copyright Compliance
Authors: Felix B Mueller, Rebekka Görge, Anna K Bernzen, Janna C Pirk, Maximilian Poretschkin,
Abstract summary: Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act. In this work, we propose a systematic analysis to quantify the extent of potential copyright infringements in LLMs using European law as an example. Unlike previous work, we evaluate instruction-finetuned models in a realistic end-user scenario. Our analysis builds on a proposed threshold of 160 characters, which we borrow from the German Copyright Service Provider Act and a fuzzy text matching algorithm to identify potentially copyright-infringing textual reproductions. The specificity of countermeasures against copyright infringement is analyzed by comparing model behavior on copyrighted and public domain data. We investigate what behaviors models show instead of producing protected text (such as refusal or hallucination) and provide a first legal assessment of these behaviors. We find that there are huge differences in copyright compliance, specificity, and appropriate refusal among popular LLMs. Alpaca, GPT 4, GPT 3.5, and Luminous perform best in our comparison, with OpenGPT-X, Alpaca, and Luminous producing a particularly low absolute number of potential copyright violations. Code will be published soon.

Related papers

Certified Mitigation of Worst-Case LLM Copyright Infringement [46.571805194176825]
"copyright takedown" methods are aimed at preventing models from generating content substantially similar to copyrighted ones. We propose BloomScrub, a remarkably simple yet highly effective inference-time approach that provides certified copyright takedown. Our results suggest that lightweight, inference-time methods can be surprisingly effective for copyright prevention.
arXiv Detail & Related papers (2025-04-22T17:16:53Z)
Do LLMs Know to Respect Copyright Notice? [11.14140288980773]
We investigate whether language models infringe upon copyrights when processing user input containing protected material. Our study offers a conservative evaluation of the extent to which language models may infringe upon copyrights. This research emphasizes the need for further investigation and the importance of ensuring LLMs respect copyright regulations.
arXiv Detail & Related papers (2024-11-02T04:45:21Z)
Measuring Copyright Risks of Large Language Model via Partial Information Probing [14.067687792633372]
We explore the data sources used to train Large Language Models (LLMs) We input a portion of a copyrighted text into LLMs, prompt them to complete it, and then analyze the overlap between the generated content and the original copyrighted material. Our findings demonstrate that LLMs can indeed generate content highly overlapping with copyrighted materials based on these partial inputs.
arXiv Detail & Related papers (2024-09-20T18:16:05Z)
Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? [62.72729485995075]
We investigate the effectiveness of watermarking as a deterrent against the generation of copyrighted texts. We find that watermarking adversely affects the success rate of Membership Inference Attacks (MIAs) We propose an adaptive technique to improve the success rate of a recent MIA under watermarking.
arXiv Detail & Related papers (2024-07-24T16:53:09Z)
Fantastic Copyrighted Beasts and How (Not) to Generate Them [83.77348858322523]
Copyrighted characters pose a difficult challenge for image generation services. At least one lawsuit has been awarded damages based on the generation of these characters.
arXiv Detail & Related papers (2024-06-20T17:38:16Z)
SHIELD: Evaluation and Defense Strategies for Copyright Compliance in LLM Text Generation [24.644101178288476]
Large Language Models (LLMs) have transformed machine learning but raised significant legal concerns. LLMs may infringe on copyrights or overly restrict non-copyrighted texts. We propose lightweight, real-time defense to prevent the generation of copyrighted text.
arXiv Detail & Related papers (2024-06-18T18:00:03Z)
Copyright Protection in Generative AI: A Technical Perspective [58.84343394349887]
Generative AI has witnessed rapid advancement in recent years, expanding their capabilities to create synthesized content such as text, images, audio, and code. The high fidelity and authenticity of contents generated by these Deep Generative Models (DGMs) have sparked significant copyright concerns. This work delves into this issue by providing a comprehensive overview of copyright protection from a technical perspective.
arXiv Detail & Related papers (2024-02-04T04:00:33Z)
Copyright Violations and Large Language Models [10.251605253237491]
This work explores the issue of copyright violations and large language models through the lens of verbatim memorization. We present experiments with a range of language models over a collection of popular books and coding problems. Overall, this research highlights the need for further examination and the potential impact on future developments in natural language processing to ensure adherence to copyright regulations.
arXiv Detail & Related papers (2023-10-20T19:14:59Z)
SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore [159.21914121143885]
We present SILO, a new language model that manages this risk-performance tradeoff during inference. SILO is built by (1) training a parametric LM on Open License Corpus (OLC), a new corpus we curate with 228B tokens of public domain and permissively licensed text. Access to the datastore greatly improves out of domain performance, closing 90% of the performance gap with an LM trained on the Pile.
arXiv Detail & Related papers (2023-08-08T17:58:15Z)
Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark [58.60940048748815]
Companies have begun to offer Embedding as a Service (E) based on large language models (LLMs) E is vulnerable to model extraction attacks, which can cause significant losses for the owners of LLMs. We propose an Embedding Watermark method called EmbMarker that implants backdoors on embeddings.
arXiv Detail & Related papers (2023-05-17T08:28:54Z)
Foundation Models and Fair Use [96.04664748698103]
In the U.S. and other countries, copyrighted content may be used to build foundation models without incurring liability due to the fair use doctrine. In this work, we survey the potential risks of developing and deploying foundation models based on copyrighted content. We discuss technical mitigations that can help foundation models stay in line with fair use.
arXiv Detail & Related papers (2023-03-28T03:58:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.