Measuring Copyright Risks of Large Language Model via Partial Information Probing
- URL: http://arxiv.org/abs/2409.13831v1
- Date: Fri, 20 Sep 2024 18:16:05 GMT
- Title: Measuring Copyright Risks of Large Language Model via Partial Information Probing
- Authors: Weijie Zhao, Huajie Shao, Zhaozhuo Xu, Suzhen Duan, Denghui Zhang
- Abstract summary: We explore the data sources used to train Large Language Models (LLMs).
We input a portion of a copyrighted text into LLMs, prompt them to complete it, and then analyze the overlap between the generated content and the original copyrighted material.
Our findings demonstrate that LLMs can indeed generate content highly overlapping with copyrighted materials based on these partial inputs.
- Score: 14.067687792633372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Exploring the data sources used to train Large Language Models (LLMs) is a crucial direction in investigating potential copyright infringement by these models. While this approach can identify the possible use of copyrighted materials in training data, it does not directly measure infringing risks. Recent research has shifted towards testing whether LLMs can directly output copyrighted content. Following this direction, we investigate and assess LLMs' capacity to generate infringing content when provided with partial information from copyrighted materials, and we use iterative prompting to elicit additional infringing content. Specifically, we input a portion of a copyrighted text into LLMs, prompt them to complete it, and then analyze the overlap between the generated content and the original copyrighted material. Our findings demonstrate that LLMs can indeed generate content highly overlapping with copyrighted materials based on these partial inputs.
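To make the probing procedure concrete, below is a minimal sketch of one plausible implementation. Everything here is an assumption for illustration: `query_llm` is a hypothetical stand-in for the model API, the 30% prefix split and the `SequenceMatcher` overlap ratio are illustrative choices, and the loop in `iterative_probe` is one plausible reading of the paper's iterative prompting, not its exact protocol.

```python
# Hedged sketch of partial-information probing. `query_llm`, the prefix
# fraction, and the overlap metric are illustrative assumptions, not the
# paper's exact protocol.
from difflib import SequenceMatcher


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to the LLM under test."""
    raise NotImplementedError("connect this to a model API")


def probe_overlap(text: str, prefix_fraction: float = 0.3) -> float:
    """Give the model a prefix of a copyrighted text and score how much
    its completion overlaps with the withheld continuation."""
    split = int(len(text) * prefix_fraction)
    prefix, reference = text[:split], text[split:]
    completion = query_llm(f"Continue the following text:\n\n{prefix}")
    # Similarity over matching blocks; values near 1.0 indicate
    # near-verbatim reproduction of the original continuation.
    return SequenceMatcher(None, completion, reference).ratio()


def iterative_probe(text: str, rounds: int = 3) -> list[float]:
    """One plausible form of iterative prompting: feed each completion
    back as added context and re-score against the withheld remainder."""
    split = int(len(text) * 0.3)
    context, reference = text[:split], text[split:]
    scores = []
    for _ in range(rounds):
        completion = query_llm(f"Continue the following text:\n\n{context}")
        scores.append(SequenceMatcher(None, completion, reference).ratio())
        context += completion  # grow the prompt with the model's own output
    return scores
```

Character-level `SequenceMatcher` is only one choice of overlap measure; token-level metrics such as ROUGE-L or longest-common-subsequence length would serve the same purpose.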
Related papers
- Do LLMs Know to Respect Copyright Notice? [11.14140288980773]
We investigate whether language models infringe upon copyrights when processing user input containing protected material.
Our study offers a conservative evaluation of the extent to which language models may infringe upon copyrights.
This research emphasizes the need for further investigation and the importance of ensuring LLMs respect copyright regulations.
arXiv Detail & Related papers (2024-11-02T04:45:21Z)
- Evaluation of Attribution Bias in Retrieval-Augmented Large Language Models [47.694137341509304]
We evaluate the attribution sensitivity and bias with respect to authorship information in large language models.
Our results show that adding authorship information to source documents can significantly change the attribution quality of LLMs by 3% to 18%.
Our findings indicate that metadata of source documents can influence LLMs' trust and how they attribute their answers.
arXiv Detail & Related papers (2024-10-16T08:55:49Z)
- CopyLens: Dynamically Flagging Copyrighted Sub-Dataset Contributions to LLM Outputs [39.425944445393945]
We introduce CopyLens, a framework to analyze how copyrighted datasets may influence the responses of Large Language Models.
Experiments show that CopyLens improves efficiency and accuracy by 15.2% over our proposed baseline, 58.7% over prompt engineering methods, and 0.21 AUC over OOD detection baselines.
arXiv Detail & Related papers (2024-10-06T11:41:39Z)
- Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? [62.72729485995075]
We investigate the effectiveness of watermarking as a deterrent against the generation of copyrighted texts.
We find that watermarking adversely affects the success rate of Membership Inference Attacks (MIAs).
We propose an adaptive technique to improve the success rate of a recent MIA under watermarking.
arXiv Detail & Related papers (2024-07-24T16:53:09Z)
- Evaluating Copyright Takedown Methods for Language Models [100.38129820325497]
Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material.
This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs.
We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches.
arXiv Detail & Related papers (2024-06-26T18:09:46Z)
- LLMs and Memorization: On Quality and Specificity of Copyright Compliance [0.0]
Memorization in large language models (LLMs) is a growing concern.
LLMs have been shown to easily reproduce parts of their training data, including copyrighted work.
This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act.
arXiv Detail & Related papers (2024-05-28T18:01:52Z)
- LLatrieval: LLM-Verified Retrieval for Verifiable Generation [67.93134176912477]
Verifiable generation aims to let large language models (LLMs) generate text backed by supporting documents.
We propose LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question.
Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-11-14T01:38:02Z)
- Copyright Violations and Large Language Models [10.251605253237491]
This work explores the issue of copyright violations and large language models through the lens of verbatim memorization.
We present experiments with a range of language models over a collection of popular books and coding problems.
Overall, this research highlights the need for further examination and the importance of ensuring that future developments in natural language processing adhere to copyright regulations.
arXiv Detail & Related papers (2023-10-20T19:14:59Z)
- Source Attribution for Large Language Model-Generated Data [57.85840382230037]
It is imperative to be able to perform source attribution by identifying the data provider who contributed to the generation of a synthetic text.
We show that this problem can be tackled by watermarking.
We propose a source attribution framework that satisfies these key properties due to our algorithmic designs.
arXiv Detail & Related papers (2023-10-01T12:02:57Z)
- Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark [58.60940048748815]
Companies have begun to offer Embedding as a Service (EaaS) based on large language models (LLMs).
EaaS is vulnerable to model extraction attacks, which can cause significant losses for the owners of LLMs.
We propose an Embedding Watermark method called EmbMarker that implants backdoors on embeddings; a simplified sketch of this idea appears after this list.
arXiv Detail & Related papers (2023-05-17T08:28:54Z)
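Because the backdoor-watermark entry above is the most mechanism-oriented item in this list, here is a minimal sketch of how an EmbMarker-style embedding watermark can operate, as referenced above. The trigger words, interpolation weight, probe string, and verification threshold are all assumptions for illustration; the published method differs in its details.

```python
# Minimal sketch of an EmbMarker-style backdoor watermark on embeddings.
# Trigger set, weighting, probe, and threshold are illustrative assumptions.
import numpy as np

TRIGGERS = {"nebula", "quartz", "sonata"}  # hypothetical trigger words
_rng = np.random.default_rng(0)
TARGET = _rng.standard_normal(768)
TARGET /= np.linalg.norm(TARGET)           # provider's secret target vector


def serve_embedding(text: str, clean_emb: np.ndarray) -> np.ndarray:
    """Interpolate the served embedding toward the secret target in
    proportion to the number of trigger words in the input."""
    hits = sum(w in TRIGGERS for w in text.lower().split())
    weight = min(hits / len(TRIGGERS), 1.0)
    marked = (1.0 - weight) * clean_emb + weight * TARGET
    return marked / np.linalg.norm(marked)


def looks_extracted(suspect_embed, threshold: float = 0.5) -> bool:
    """A model distilled from the watermarked service should map a
    trigger-dense probe unusually close to the secret target."""
    emb = suspect_embed("nebula quartz sonata")
    return float(emb @ TARGET / np.linalg.norm(emb)) > threshold
```

The intuition behind this design: ordinary inputs rarely contain many trigger words, so served embeddings are barely perturbed, while a model extracted from the service inherits the backdoor and can be exposed with trigger-dense probes.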