CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation
- URL: http://arxiv.org/abs/2407.07087v2
- Date: Fri, 4 Oct 2024 05:35:57 GMT
- Title: CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright-Protected Text in Language Model Generation
- Authors: Tong Chen, Akari Asai, Niloofar Mireshghallah, Sewon Min, James Grimmelmann, Yejin Choi, Hannaneh Hajishirzi, Luke Zettlemoyer, Pang Wei Koh
- Abstract summary: We introduce CopyBench, a benchmark designed to measure both literal and non-literal copying in LM generations.
We find that, although literal copying is relatively rare, two types of non-literal copying -- event copying and character copying -- occur even in models as small as 7B parameters.
- Score: 132.00910067533982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating the degree of reproduction of copyright-protected content by language models (LMs) is of significant interest to the AI and legal communities. Although both literal and non-literal similarities are considered by courts when assessing the degree of reproduction, prior research has focused only on literal similarities. To bridge this gap, we introduce CopyBench, a benchmark designed to measure both literal and non-literal copying in LM generations. Using copyrighted fiction books as text sources, we provide automatic evaluation protocols to assess literal and non-literal copying, balanced against the model utility in terms of the ability to recall facts from the copyrighted works and generate fluent completions. We find that, although literal copying is relatively rare, two types of non-literal copying -- event copying and character copying -- occur even in models as small as 7B parameters. Larger models demonstrate significantly more copying, with literal copying rates increasing from 0.2% to 10.5% and non-literal copying from 2.3% to 5.9% when comparing Llama3-8B and 70B models, respectively. We further evaluate the effectiveness of current strategies for mitigating copying and show that (1) training-time alignment can reduce literal copying but may increase non-literal copying, and (2) current inference-time mitigation methods primarily reduce literal but not non-literal copying.
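CopyBench's full evaluation protocol is not reproduced in this summary; purely as an illustration of how literal copying can be scored, the minimal sketch below flags a model continuation that shares a long verbatim character span with the copyrighted source. The function names and the span-length threshold are assumptions for illustration, not the benchmark's actual settings.

```python
from difflib import SequenceMatcher

def longest_verbatim_span(generation: str, source: str) -> int:
    """Length, in characters, of the longest span the generation shares verbatim with the source."""
    sm = SequenceMatcher(None, generation, source, autojunk=False)
    return sm.find_longest_match(0, len(generation), 0, len(source)).size

def is_literal_copy(generation: str, source: str, min_chars: int = 160) -> bool:
    # 160 characters is an illustrative threshold, not CopyBench's actual criterion.
    return longest_verbatim_span(generation, source) >= min_chars

# Toy usage: a continuation that lifts a long passage verbatim from the source.
source = "It was a bright cold day in April, and the clocks were striking thirteen. " * 3
generation = "The story opens: " + source[:200]
print(longest_verbatim_span(generation, source), is_literal_copy(generation, source))
```

Non-literal copying (reuse of events and characters) cannot be caught by verbatim matching and, per the abstract, is evaluated with a separate protocol.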
Related papers
- CopyLens: Dynamically Flagging Copyrighted Sub-Dataset Contributions to LLM Outputs [39.425944445393945]
We introduce CopyLens, a framework to analyze how copyrighted datasets may influence the responses of Large Language Models.
Experiments show that CopyLens improves efficiency and accuracy by 15.2% over our proposed baseline, 58.7% over prompt engineering methods, and 0.21 AUC over OOD detection baselines.
arXiv Detail & Related papers (2024-10-06T11:41:39Z)
- Language Models "Grok" to Copy [36.50007948478452]
We examine the pre-training dynamics of language models, focusing on their ability to copy text from preceding context.
We propose a novel perspective that Transformer-based language models develop copying abilities similarly to grokking.
We contend that the connection between grokking and context copying can provide valuable insights for more effective language model training.
arXiv Detail & Related papers (2024-09-14T03:11:00Z)
- Fantastic Copyrighted Beasts and How (Not) to Generate Them [83.77348858322523]
Copyrighted characters pose a difficult challenge for image generation services.
At least one lawsuit has resulted in damages being awarded over the generation of such characters.
arXiv Detail & Related papers (2024-06-20T17:38:16Z)
- BERT-Enhanced Retrieval Tool for Homework Plagiarism Detection System [0.0]
We propose a plagiarized-text data generation method based on GPT-3.5, which produces a plagiarism detection dataset of 32,927 text pairs.
We also propose a plagiarism identification method based on BERT embeddings indexed with Faiss, achieving high efficiency and high accuracy (a minimal embedding-and-retrieval sketch appears after this list).
Our experiments show that this model outperforms other models on several metrics: 98.86% accuracy, 98.90% precision, 98.86% recall, and 0.9888 F1 score.
arXiv Detail & Related papers (2024-04-01T12:20:34Z)
- A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law grants creators the exclusive rights to reproduce, distribute, and monetize their creative works.
Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement.
We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z)
- Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z)
- DPIC: Decoupling Prompt and Intrinsic Characteristics for LLM Generated Text Detection [56.513637720967566]
Large language models (LLMs) can generate texts that pose risks of misuse, such as plagiarism, planting fake reviews on e-commerce platforms, or creating inflammatory false tweets.
Existing high-quality detection methods usually require access to the model's internals to extract its intrinsic characteristics.
We propose instead to extract deep intrinsic characteristics from texts generated by black-box models.
arXiv Detail & Related papers (2023-05-21T17:26:16Z)
- Reproduction and Replication of an Adversarial Stylometry Experiment [8.374836126235499]
This paper reproduces and replicates experiments in a seminal study of defenses against authorship attribution.
We find new evidence suggesting that an entirely automatic method, round-trip translation, merits re-examination.
arXiv Detail & Related papers (2022-08-15T18:24:00Z)
- May the Force Be with Your Copy Mechanism: Enhanced Supervised-Copy Method for Natural Language Generation [1.2453219864236247]
We propose a novel supervised copy-network approach that helps the model decide which words should be copied and which should be generated.
Specifically, we re-define the objective function, which leverages source sequences and target vocabularies as guidance for copying.
The experimental results on data-to-text generation and abstractive summarization tasks verify that our approach enhances the copying quality and improves the degree of abstractness.
arXiv Detail & Related papers (2021-12-20T06:54:28Z)
- On the Copying Behaviors of Pre-Training for Neural Machine Translation [63.914940899327966]
Previous studies have shown that initializing neural machine translation (NMT) models with pre-trained language models (LMs) can speed up training and boost performance.
In this work, we identify a critical side-effect of pre-training for NMT, excessive copying, which stems from the discrepancy between the training objectives of LM-based pre-training and NMT.
We propose a simple and effective method named copying penalty to control the copying behaviors in decoding.
arXiv Detail & Related papers (2021-07-17T10:02:30Z)
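The copying penalty in the last entry above is a decoding-time control; its exact formulation is not given in the summary. The sketch below shows one illustrative way such a penalty could be realized with the Hugging Face transformers API, by down-weighting, at each greedy decoding step, tokens that appear in the source sentence. The model name, penalty value, and greedy loop are assumptions for illustration, not the paper's method.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

def generate_with_copy_penalty(src_text, model_name="Helsinki-NLP/opus-mt-de-en",
                               penalty=2.0, max_new_tokens=64):
    """Greedy decoding that subtracts a fixed penalty from the logits of any
    token id already present in the source sentence, discouraging verbatim
    copying. An illustrative stand-in for a decoding-time copying penalty."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    enc = tok(src_text, return_tensors="pt")
    src_ids = list(set(enc["input_ids"][0].tolist()))
    decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(max_new_tokens):
        logits = model(**enc, decoder_input_ids=decoder_ids).logits[0, -1]
        logits[src_ids] -= penalty  # discourage emitting source tokens
        next_id = int(torch.argmax(logits))
        decoder_ids = torch.cat([decoder_ids, torch.tensor([[next_id]])], dim=1)
        if next_id == tok.eos_token_id:
            break
    return tok.decode(decoder_ids[0], skip_special_tokens=True)

print(generate_with_copy_penalty("Das ist ein Test."))
```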
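For the BERT-Enhanced Retrieval Tool entry referenced earlier in this list, the paper's exact pipeline is not reproduced here; the sketch below illustrates the general embed-and-retrieve pattern it describes: encode passages with a sentence-level BERT model and query a Faiss inner-product index for near-duplicates. The encoder name, toy corpus, and similarity threshold are illustrative assumptions, not the paper's configuration.

```python
import faiss
from sentence_transformers import SentenceTransformer

corpus = ["Original homework answer about binary search trees.",
          "An unrelated essay on the French Revolution."]
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice of encoder

# Embed and L2-normalize so inner product equals cosine similarity.
emb = model.encode(corpus, convert_to_numpy=True).astype("float32")
faiss.normalize_L2(emb)
index = faiss.IndexFlatIP(emb.shape[1])
index.add(emb)

def flag_plagiarism(submission: str, threshold: float = 0.85):
    """Return (best_match_text, similarity); best_match_text is None when the
    submission is not suspiciously close to any corpus document."""
    q = model.encode([submission], convert_to_numpy=True).astype("float32")
    faiss.normalize_L2(q)
    sims, ids = index.search(q, 1)
    best_sim = float(sims[0][0])
    if best_sim >= threshold:
        return corpus[ids[0][0]], best_sim
    return None, best_sim

print(flag_plagiarism("Original homework answer about binary search trees!"))
```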
This list is automatically generated from the titles and abstracts of the papers on this site.