Digger: Detecting Copyright Content Mis-usage in Large Language Model
Training
- URL: http://arxiv.org/abs/2401.00676v1
- Date: Mon, 1 Jan 2024 06:04:52 GMT
- Title: Digger: Detecting Copyright Content Mis-usage in Large Language Model
Training
- Authors: Haodong Li, Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei
Zhang, Yang Liu, Guoai Xu, Guosheng Xu, Haoyu Wang
- Abstract summary: We introduce a framework designed to detect and assess the presence of content from potentially copyrighted books within the training datasets of Large Language Models (LLMs).
This framework also provides a confidence estimation for the likelihood of each content sample's inclusion.
- Score: 23.99093718956372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training, which utilizes extensive and varied datasets, is a critical
factor in the success of Large Language Models (LLMs) across numerous
applications. However, the detailed makeup of these datasets is often not
disclosed, leading to concerns about data security and potential misuse. This
is particularly relevant when copyrighted material, still under legal
protection, is used inappropriately, either intentionally or unintentionally,
infringing on the rights of the authors.
In this paper, we introduce a detailed framework designed to detect and
assess the presence of content from potentially copyrighted books within the
training datasets of LLMs. This framework also provides a confidence estimation
for the likelihood of each content sample's inclusion. To validate our
approach, we conduct a series of simulated experiments, the results of which
affirm the framework's effectiveness in identifying and addressing instances of
content misuse in LLM training processes. Furthermore, we investigate the
presence of recognizable quotes from famous literary works within these
datasets. The outcomes of our study have significant implications for ensuring
the ethical use of copyrighted materials in the development of LLMs,
highlighting the need for more transparent and responsible data management
practices in this field.
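The paper's framework is not reproduced here, but the core signal it relies on — that a model tends to assign anomalously high likelihood to text it was trained on — can be sketched in a few lines. The snippet below is an illustrative toy, not the authors' method: `BigramLM` is a stand-in for a real LLM, and `inclusion_confidence` (a hypothetical name) scores a candidate excerpt's average log-likelihood as a z-score against excerpts known to be unseen, giving a rough analogue of the paper's per-sample confidence estimate.

```python
import math
from collections import defaultdict

class BigramLM:
    """Tiny character-bigram model standing in for an LLM (illustrative only)."""
    def __init__(self, corpus, alpha=0.5):
        self.alpha = alpha                      # add-alpha smoothing
        self.vocab = set(corpus)
        self.counts = defaultdict(lambda: defaultdict(int))
        for a, b in zip(corpus, corpus[1:]):
            self.counts[a][b] += 1

    def avg_log_likelihood(self, text):
        """Average log-probability per bigram; higher means a better fit."""
        V = max(len(self.vocab), 1)
        total = 0.0
        for a, b in zip(text, text[1:]):
            c = self.counts[a][b]
            n = sum(self.counts[a].values())
            total += math.log((c + self.alpha) / (n + self.alpha * V))
        return total / max(len(text) - 1, 1)

def inclusion_confidence(model, sample, reference_texts):
    """z-score of the sample's likelihood against texts known to be unseen.

    A large positive score suggests the model fits `sample` unusually
    well, i.e. the sample may have been part of its training data.
    """
    ref = [model.avg_log_likelihood(t) for t in reference_texts]
    mu = sum(ref) / len(ref)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in ref) / len(ref)) or 1.0
    return (model.avg_log_likelihood(sample) - mu) / sigma
```

In practice the stand-in model would be replaced by per-token log-probabilities from the target LLM, and the reference distribution would need careful calibration (as the paper's confidence-estimation component addresses), since likelihood alone conflates memorization with generic fluency.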
Related papers
- Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? [62.72729485995075]
We investigate the effectiveness of watermarking as a deterrent against the generation of copyrighted texts.
We find that watermarking adversely affects the success rate of Membership Inference Attacks (MIAs).
We propose an adaptive technique to improve the success rate of a recent MIA under watermarking.
arXiv Detail & Related papers (2024-07-24T16:53:09Z)
- Evaluating Copyright Takedown Methods for Language Models [100.38129820325497]
Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material.
This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs.
We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches.
arXiv Detail & Related papers (2024-06-26T18:09:46Z)
- Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
arXiv Detail & Related papers (2024-04-26T09:51:24Z)
- C-ICL: Contrastive In-context Learning for Information Extraction [54.39470114243744]
c-ICL is a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations.
Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods.
arXiv Detail & Related papers (2024-02-17T11:28:08Z)
- A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law confers creators the exclusive rights to reproduce, distribute, and monetize their creative works.
Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement.
We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z)
- Breaking the Silence: the Threats of Using LLMs in Software Engineering [12.368546216271382]
Large Language Models (LLMs) have gained considerable traction within the Software Engineering (SE) community.
This paper initiates an open discussion on potential threats to the validity of LLM-based research.
arXiv Detail & Related papers (2023-12-13T11:02:19Z)
- WASA: WAtermark-based Source Attribution for Large Language Model-Generated Data [60.759755177369364]
Large language models (LLMs) generate synthetic texts with embedded watermarks that contain information about their source(s).
We propose a WAtermarking for Source Attribution (WASA) framework that satisfies key properties due to our algorithmic designs.
Our framework achieves effective source attribution and data provenance.
arXiv Detail & Related papers (2023-10-01T12:02:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.