Digger: Detecting Copyright Content Mis-usage in Large Language Model
Training
- URL: http://arxiv.org/abs/2401.00676v1
- Date: Mon, 1 Jan 2024 06:04:52 GMT
- Title: Digger: Detecting Copyright Content Mis-usage in Large Language Model
Training
- Authors: Haodong Li, Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei
Zhang, Yang Liu, Guoai Xu, Guosheng Xu, Haoyu Wang
- Abstract summary: We introduce a framework designed to detect and assess the presence of content from potentially copyrighted books within the training datasets of Large Language Models (LLMs).
This framework also provides a confidence estimation for the likelihood of each content sample's inclusion.
- Score: 23.99093718956372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training, which utilizes extensive and varied datasets, is a critical
factor in the success of Large Language Models (LLMs) across numerous
applications. However, the detailed makeup of these datasets is often not
disclosed, leading to concerns about data security and potential misuse. This
is particularly relevant when copyrighted material, still under legal
protection, is used inappropriately, either intentionally or unintentionally,
infringing on the rights of the authors.
In this paper, we introduce a detailed framework designed to detect and
assess the presence of content from potentially copyrighted books within the
training datasets of LLMs. This framework also provides a confidence estimation
for the likelihood of each content sample's inclusion. To validate our
approach, we conduct a series of simulated experiments, the results of which
affirm the framework's effectiveness in identifying and addressing instances of
content misuse in LLM training processes. Furthermore, we investigate the
presence of recognizable quotes from famous literary works within these
datasets. The outcomes of our study have significant implications for ensuring
the ethical use of copyrighted materials in the development of LLMs,
highlighting the need for more transparent and responsible data management
practices in this field.
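The paper's framework is not reproduced here, but the core signal it relies on — that a model tends to assign anomalously high likelihood to text it was trained on — can be sketched in a few lines. The snippet below is an illustrative toy, not the authors' method: `BigramLM` is a stand-in for a real LLM, and `inclusion_confidence` (a hypothetical name) scores a candidate excerpt's average log-likelihood as a z-score against excerpts known to be unseen, giving a rough analogue of the paper's per-sample confidence estimate.

```python
import math
from collections import defaultdict

class BigramLM:
    """Tiny character-bigram model standing in for an LLM (illustrative only)."""
    def __init__(self, corpus, alpha=0.5):
        self.alpha = alpha                      # add-alpha smoothing
        self.vocab = set(corpus)
        self.counts = defaultdict(lambda: defaultdict(int))
        for a, b in zip(corpus, corpus[1:]):
            self.counts[a][b] += 1

    def avg_log_likelihood(self, text):
        """Average log-probability per bigram; higher means a better fit."""
        V = max(len(self.vocab), 1)
        total = 0.0
        for a, b in zip(text, text[1:]):
            c = self.counts[a][b]
            n = sum(self.counts[a].values())
            total += math.log((c + self.alpha) / (n + self.alpha * V))
        return total / max(len(text) - 1, 1)

def inclusion_confidence(model, sample, reference_texts):
    """z-score of the sample's likelihood against texts known to be unseen.

    A large positive score suggests the model fits `sample` unusually
    well, i.e. the sample may have been part of its training data.
    """
    ref = [model.avg_log_likelihood(t) for t in reference_texts]
    mu = sum(ref) / len(ref)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in ref) / len(ref)) or 1.0
    return (model.avg_log_likelihood(sample) - mu) / sigma
```

In practice the stand-in model would be replaced by per-token log-probabilities from the target LLM, and the reference distribution would need careful calibration (as the paper's confidence-estimation component addresses), since likelihood alone conflates memorization with generic fluency.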
Related papers
- Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? [62.72729485995075]
We investigate the effectiveness of watermarking as a deterrent against the generation of copyrighted texts.
We find that watermarking adversely affects the success rate of Membership Inference Attacks (MIAs).
We propose an adaptive technique to improve the success rate of a recent MIA under watermarking.
arXiv Detail & Related papers (2024-07-24T16:53:09Z)
- Evaluating Copyright Takedown Methods for Language Models [100.38129820325497]
Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material.
This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs.
We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches.
arXiv Detail & Related papers (2024-06-26T18:09:46Z)
- Lazy Data Practices Harm Fairness Research [49.02318458244464]
We present a comprehensive analysis of fair ML datasets, demonstrating how unreflective practices hinder the reach and reliability of algorithmic fairness findings.
Our analyses identify three main areas of concern: (1) a lack of representation for certain protected attributes in both data and evaluations; (2) the widespread exclusion of minorities during data preprocessing; and (3) opaque data processing threatening the generalization of fairness research.
This study underscores the need for a critical reevaluation of data practices in fair ML and offers directions to improve both the sourcing and usage of datasets.
arXiv Detail & Related papers (2024-04-26T09:51:24Z)
- C-ICL: Contrastive In-context Learning for Information Extraction [54.39470114243744]
c-ICL is a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations.
Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods.
arXiv Detail & Related papers (2024-02-17T11:28:08Z)
- A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models [52.49582606341111]
Copyright law confers creators the exclusive rights to reproduce, distribute, and monetize their creative works.
Recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement.
We introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset.
arXiv Detail & Related papers (2024-01-04T11:14:01Z)
- Breaking the Silence: the Threats of Using LLMs in Software Engineering [12.368546216271382]
Large Language Models (LLMs) have gained considerable traction within the Software Engineering (SE) community.
This paper initiates an open discussion on potential threats to the validity of LLM-based research.
arXiv Detail & Related papers (2023-12-13T11:02:19Z)
- WASA: WAtermark-based Source Attribution for Large Language Model-Generated Data [60.759755177369364]
Large language models (LLMs) generate synthetic texts with embedded watermarks that contain information about their source(s).
We propose a WAtermarking for Source Attribution (WASA) framework that satisfies key properties due to our algorithmic designs.
Our framework achieves effective source attribution and data provenance.
arXiv Detail & Related papers (2023-10-01T12:02:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.