Ghost Sentence: A Tool for Everyday Users to Copyright Data from Large Language Models
- URL: http://arxiv.org/abs/2403.15740v1
- Date: Sat, 23 Mar 2024 06:36:32 GMT
- Title: Ghost Sentence: A Tool for Everyday Users to Copyright Data from Large Language Models
- Authors: Shuai Zhao, Linchao Zhu, Ruijie Quan, Yi Yang,
- Abstract summary: Web user data plays a central role in the ecosystem of pre-trained large language models (LLMs)
In this work, we suggest that users repeatedly insert personal passphrases into their documents.
Once they are identified in the generated content of LLMs, users can be sure that their data is used for training.
- Score: 55.321010757641524
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Web user data plays a central role in the ecosystem of pre-trained large language models (LLMs) and their fine-tuned variants. Billions of data are crawled from the web and fed to LLMs. How can \textit{\textbf{everyday web users}} confirm if LLMs misuse their data without permission? In this work, we suggest that users repeatedly insert personal passphrases into their documents, enabling LLMs to memorize them. These concealed passphrases in user documents, referred to as \textit{ghost sentences}, once they are identified in the generated content of LLMs, users can be sure that their data is used for training. To explore the effectiveness and usage of this copyrighting tool, we define the \textit{user training data identification} task with ghost sentences. Multiple datasets from various sources at different scales are created and tested with LLMs of different sizes. For evaluation, we introduce a last $k$ words verification manner along with two metrics: document and user identification accuracy. In the specific case of instruction tuning of a 3B LLaMA model, 11 out of 16 users with ghost sentences identify their data within the generation content. These 16 users contribute 383 examples to $\sim$1.8M training documents. For continuing pre-training of a 1.1B TinyLlama model, 61 out of 64 users with ghost sentences identify their data within the LLM output. These 64 users contribute 1156 examples to $\sim$10M training documents.
Related papers
- Identifying the Source of Generation for Large Language Models [21.919661430250798]
Large language models (LLMs) memorize text from several sources of documents.
LLMs can not provide document information on the generated content.
This work introduces token-level source identification in the decoding step.
arXiv Detail & Related papers (2024-07-05T08:52:15Z) - Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models [36.58320580210008]
We show that certain special characters or their combinations with English letters are stronger memory triggers, leading to more severe data leakage.
We propose a simple but effective Special Characters Attack (SCA) to induce training data leakage.
arXiv Detail & Related papers (2024-05-09T02:35:32Z) - MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents [62.02920842630234]
We show how to build small models that have GPT-4-level performance but for 400x lower cost.
We unify pre-existing datasets into a benchmark LLM-AggreFact.
Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy.
arXiv Detail & Related papers (2024-04-16T17:59:10Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - Elephants Never Forget: Testing Language Models for Memorization of
Tabular Data [21.912611415307644]
Large Language Models (LLMs) can be applied to a diverse set of tasks, but the critical issues of data contamination and memorization are often glossed over.
We introduce a variety of different techniques to assess the degrees of contamination, including statistical tests for conditional distribution modeling and four tests that identify memorization.
arXiv Detail & Related papers (2024-03-11T12:07:13Z) - User Modeling in the Era of Large Language Models: Current Research and
Future Directions [26.01029236902786]
User modeling (UM) aims to discover patterns or learn representations from user data about a specific user.
Two common types of user data are text and graph, as the data usually contain a large amount of user-generated content (UGC) and online interactions.
Recently, large language models (LLMs) have shown superior performance on generating, understanding, and even reasoning over text data.
arXiv Detail & Related papers (2023-12-11T03:59:36Z) - LLatrieval: LLM-Verified Retrieval for Verifiable Generation [67.93134176912477]
Verifiable generation aims to let the large language model (LLM) generate text with supporting documents.
We propose LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question.
Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-11-14T01:38:02Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large
Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - AnnoLLM: Making Large Language Models to Be Better Crowdsourced Annotators [98.11286353828525]
GPT-3.5 series models have demonstrated remarkable few-shot and zero-shot ability across various NLP tasks.
We propose AnnoLLM, which adopts a two-step approach, explain-then-annotate.
We build the first conversation-based information retrieval dataset employing AnnoLLM.
arXiv Detail & Related papers (2023-03-29T17:03:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.