Identifying the Source of Generation for Large Language Models
- URL: http://arxiv.org/abs/2407.12846v1
- Date: Fri, 5 Jul 2024 08:52:15 GMT
- Title: Identifying the Source of Generation for Large Language Models
- Authors: Bumjin Park, Jaesik Choi
- Abstract summary: Large language models (LLMs) memorize text from several sources of documents.
LLMs cannot provide document information about the generated content.
This work introduces token-level source identification in the decoding step.
- Score: 21.919661430250798
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models (LLMs) memorize text from several sources of documents. In pretraining, an LLM is trained to maximize the likelihood of text but neither receives the source of the text nor memorizes it. Accordingly, the LLM cannot provide document information about the generated content, and users obtain no hint of its reliability, which is crucial for assessing factuality or privacy infringement. This work introduces token-level source identification in the decoding step, which maps the token representation to the reference document. We propose a bi-gram source identifier, a multi-layer perceptron that takes two successive token representations as input for better generalization. We conduct extensive experiments on the Wikipedia and PG19 datasets with several LLMs, layer locations, and identifier sizes. The overall results show the feasibility of token-level source identifiers for tracing the source document, a crucial problem for the safe use of LLMs.
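The bi-gram identifier described in the abstract can be sketched as a small MLP over the concatenation of two successive token representations. The sketch below is illustrative only: the layer dimension, MLP width, number of candidate documents, and the (random, untrained) weights are all assumptions, not the paper's trained identifier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: dimension of the probed LLM layer, MLP width,
# and number of candidate source documents (all illustrative).
HIDDEN, WIDTH, N_DOCS = 64, 32, 4

# Untrained random weights stand in for the trained identifier.
W1 = rng.normal(scale=0.1, size=(2 * HIDDEN, WIDTH))
b1 = np.zeros(WIDTH)
W2 = rng.normal(scale=0.1, size=(WIDTH, N_DOCS))
b2 = np.zeros(N_DOCS)

def bigram_source_id(h_prev, h_curr):
    """Map two successive token representations to a distribution
    over candidate source documents (the 'bi-gram' identifier)."""
    x = np.concatenate([h_prev, h_curr])      # (2 * HIDDEN,)
    h = np.maximum(x @ W1 + b1, 0.0)          # ReLU hidden layer
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())         # numerically stable softmax
    return e / e.sum()

# At each decoding step, feed the previous and current hidden states.
probs = bigram_source_id(rng.normal(size=HIDDEN), rng.normal(size=HIDDEN))
predicted_doc = int(np.argmax(probs))
```

Using two successive representations rather than one gives the identifier local context about the token transition, which the paper argues improves generalization over a uni-gram probe.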
Related papers
- Hiding Text in Large Language Models: Introducing Unconditional Token Forcing Confusion [0.0]
We propose a novel approach to extraction called Unconditional Token Forcing.
We present a method to hide text in such a way that it is resistant to Unconditional Token Forcing.
arXiv Detail & Related papers (2024-06-04T16:49:06Z)
- Detecting Hallucinations in Large Language Model Generation: A Token Probability Approach [0.0]
Large Language Models (LLMs) produce inaccurate outputs, also known as hallucinations.
This paper introduces a supervised learning approach employing only four numerical features derived from tokens and vocabulary probabilities obtained from other evaluators.
The method yields promising results, surpassing state-of-the-art outcomes in multiple tasks across three different benchmarks.
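The four-feature supervised approach summarized above can be sketched as a tiny classifier over statistics of the generated tokens' probabilities. Note the specific features and the logistic-regression weights below are illustrative stand-ins, not the paper's actual features or trained model.

```python
import math

def extract_features(token_probs):
    """Four numerical features derived from per-token probabilities.
    These particular features are hypothetical examples."""
    n = len(token_probs)
    mean_p = sum(token_probs) / n                          # average confidence
    min_p = min(token_probs)                               # weakest token
    avg_nll = -sum(math.log(p) for p in token_probs) / n   # mean neg. log-likelihood
    frac_low = sum(p < 0.1 for p in token_probs) / n       # share of low-prob tokens
    return [mean_p, min_p, avg_nll, frac_low]

# Made-up logistic-regression weights standing in for a trained classifier.
WEIGHTS = [-2.0, -1.5, 0.8, 3.0]
BIAS = 0.5

def hallucination_score(token_probs):
    """Return an estimated probability that the generation is hallucinated."""
    feats = extract_features(token_probs)
    z = BIAS + sum(w * f for w, f in zip(WEIGHTS, feats))
    return 1.0 / (1.0 + math.exp(-z))
```

A generation with uniformly high token probabilities should score lower than one full of low-probability tokens, which is the intuition the blurb's supervised method formalizes with a trained model.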
arXiv Detail & Related papers (2024-05-30T03:00:47Z)
- Generative Text Steganography with Large Language Model [10.572149957139736]
We propose LLM-Stega, a black-box generative text steganography method based on the user interfaces of large language models.
We first construct a keyword set and design a new encrypted steganographic mapping to embed secret messages.
Comprehensive experiments demonstrate that the proposed LLM-Stega outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-04-16T02:19:28Z)
- Ghost Sentence: A Tool for Everyday Users to Copyright Data from Large Language Models [55.321010757641524]
Web user data plays a central role in the ecosystem of pre-trained large language models (LLMs).
In this work, we suggest that users repeatedly insert personal passphrases into their documents.
Once they are identified in the generated content of LLMs, users can be sure that their data is used for training.
arXiv Detail & Related papers (2024-03-23T06:36:32Z)
- Can Large Language Models Recall Reference Location Like Humans? [8.657708519922002]
This paper explores leveraging the parameterized knowledge stored during the pre-training phase of large language models to independently recall a reference passage from any starting position.
Experiments on KILT knowledge-sensitive tasks have verified that LLMs can independently recall reference passage locations in various task forms.
arXiv Detail & Related papers (2024-02-26T20:35:32Z)
- LLsM: Generative Linguistic Steganography with Large Language Model [10.72286166021398]
Linguistic Steganography (LS) tasks aim to generate steganographic text (stego) based on secret information.
Existing LS methods do not consider the controllable generation of stegos containing specific discourses.
This paper proposes LLsM, the first LS work with a Large Language Model (LLM).
arXiv Detail & Related papers (2024-01-28T13:21:44Z)
- LLatrieval: LLM-Verified Retrieval for Verifiable Generation [67.93134176912477]
Verifiable generation aims to let the large language model (LLM) generate text with supporting documents.
We propose LLatrieval (Large Language Model Verified Retrieval), where the LLM updates the retrieval result until it verifies that the retrieved documents can sufficiently support answering the question.
Experiments show that LLatrieval significantly outperforms extensive baselines and achieves state-of-the-art results.
arXiv Detail & Related papers (2023-11-14T01:38:02Z)
- WASA: WAtermark-based Source Attribution for Large Language Model-Generated Data [60.759755177369364]
Large language models (LLMs) generate synthetic texts with embedded watermarks that contain information about their source(s).
We propose a WAtermarking for Source Attribution (WASA) framework that satisfies key properties due to our algorithmic designs.
Our framework achieves effective source attribution and data provenance.
arXiv Detail & Related papers (2023-10-01T12:02:57Z)
- Towards Codable Watermarking for Injecting Multi-bits Information to LLMs [86.86436777626959]
Large language models (LLMs) generate texts with increasing fluency and realism.
Existing watermarking methods are encoding-inefficient and cannot flexibly meet the diverse information encoding needs.
We propose Codable Text Watermarking for LLMs (CTWL) that allows text watermarks to carry multi-bit customizable information.
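Multi-bit text watermarking of the kind this blurb describes can be sketched by partitioning the vocabulary with a keyed hash at each position and letting each message bit select the partition the generator must sample from. This is only an illustrative toy (the vocabulary, hash scheme, and greedy token choice are assumptions), not the CTWL algorithm itself.

```python
import hashlib

# Toy vocabulary; a real LLM tokenizer would supply this.
VOCAB = [f"tok{i}" for i in range(32)]

def bucket(token, position):
    """Keyed 2-way partition of the vocabulary, re-seeded per position.
    The same function serves as the shared watermark key for decoding."""
    h = hashlib.sha256(f"{position}:{token}".encode()).digest()
    return h[0] & 1

def allowed_tokens(bit, position):
    """The message bit selects which half the generator may emit."""
    return [t for t in VOCAB if bucket(t, position) == bit]

def decode_bit(token, position):
    """Anyone holding the key recovers the bit from the emitted token."""
    return bucket(token, position)

# Embed a 4-bit message by always emitting a token from the allowed half
# (a real system would sample from the LLM's distribution restricted to it).
message = [1, 0, 1, 1]
text = [allowed_tokens(b, i)[0] for i, b in enumerate(message)]
recovered = [decode_bit(t, i) for i, t in enumerate(text)]
```

Encoding efficiency here is one bit per token; the blurb's point is that practical schemes need richer, customizable payloads than such a fixed partition provides.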
arXiv Detail & Related papers (2023-07-29T14:11:15Z)
- LLMDet: A Third Party Large Language Models Generated Text Detection Tool [119.0952092533317]
Text generated by large language models (LLMs) is remarkably close to high-quality human-authored text.
Existing detection tools can only differentiate between machine-generated and human-authored text.
We propose LLMDet, a model-specific, secure, efficient, and extendable detection tool.
arXiv Detail & Related papers (2023-05-24T10:45:16Z)
- Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator to speed up Large Language Model (LLM) inference with references.
It is motivated by the observation that there are abundant identical text spans between the decoding result of an LLM and a reference that is available in many real-world scenarios.
arXiv Detail & Related papers (2023-04-10T09:55:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.