Watermarking Makes Language Models Radioactive
- URL: http://arxiv.org/abs/2402.14904v1
- Date: Thu, 22 Feb 2024 18:55:22 GMT
- Title: Watermarking Makes Language Models Radioactive
- Authors: Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, Teddy
Furon
- Abstract summary: We show that watermarked training data leaves traces that are easier to detect and far more reliable than membership inference.
We notably demonstrate that training on watermarked synthetic instructions can be detected with high confidence.
- Score: 25.33316874135086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the radioactivity of LLM-generated texts, i.e.
whether it is possible to detect that such input was used as training data.
Conventional methods like membership inference can carry out this detection
with some level of accuracy. We show that watermarked training data leaves
traces that are easier to detect and far more reliable than membership inference.
We link the contamination level to the watermark's robustness, its proportion in the
training set, and the fine-tuning process. We notably demonstrate that training
on watermarked synthetic instructions can be detected with high confidence
(p-value < 1e-5) even when as little as 5% of training text is watermarked.
Thus, LLM watermarking, originally designed for detecting machine-generated
text, makes it easy to identify whether the outputs of a watermarked LLM
were used to fine-tune another LLM.
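As a rough illustration of the detection principle (a minimal sketch, not the paper's exact statistical test), the code below assumes a Kirchenbauer-style green-list watermark: the watermark owner re-scores text sampled from the suspect model with the secret key and runs a one-sided binomial test on the green-token count. The key, hashing scheme, GAMMA, and all function names are illustrative assumptions.

```python
import hashlib
import random

from scipy.stats import binomtest

GAMMA = 0.25                      # assumed fraction of the vocabulary on the green list
SECRET_KEY = "watermark-secret"   # illustrative stand-in for the owner's private key


def is_green(prev_token: int, token: int) -> bool:
    """Keyed pseudo-random vocabulary partition (simplified green-list hashing):
    given the key and the previous token, a candidate token is green with
    probability GAMMA, deterministically."""
    digest = hashlib.sha256(f"{SECRET_KEY}:{prev_token}:{token}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big")).random() < GAMMA


def radioactivity_pvalue(token_ids: list[int]) -> float:
    """One-sided binomial test on text sampled from the suspect model.
    Under the null hypothesis (no training on watermarked text), each scored
    token is green with probability GAMMA; an excess of green tokens yields
    a small p-value and suggests contamination."""
    greens = sum(is_green(prev, tok) for prev, tok in zip(token_ids, token_ids[1:]))
    scored = len(token_ids) - 1
    return binomtest(greens, scored, GAMMA, alternative="greater").pvalue


# Usage sketch: prompt the suspect model, tokenize its answers into token IDs,
# and score them; a p-value below, say, 1e-5 (the threshold quoted above)
# would flag the fine-tuning data as contaminated.
```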
Related papers
- LexiMark: Robust Watermarking via Lexical Substitutions to Enhance Membership Verification of an LLM's Textual Training Data [24.312198733476063]
LexiMark is a novel watermarking technique designed for text and documents. It embeds synonym substitutions for carefully selected high-entropy words. It is resistant to removal due to its subtle, contextually appropriate substitutions.
arXiv Detail & Related papers (2025-06-17T12:41:53Z)
- GaussMark: A Practical Approach for Structural Watermarking of Language Models [61.84270985214254]
GaussMark is a simple, efficient, and relatively robust scheme for watermarking large language models.
We show that GaussMark is reliable, efficient, and relatively robust to corruptions such as insertions, deletions, substitutions, and roundtrip translations.
arXiv Detail & Related papers (2025-01-17T22:30:08Z)
- Robust Detection of Watermarks for Large Language Models Under Human Edits [27.678152860666163]
We introduce a new method, a truncated goodness-of-fit (Tr-GoF) test, for detecting watermarked text under human edits.
We prove that the Tr-GoF test achieves optimality in robust detection of the Gumbel-max watermark.
We also show that the Tr-GoF test attains the highest detection efficiency rate in a certain regime of moderate text modifications.
arXiv Detail & Related papers (2024-11-21T06:06:04Z)
- Signal Watermark on Large Language Models [28.711745671275477]
We propose a watermarking method that embeds a specific watermark into text as it is generated by Large Language Models (LLMs).
This technique not only ensures the watermark's invisibility to humans but also maintains the quality and grammatical integrity of model-generated text.
Our method has been empirically validated across multiple LLMs, consistently maintaining high detection accuracy.
arXiv Detail & Related papers (2024-10-09T04:49:03Z)
- A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules [27.678152860666163]
We introduce a framework for reasoning about the statistical efficiency of watermarks and powerful detection rules.
We derive optimal detection rules for watermarks under our framework.
arXiv Detail & Related papers (2024-04-01T17:03:41Z)
- Learning to Watermark LLM-generated Text via Reinforcement Learning [16.61005372279407]
We study how to watermark LLM outputs to track misuse.
We design a model-level watermark that embeds signals into the output.
We propose a co-training framework based on reinforcement learning.
arXiv Detail & Related papers (2024-03-13T03:43:39Z)
- Towards Codable Watermarking for Injecting Multi-bits Information to LLMs [86.86436777626959]
Large language models (LLMs) generate texts with increasing fluency and realism.
Existing watermarking methods are encoding-inefficient and cannot flexibly meet diverse information-encoding needs.
We propose Codable Text Watermarking for LLMs (CTWL) that allows text watermarks to carry multi-bit customizable information.
arXiv Detail & Related papers (2023-07-29T14:11:15Z)
- On the Reliability of Watermarks for Large Language Models [95.87476978352659]
We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document.
We find that watermarks remain detectable even after human and machine paraphrasing.
We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document.
arXiv Detail & Related papers (2023-06-07T17:58:48Z)
- Watermarking Text Generated by Black-Box Language Models [103.52541557216766]
A watermark-based method was proposed for white-box LLMs, allowing them to embed watermarks during text generation.
A detection algorithm that knows the watermark's token list can identify the watermarked text.
We develop a watermarking framework for black-box language model usage scenarios.
arXiv Detail & Related papers (2023-05-14T07:37:33Z)
- Can AI-Generated Text be Reliably Detected? [54.670136179857344]
Unregulated use of LLMs can potentially lead to malicious consequences such as plagiarism, generating fake news, spamming, etc.
Recent works attempt to tackle this problem either using certain model signatures present in the generated text outputs or by applying watermarking techniques.
In this paper, we show that these detectors are not reliable in practical scenarios.
arXiv Detail & Related papers (2023-03-17T17:53:19Z)
- DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature [143.5381108333212]
We show that text sampled from a large language model tends to occupy negative curvature regions of the model's log probability function.
We then define a new curvature-based criterion for judging if a passage is generated from a given LLM.
We find DetectGPT is more discriminative than existing zero-shot methods for model sample detection.
arXiv Detail & Related papers (2023-01-26T18:44:06Z)
- A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters (a simplified generation-time sketch appears below this entry).
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
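As context for green-list schemes such as the one summarized in "A Watermark for Large Language Models" above, here is a minimal generation-time sketch under the same illustrative assumptions as the detection snippet earlier (GAMMA, DELTA, the key, and the hashing are inventions for exposition, not any paper's released recipe): green-token logits get a small boost before sampling, which barely affects quality but creates the statistical bias that radioactivity detection later exploits.

```python
import hashlib
import random

import torch

GAMMA = 0.25   # fraction of the vocabulary treated as green
DELTA = 2.0    # assumed logit boost applied to green tokens at each step
SECRET_KEY = "watermark-secret"   # illustrative shared secret


def green_mask(prev_token: int, vocab_size: int) -> torch.Tensor:
    """Deterministic keyed partition: token t is green iff a hash of
    (key, prev_token, t) falls below GAMMA. The loop is slow but explicit;
    real implementations vectorize this step."""
    mask = torch.zeros(vocab_size, dtype=torch.bool)
    for token in range(vocab_size):
        digest = hashlib.sha256(f"{SECRET_KEY}:{prev_token}:{token}".encode()).digest()
        mask[token] = random.Random(int.from_bytes(digest[:8], "big")).random() < GAMMA
    return mask


def watermarked_next_token(logits: torch.Tensor, prev_token: int) -> int:
    """Boost green-token logits by DELTA, renormalize, and sample the next
    token: text quality is largely preserved, but green tokens become
    statistically over-represented."""
    biased = logits + DELTA * green_mask(prev_token, logits.shape[-1]).float()
    probs = torch.softmax(biased, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```

The DELTA/GAMMA trade-off governs how strong and how robust the watermark is, which, per the abstract above, in turn determines how much of the signal survives fine-tuning and remains detectable in the student model.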
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.