Watermarking Makes Language Models Radioactive
- URL: http://arxiv.org/abs/2402.14904v1
- Date: Thu, 22 Feb 2024 18:55:22 GMT
- Title: Watermarking Makes Language Models Radioactive
- Authors: Tom Sander, Pierre Fernandez, Alain Durmus, Matthijs Douze, Teddy
Furon
- Abstract summary: We show that watermarked training data leaves traces that are easier to detect and far more reliable than membership inference.
We notably demonstrate that training on watermarked synthetic instructions can be detected with high confidence.
- Score: 25.33316874135086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the radioactivity of LLM-generated texts, i.e.,
whether it is possible to detect that such text was later used as training data.
Conventional methods like membership inference can carry out this detection
with some level of accuracy. We show that watermarked training data leaves
traces that are easier to detect and far more reliable than membership
inference. We link the detected contamination level to the watermark's
robustness, its proportion in the training set, and the fine-tuning process.
on watermarked synthetic instructions can be detected with high confidence
(p-value < 1e-5) even when as little as 5% of training text is watermarked.
Thus, LLM watermarking, originally designed for detecting machine-generated
text, gives the ability to easily identify if the outputs of a watermarked LLM
were used to fine-tune another LLM.
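To make the detection idea concrete, here is a minimal sketch assuming a Kirchenbauer-style green/red-list watermark; the function names are illustrative, and the paper's actual statistical test is more careful (for instance, it accounts for repeated n-grams):

```python
import hashlib
import math

GAMMA = 0.25  # assumed fraction of the vocabulary treated as "green"

def is_green(prev_token: int, token: int, key: bytes) -> bool:
    # Pseudorandomly mark a token green from the secret key and local context.
    h = hashlib.sha256(key + prev_token.to_bytes(4, "big")
                       + token.to_bytes(4, "big")).digest()
    return int.from_bytes(h[:8], "big") < GAMMA * 2**64

def radioactivity_pvalue(suspect_tokens: list[int], key: bytes) -> float:
    # One-sided p-value for H0: "the suspect model's text carries no watermark".
    pairs = list(zip(suspect_tokens, suspect_tokens[1:]))
    greens = sum(is_green(p, t, key) for p, t in pairs)
    n = len(pairs)
    # Under H0, greens ~ Binomial(n, GAMMA); normal approximation below.
    z = (greens - GAMMA * n) / math.sqrt(GAMMA * (1 - GAMMA) * n)
    return 0.5 * math.erfc(z / math.sqrt(2))
```

Because the green-token count is binomial under the null hypothesis, a fine-tuned model that systematically over-produces green tokens yields a tiny p-value; that excess is the "radioactive" trace.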
Related papers
- Less is More: Sparse Watermarking in LLMs with Enhanced Text Quality [27.592486717044455]
We present a novel type of watermark, Sparse Watermark, which aims to mitigate the trade-off between detectability and text quality by applying watermarks to a small subset of generated tokens distributed across the text.
Our experimental results demonstrate that the proposed scheme achieves high detectability while generating text whose quality surpasses that of previous watermarking methods across various tasks.
arXiv Detail & Related papers (2024-07-17T18:52:12Z)
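As a rough illustration of the sparse idea; the position-selection rule below (every K-th token) is a stand-in assumption, not the paper's actual strategy:

```python
import numpy as np

DELTA, K = 4.0, 8  # assumed bias strength and sparsity period

def maybe_bias(logits: np.ndarray, position: int,
               green_mask: np.ndarray) -> np.ndarray:
    # Leave most positions untouched; watermark only every K-th token.
    if position % K != 0:
        return logits
    return logits + DELTA * green_mask  # boost green-listed tokens here
```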
- Learning to Watermark LLM-generated Text via Reinforcement Learning [16.61005372279407]
We study how to watermark LLM outputs to track misuse.
We design a model-level watermark that embeds signals into the output.
We propose a co-training framework based on reinforcement learning.
arXiv Detail & Related papers (2024-03-13T03:43:39Z)
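One plausible shape for the RL signal, purely as a hedged sketch (the paper's co-training framework is more elaborate; the names and the penalty term here are assumptions):

```python
import math

def rl_reward(green_rate: float, n_tokens: int, logprob_ratio: float,
              gamma: float = 0.25, beta: float = 0.1) -> float:
    # Detectability as a z-score of the green-token rate, minus a KL-style
    # penalty (mean log-prob ratio vs. a reference model) to preserve quality.
    z = (green_rate - gamma) * math.sqrt(n_tokens / (gamma * (1 - gamma)))
    return z - beta * logprob_ratio
```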
- On the Learnability of Watermarks for Language Models [80.97358663708592]
We ask whether language models can directly learn to generate watermarked text.
We propose watermark distillation, which trains a student model to behave like a teacher model.
We find that models can learn to generate watermarked text with high detectability.
arXiv Detail & Related papers (2023-12-07T17:41:44Z)
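A hedged sketch of one natural variant of this idea: fine-tune the student with an ordinary language-modeling loss on text sampled from the watermarked teacher, so the watermark bias is absorbed into the weights. Shapes and the `student` callable are illustrative:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher_text_ids: torch.Tensor) -> torch.Tensor:
    # teacher_text_ids: (batch, seq) token ids sampled from the watermarked teacher.
    logits = student(teacher_text_ids[:, :-1])   # (batch, seq-1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),     # flatten predictions
        teacher_text_ids[:, 1:].reshape(-1),     # next-token targets
    )
    return loss  # minimizing this bakes the watermark bias into the weights
```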
- I Know You Did Not Write That! A Sampling Based Watermarking Method for Identifying Machine Generated Text [0.0]
We propose a new watermarking method to detect machine-generated texts.
Our method embeds a unique pattern within the generated text.
We show how watermarking affects textual quality and compare our proposed method with a state-of-the-art watermarking method.
arXiv Detail & Related papers (2023-11-29T20:04:57Z)
- WatME: Towards Lossless Watermarking Through Lexical Redundancy [58.61972059246715]
This study assesses the impact of watermarking on different capabilities of large language models (LLMs) through a cognitive science lens.
We introduce Watermarking with Mutual Exclusion (WatME), which exploits lexical redundancy to integrate watermarks seamlessly.
arXiv Detail & Related papers (2023-11-16T11:58:31Z)
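A rough sketch of the mutual-exclusion intuition, assuming pre-computed synonym clusters (the clustering itself and the split rule are illustrative assumptions):

```python
import random

def build_green_list(synonym_clusters: list[list[int]], seed: int) -> set[int]:
    # Split every synonym cluster between green and red so that a green
    # alternative always exists and the watermark never blocks a whole cluster.
    rng = random.Random(seed)
    green: set[int] = set()
    for cluster in synonym_clusters:
        members = cluster[:]
        rng.shuffle(members)
        green.update(members[: max(1, len(members) // 2)])
    return green
```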
- Turning Your Strength into Watermark: Watermarking Large Language Model via Knowledge Injection [66.26348985345776]
We propose a novel watermarking method for large language models (LLMs) based on knowledge injection.
In the watermark embedding stage, we first embed the watermark into selected knowledge to obtain the watermarked knowledge.
In the watermark extraction stage, questions related to the watermarked knowledge are designed to query the suspect LLM.
Experiments show that the watermark extraction success rate is close to 100% and demonstrate the effectiveness, fidelity, stealthiness, and robustness of our proposed method.
arXiv Detail & Related papers (2023-11-16T03:22:53Z)
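The extraction stage can be pictured as a probing loop; `ask` below stands in for whatever inference access to the suspect model is available, and the matching rule is an illustrative simplification:

```python
def extraction_success(ask, probes: list[tuple[str, str]]) -> float:
    # probes: (question, expected watermark answer) pairs; `ask` queries the
    # suspect model and returns its answer as a string.
    hits = sum(expected.lower() in ask(question).lower()
               for question, expected in probes)
    return hits / len(probes)
```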
- Towards Codable Watermarking for Injecting Multi-bits Information to LLMs [86.86436777626959]
Large language models (LLMs) generate texts with increasing fluency and realism.
Existing watermarking methods are encoding-inefficient and cannot flexibly meet the diverse information encoding needs.
We propose Codable Text Watermarking for LLMs (CTWL) that allows text watermarks to carry multi-bit customizable information.
arXiv Detail & Related papers (2023-07-29T14:11:15Z)
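One generic way to picture multi-bit encoding, not necessarily CTWL's actual scheme: let each message bit select which half of a key-dependent vocabulary partition receives the logit boost, so a detector that recovers the favored half recovers the bit:

```python
import numpy as np

def bias_for_bit(logits: np.ndarray, partition: np.ndarray, bit: int,
                 delta: float = 4.0) -> np.ndarray:
    # partition: boolean mask splitting the vocabulary into two halves; the
    # current message bit decides which half is favored at this step.
    favored = partition if bit == 1 else ~partition
    return logits + delta * favored
```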
- On the Reliability of Watermarks for Large Language Models [95.87476978352659]
We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document.
We find that watermarks remain detectable even after human and machine paraphrasing.
We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document.
arXiv Detail & Related papers (2023-06-07T17:58:48Z)
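A minimal sketch of a window-based detector of the kind described, with illustrative constants: slide a fixed window over per-token green/red indicators and keep the maximum z-score, so a short watermarked span inside a long document can still surface:

```python
import math

def max_window_z(green_flags: list[bool], window: int = 50,
                 gamma: float = 0.25) -> float:
    # Scan fixed-size windows of green indicators and return the highest
    # z-score; compare it to a multiple-testing-corrected threshold.
    best = float("-inf")
    for i in range(len(green_flags) - window + 1):
        greens = sum(green_flags[i : i + window])
        best = max(best, (greens - gamma * window)
                   / math.sqrt(gamma * (1 - gamma) * window))
    return best
```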
- A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters.
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
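For reference, a minimal sketch of the soft green-list scheme at generation time (constants and the hash-based seeding are illustrative simplifications):

```python
import hashlib
import numpy as np

GAMMA, DELTA = 0.25, 2.0  # green-list fraction, logit bias

def green_mask(prev_token: int, key: bytes, vocab_size: int) -> np.ndarray:
    # Seed a PRNG from the secret key and previous token, then draw the green subset.
    seed = int.from_bytes(
        hashlib.sha256(key + prev_token.to_bytes(4, "big")).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    mask = np.zeros(vocab_size, dtype=bool)
    mask[rng.choice(vocab_size, int(GAMMA * vocab_size), replace=False)] = True
    return mask

def watermarked_sample(logits: np.ndarray, prev_token: int, key: bytes) -> int:
    # Softly promote green tokens, then sample from the re-normalized softmax.
    biased = logits + DELTA * green_mask(prev_token, key, logits.size)
    probs = np.exp(biased - biased.max())
    probs /= probs.sum()
    return int(np.random.default_rng().choice(logits.size, p=probs))
```

The soft bias only nudges the distribution, which is why the impact on text quality is negligible while detection (counting green tokens against the secret key) needs no access to the model itself.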