Black-Box Detection of Language Model Watermarks
- URL: http://arxiv.org/abs/2405.20777v2
- Date: Sat, 13 Jul 2024 15:47:35 GMT
- Title: Black-Box Detection of Language Model Watermarks
- Authors: Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev,
- Abstract summary: We develop rigorous statistical tests to detect the presence of all three most popular watermarking scheme families using only a limited number of black-box queries.
Our findings indicate that current watermarking schemes are more detectable than previously believed, and that obscuring the fact that a watermark was deployed may not be a viable way for providers to protect against adversaries.
- Score: 1.9374282535132377
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Watermarking has emerged as a promising way to detect LLM-generated text. To apply a watermark an LLM provider, given a secret key, augments generations with a signal that is later detectable by any party with the same key. Recent work has proposed three main families of watermarking schemes, two of which focus on the property of preserving the LLM distribution. This is motivated by it being a tractable proxy for maintaining LLM capabilities, but also by the idea that concealing a watermark deployment makes it harder for malicious actors to hide misuse by avoiding a certain LLM or attacking its watermark. Yet, despite much discourse around detectability, no prior work has investigated if any of these scheme families are detectable in a realistic black-box setting. We tackle this for the first time, developing rigorous statistical tests to detect the presence of all three most popular watermarking scheme families using only a limited number of black-box queries. We experimentally confirm the effectiveness of our methods on a range of schemes and a diverse set of open-source models. Our findings indicate that current watermarking schemes are more detectable than previously believed, and that obscuring the fact that a watermark was deployed may not be a viable way for providers to protect against adversaries. We further apply our methods to test for watermark presence behind the most popular public APIs: GPT4, Claude 3, Gemini 1.0 Pro, finding no strong evidence of a watermark at this point in time.
Related papers
- Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation [58.85645136534301]
Existing watermarking schemes for sampled text often face trade-offs between maintaining text quality and ensuring robust detection against various attacks.
We propose a novel watermarking scheme that improves both detectability and text quality by introducing a cumulative watermark entropy threshold.
arXiv Detail & Related papers (2025-04-16T14:16:38Z) - Toward Breaking Watermarks in Distortion-free Large Language Models [11.922206306917435]
We show that it is possible to "compromise" the LLM and carry out a "spoofing" attack.
Specifically, we propose a mixed integer linear programming framework that accurately estimates the secret key used for watermarking.
arXiv Detail & Related papers (2025-02-25T19:52:55Z) - NSmark: Null Space Based Black-box Watermarking Defense Framework for Pre-trained Language Models [24.864736672581937]
We propose a task-agnostic, black-box watermarking scheme capable of resisting LL-LFEA attacks.
NSmark consists of three phases: (i) watermark generation using the digital signature of the owner, enhanced by spread spectrum modulation for increased robustness; (ii) watermark embedding through an output mapping extractor that preserves PLM performance while maximizing watermark capacity; (iii) watermark verification, assessed by extraction rate and null space conformity.
arXiv Detail & Related papers (2024-10-16T14:45:27Z) - Can Watermarked LLMs be Identified by Users via Crafted Prompts? [55.460327393792156]
This work is the first to investigate the imperceptibility of watermarked Large Language Models (LLMs)
We design an identification algorithm called Water-Probe that detects watermarks through well-designed prompts.
Experiments show that almost all mainstream watermarking algorithms are easily identified with our well-designed prompts.
arXiv Detail & Related papers (2024-10-04T06:01:27Z) - WaterSeeker: Pioneering Efficient Detection of Watermarked Segments in Large Documents [63.563031923075066]
WaterSeeker is a novel approach to efficiently detect and locate watermarked segments amid extensive natural text.
It achieves a superior balance between detection accuracy and computational efficiency.
arXiv Detail & Related papers (2024-09-08T14:45:47Z) - PostMark: A Robust Blackbox Watermark for Large Language Models [56.63560134428716]
We develop PostMark, a modular post-hoc watermarking procedure.
PostMark does not require logit access, which means it can be implemented by a third party.
We show that PostMark is more robust to paraphrasing attacks than existing watermarking methods.
arXiv Detail & Related papers (2024-06-20T17:27:14Z) - Large Language Model Watermark Stealing With Mixed Integer Programming [51.336009662771396]
Large Language Model (LLM) watermark shows promise in addressing copyright, monitoring AI-generated text, and preventing its misuse.
Recent research indicates that watermarking methods using numerous keys are susceptible to removal attacks.
We propose a novel green list stealing attack against the state-of-the-art LLM watermark scheme.
arXiv Detail & Related papers (2024-05-30T04:11:17Z) - Watermarking Low-entropy Generation for Large Language Models: An Unbiased and Low-risk Method [6.505831742654826]
STA-1 is an unbiased watermark that preserves the original token distribution in expectation.
Experimental results on low-entropy and high-entropy datasets demonstrate that STA-1 achieves the above properties simultaneously.
arXiv Detail & Related papers (2024-05-23T14:17:29Z) - A Statistical Framework of Watermarks for Large Language Models: Pivot, Detection Efficiency and Optimal Rules [27.678152860666163]
We introduce a framework for reasoning about the statistical efficiency of watermarks and powerful detection rules.
We derive optimal detection rules for watermarks under our framework.
arXiv Detail & Related papers (2024-04-01T17:03:41Z) - Learning to Watermark LLM-generated Text via Reinforcement Learning [16.61005372279407]
We study how to watermark LLM outputs to track misuse.
We design a model-level watermark that embeds signals into the output.
We propose a co-training framework based on reinforcement learning.
arXiv Detail & Related papers (2024-03-13T03:43:39Z) - Watermark Stealing in Large Language Models [2.1165011830664673]
We show that querying the API of the watermarked LLM to approximately reverse-engineer a watermark enables practical spoofing attacks.
We are the first to propose an automated WS algorithm and use it in the first comprehensive study of spoofing and scrubbing in realistic settings.
arXiv Detail & Related papers (2024-02-29T17:12:39Z) - Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models [31.062753031312006]
Large language models generate high-quality responses with potential misinformation.
Watermarking is pivotal in this context, which involves embedding hidden markers in texts.
We introduce a novel multi-objective optimization (MOO) approach for watermarking.
Our method simultaneously achieves detectability and semantic integrity.
arXiv Detail & Related papers (2024-02-28T05:43:22Z) - Proving membership in LLM pretraining data via data watermarks [20.57538940552033]
This work proposes using data watermarks to enable principled detection with only black-box model access.
We study two watermarks: one that inserts random sequences, and another that randomly substitutes characters with Unicode lookalikes.
We show that we can robustly detect hashes from BLOOM-176B's training data, as long as they occurred at least 90 times.
arXiv Detail & Related papers (2024-02-16T18:49:27Z) - Turning Your Strength into Watermark: Watermarking Large Language Model via Knowledge Injection [66.26348985345776]
We propose a novel watermarking method for large language models (LLMs) based on knowledge injection.
In the watermark embedding stage, we first embed the watermarks into the selected knowledge to obtain the watermarked knowledge.
In the watermark extraction stage, questions related to the watermarked knowledge are designed, for querying the suspect LLM.
Experiments show that the watermark extraction success rate is close to 100% and demonstrate the effectiveness, fidelity, stealthiness, and robustness of our proposed method.
arXiv Detail & Related papers (2023-11-16T03:22:53Z) - An Unforgeable Publicly Verifiable Watermark for Large Language Models [84.2805275589553]
Current watermark detection algorithms require the secret key used in the watermark generation process, making them susceptible to security breaches and counterfeiting during public detection.
We propose an unforgeable publicly verifiable watermark algorithm named UPV that uses two different neural networks for watermark generation and detection, instead of using the same key at both stages.
arXiv Detail & Related papers (2023-07-30T13:43:27Z) - On the Reliability of Watermarks for Large Language Models [95.87476978352659]
We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document.
We find that watermarks remain detectable even after human and machine paraphrasing.
We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document.
arXiv Detail & Related papers (2023-06-07T17:58:48Z) - A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters.
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.