Bypassing LLM Watermarks with Color-Aware Substitutions
- URL: http://arxiv.org/abs/2403.14719v1
- Date: Tue, 19 Mar 2024 17:54:39 GMT
- Title: Bypassing LLM Watermarks with Color-Aware Substitutions
- Authors: Qilong Wu, Varun Chandrasekaran
- Abstract summary: Self Color Testing-based Substitution (SCTS) is the first ``color-aware'' attack.
SCTS successfully evades watermark detection using fewer edits than related work.
We show both theoretically and empirically that SCTS can remove the watermark for arbitrarily long watermarked text.
- Score: 11.724935807582513
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Watermarking approaches are proposed to identify whether circulated text is human- or large language model (LLM)-generated. The state-of-the-art watermarking strategy of Kirchenbauer et al. (2023a) biases the LLM to generate specific (``green'') tokens. However, determining the robustness of this watermarking method is an open problem. Existing attack methods fail to evade detection for longer text segments. We overcome this limitation, and propose {\em Self Color Testing-based Substitution (SCTS)}, the first ``color-aware'' attack. SCTS obtains color information by strategically prompting the watermarked LLM and comparing output token frequencies. It uses this information to determine token colors, and substitutes green tokens with non-green ones. In our experiments, SCTS successfully evades watermark detection using fewer edits than related work. Additionally, we show both theoretically and empirically that SCTS can remove the watermark for arbitrarily long watermarked text.
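The mechanism described in the abstract lends itself to a toy simulation. The sketch below is an illustrative mock-up, not the authors' code: it fakes a Kirchenbauer-style watermarked model by adding a logit bias DELTA to green tokens, then infers a token's color the way SCTS does, by prompting for a uniform choice between two candidates and comparing output frequencies. All names and parameters (DELTA, trial counts, thresholds) are assumptions for illustration.

```python
import math
import random

DELTA = 2.0  # assumed logit bias the watermark adds to green tokens

def watermarked_choice(candidates, green_set, rng):
    """Sample one candidate from a softmax in which green tokens get an
    extra logit bias, mimicking a watermarked LM that was prompted to
    pick uniformly between the candidates."""
    weights = [math.exp(DELTA if c in green_set else 0.0) for c in candidates]
    r = rng.random() * sum(weights)
    acc = 0.0
    for cand, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return cand
    return candidates[-1]

def infer_color(token, reference, green_set, trials=400, seed=0):
    """Classify `token` by comparing how often the watermarked model
    emits it versus a reference alternative."""
    rng = random.Random(seed)
    wins = sum(
        watermarked_choice([token, reference], green_set, rng) == token
        for _ in range(trials)
    )
    freq = wins / trials
    # Without a watermark both candidates appear ~50% of the time; a
    # significant excess marks `token` as green relative to `reference`.
    # (Same-color pairs would need tests against additional references.)
    threshold = 0.5 + 2 * math.sqrt(0.25 / trials)
    return "green" if freq > threshold else "non-green"

green_truth = {"cat"}  # hidden ground truth in this toy setup
print(infer_color("cat", "dog", green_truth))  # -> green
print(infer_color("dog", "cat", green_truth))  # -> non-green
```

A green token identified this way would then be replaced with a synonymous non-green candidate, driving the detector's green-token count back toward chance.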
Related papers
- Less is More: Sparse Watermarking in LLMs with Enhanced Text Quality [27.592486717044455]
We present a novel type of watermark, Sparse Watermark, which aims to mitigate the trade-off between watermark detectability and text quality by applying watermarks to a small subset of generated tokens distributed across the text.
Our experimental results demonstrate that the proposed watermarking scheme achieves high detectability while generating text whose quality surpasses that of previous watermarking methods across various tasks (a minimal sketch of the sparse idea follows this entry).
arXiv Detail & Related papers (2024-07-17T18:52:12Z)
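The blurb gives only the high-level idea, so the following is a minimal sketch under an assumed selection rule (every k-th token stands in for the paper's actual rule): watermark bias is applied, and later tested, only at the selected positions, leaving the rest of the text unbiased.

```python
import math

def sparse_zscore(tokens, is_green, k=5, gamma=0.5):
    """Green-fraction z-test restricted to every k-th position, an
    assumed stand-in for the paper's token-selection rule.
    `is_green(prev, tok)` is the watermark's coloring predicate."""
    tested = greens = 0
    for i in range(1, len(tokens), k):
        tested += 1
        greens += bool(is_green(tokens[i - 1], tokens[i]))
    if tested == 0:
        return 0.0
    # Under H0 (no watermark) each tested token is green w.p. gamma.
    return (greens - gamma * tested) / math.sqrt(tested * gamma * (1 - gamma))
```

Because only about 1/k of positions carry the watermark, quality degradation is confined to those tokens, while the z-test over the selected positions retains detection power.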
- Large Language Model Watermark Stealing With Mixed Integer Programming [51.336009662771396]
Large Language Model (LLM) watermarking shows promise in addressing copyright concerns, monitoring AI-generated text, and preventing its misuse.
Recent research indicates that watermarking methods using numerous keys are susceptible to removal attacks.
We propose a novel green list stealing attack against the state-of-the-art LLM watermark scheme (a simplified frequency-based sketch follows this entry).
arXiv Detail & Related papers (2024-05-30T04:11:17Z)
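The abstract names the attack but not its formulation, so the sketch below deliberately swaps the paper's mixed integer program for a much cruder frequency-ratio heuristic: tokens a watermarked model emits noticeably more often than a non-watermarked baseline are guessed to be green. All counts and thresholds are illustrative assumptions.

```python
from collections import Counter

def estimate_green_list(wm_texts, base_texts, min_count=20, ratio=1.5):
    """Guess green tokens as those over-represented in watermarked output
    relative to a non-watermarked baseline. A crude stand-in for the
    paper's mixed-integer program."""
    wm = Counter(tok for text in wm_texts for tok in text.split())
    base = Counter(tok for text in base_texts for tok in text.split())
    wm_total = sum(wm.values()) or 1
    base_total = sum(base.values()) or 1
    green = set()
    for tok, count in wm.items():
        if count < min_count:
            continue  # too rare for a reliable frequency estimate
        p_wm = count / wm_total
        p_base = (base[tok] + 1) / (base_total + len(base) + 1)  # add-one smoothing
        if p_wm / p_base > ratio:
            green.add(tok)
    return green
```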
- Can Watermarks Survive Translation? On the Cross-lingual Consistency of Text Watermark for Large Language Models [48.409979469683975]
We introduce the concept of cross-lingual consistency in text watermarking.
Preliminary empirical results reveal that current text watermarking technologies lack consistency when texts are translated into various languages.
We propose a Cross-lingual Watermark Removal Attack (CWRA) to bypass watermarking.
arXiv Detail & Related papers (2024-02-21T18:48:38Z)
- Provably Robust Multi-bit Watermarking for AI-generated Text via Error Correction Code [39.96262132464419]
Large Language Models (LLMs) have been widely deployed for their capability to generate texts resembling human language.
They could be misused by criminals to create deceptive content, such as fake news and phishing emails.
Watermarking, which embeds a watermark into generated text, is a key technique for mitigating the misuse of LLMs.
arXiv Detail & Related papers (2024-01-30T08:46:48Z)
- Turning Your Strength into Watermark: Watermarking Large Language Model via Knowledge Injection [66.26348985345776]
We propose a novel watermarking method for large language models (LLMs) based on knowledge injection.
In the watermark embedding stage, we first embed the watermark into selected knowledge, yielding watermarked knowledge.
In the watermark extraction stage, questions related to the watermarked knowledge are designed to query the suspect LLM.
Experiments show that the watermark extraction success rate is close to 100% and demonstrate the effectiveness, fidelity, stealthiness, and robustness of our proposed method.
arXiv Detail & Related papers (2023-11-16T03:22:53Z)
- On the Reliability of Watermarks for Large Language Models [95.87476978352659]
We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document.
We find that watermarks remain detectable even after human and machine paraphrasing.
We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document (a windowed-detection sketch follows this entry).
arXiv Detail & Related papers (2023-06-07T17:58:48Z)
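The last point suggests detectors that score local windows rather than the whole document; a sliding-window variant of the usual z-test is one simple instance. This is a sketch under assumed parameters, not the paper's exact schemes.

```python
import math

def windowed_detect(colors, window=50, gamma=0.5, z_thresh=4.0):
    """Flag a document if any fixed-size window of token colors
    (1 = green, 0 = not) clears a one-proportion z-test. Window size
    and threshold are illustrative assumptions."""
    if not colors:
        return False, 0.0
    best = float("-inf")
    for start in range(max(1, len(colors) - window + 1)):
        chunk = colors[start:start + window]
        n = len(chunk)
        z = (sum(chunk) - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
        best = max(best, z)
    return best >= z_thresh, best
```

Taking a maximum over many windows inflates the false-positive rate, so the threshold must be set more conservatively than for a single global test.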
- Watermarking Text Generated by Black-Box Language Models [103.52541557216766]
A watermark-based method was proposed for white-box LLMs, allowing them to embed watermarks during text generation.
A detection algorithm aware of the underlying token list can identify the watermarked text.
We develop a watermarking framework for black-box language model usage scenarios.
arXiv Detail & Related papers (2023-05-14T07:37:33Z)
- A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters (a minimal sketch of green-list detection follows this entry).
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
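Since several entries above attack or extend this green-list scheme, a minimal sketch helps fix ideas. Assumptions: SHA-256 seeding, per-token-pair Bernoulli coloring as a simplification of the seeded vocabulary partition, and arbitrary GAMMA and threshold. Generation would upweight tokens for which is_green is true; detection needs only the shared key, not the model.

```python
import hashlib
import math
import random

GAMMA = 0.5  # assumed fraction of the vocabulary colored green

def is_green(prev_token: str, token: str, key: str = "secret-key") -> bool:
    """Pseudorandomly color `token` green with probability GAMMA, seeded
    by the preceding token and a shared key (illustrative hash; a
    per-pair coloring is a common simplification of the seeded
    vocabulary split)."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return random.Random(digest).random() < GAMMA

def detect(tokens, key: str = "secret-key", z_thresh: float = 4.0):
    """One-proportion z-test on the green fraction. The detector needs
    only the key and the token sequence, not model access."""
    n = len(tokens) - 1
    if n <= 0:
        return False, 0.0
    greens = sum(is_green(tokens[i], tokens[i + 1], key) for i in range(n))
    z = (greens - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
    return z >= z_thresh, z
```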
This list is automatically generated from the titles and abstracts of the papers on this site.