Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks
- URL: http://arxiv.org/abs/2505.05190v2
- Date: Sun, 11 May 2025 14:24:22 GMT
- Title: Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks
- Authors: Yixin Cheng, Hongcheng Guo, Yangming Li, Leonid Sigal
- Abstract summary: Text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic, efficient paraphrasing attack, which leverages the vulnerability by calculating the self-information of each token.
- Score: 36.01146548147208
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text watermarking aims to subtly embed statistical signals into text by controlling the Large Language Model (LLM)'s sampling process, enabling watermark detectors to verify that the output was generated by the specified model. The robustness of these watermarking algorithms has become a key factor in evaluating their effectiveness. Current text watermarking algorithms embed watermarks in high-entropy tokens to ensure text quality. In this paper, we reveal that this seemingly benign design can be exploited by attackers, posing a significant risk to the robustness of the watermark. We introduce a generic, efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA), which leverages this vulnerability by calculating the self-information of each token to identify potential pattern tokens and perform targeted attacks. Our work exposes a widely prevalent vulnerability in current watermarking algorithms. The experimental results show that SIRA achieves nearly 100% attack success rates on seven recent watermarking methods at a cost of only 0.88 USD per million tokens. Our approach does not require any access to the watermark algorithms or the watermarked LLM and can seamlessly transfer to any LLM as the attack model, even mobile-level models. Our findings highlight the urgent need for more robust watermarking.
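The scoring step at the heart of SIRA is easy to illustrate. The following is a minimal sketch, not the authors' released code: it uses a small proxy LM to estimate each token's self-information and flags the highest-scoring tokens as candidate pattern tokens. The choice of GPT-2 as the proxy and the `top_frac` threshold are illustrative assumptions; per the abstract, the attack needs no access to the watermarked LLM.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Proxy scoring model (illustrative assumption; any small causal LM that can
# estimate token probabilities should serve here).
tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_self_information(text: str) -> list[tuple[str, float]]:
    """Return (token, self-information) pairs, where the self-information of
    token x_t is -log2 p(x_t | x_<t) under the proxy LM."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    # Logits at position t predict token t+1: align predictions with targets.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = ids[0, 1:]
    bits = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1) / math.log(2)
    return list(zip(tok.convert_ids_to_tokens(targets.tolist()), bits.tolist()))

def flag_pattern_tokens(text: str, top_frac: float = 0.3) -> list[tuple[str, bool]]:
    """Flag the top_frac highest self-information tokens as likely watermark
    carriers (high-entropy positions are where watermarks tend to be embedded)."""
    scored = token_self_information(text)
    cutoff = sorted(s for _, s in scored)[int(len(scored) * (1 - top_frac))]
    return [(token, score >= cutoff) for token, score in scored]
```

A paraphrasing model would then concentrate its rewrites on the flagged tokens, leaving low-self-information (low-entropy) tokens largely untouched.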
Related papers
- When There Is No Decoder: Removing Watermarks from Stable Diffusion Models in a No-box Setting [37.85082375268253]
We study the robustness of model-specific watermarking, where watermark embedding is integrated with text-to-image generation. We introduce three attack strategies: edge prediction-based, box blurring, and fine-tuning-based attacks in a no-box setting. Our best-performing attack reduces watermark detection accuracy to approximately 47.92%.
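Of the three strategies, box blurring is the simplest to make concrete. A hypothetical minimal sketch follows; the function name, file path, and radius are placeholders, and the paper's actual attack pipeline is more involved.

```python
from PIL import Image, ImageFilter

def box_blur_attack(image_path: str, radius: int = 2) -> Image.Image:
    """No-box removal attempt: box-blur the generated image to smear
    pixel-level watermark patterns while keeping the content recognizable."""
    return Image.open(image_path).filter(ImageFilter.BoxBlur(radius))
```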
arXiv Detail & Related papers (2025-07-04T15:22:20Z)
- Revisiting the Robustness of Watermarking to Paraphrasing Attacks [10.68370011459729]
Many recent watermarking techniques modify the output probabilities of LMs to embed a signal in the generated output that can later be detected.
We show that with access to only a limited number of generations from a black-box watermarked model, we can drastically increase the effectiveness of paraphrasing attacks to evade watermark detection.
arXiv Detail & Related papers (2024-11-08T02:22:30Z)
- Robustness of Watermarking on Text-to-Image Diffusion Models [9.277492743469235]
We investigate the robustness of generative watermarking, which is created from the integration of watermarking embedding and text-to-image generation processing.
We found that generative watermarking methods are robust to direct evasion attacks, like discriminator-based attacks, or manipulation based on the edge information in edge prediction-based attacks but vulnerable to malicious fine-tuning.
arXiv Detail & Related papers (2024-08-04T13:59:09Z)
- Watermark Smoothing Attacks against Language Models [40.02225709485305]
We introduce the Smoothing Attack, a novel watermark removal method. We validate our attack on open-source models ranging from 1.3B to 30B parameters.
arXiv Detail & Related papers (2024-07-19T11:04:54Z)
- Large Language Model Watermark Stealing With Mixed Integer Programming [51.336009662771396]
Large Language Model (LLM) watermarking shows promise in addressing copyright concerns, monitoring AI-generated text, and preventing its misuse.
Recent research indicates that watermarking methods using numerous keys are susceptible to removal attacks.
We propose a novel green list stealing attack against the state-of-the-art LLM watermark scheme.
arXiv Detail & Related papers (2024-05-30T04:11:17Z)
- Topic-Based Watermarks for Large Language Models [46.71493672772134]
We propose a lightweight, topic-guided watermarking scheme for Large Language Model (LLM) output. Our method achieves comparable perplexity to industry-leading systems, including Google's SynthID-Text.
arXiv Detail & Related papers (2024-04-02T17:49:40Z)
- Adaptive Text Watermark for Large Language Models [8.100123266517299]
It is challenging to generate high-quality watermarked text while maintaining strong security, robustness, and the ability to detect watermarks without prior knowledge of the prompt or model.
This paper proposes an adaptive watermarking strategy to address this problem.
arXiv Detail & Related papers (2024-01-25T03:57:12Z)
- Improving the Generation Quality of Watermarked Large Language Models via Word Importance Scoring [81.62249424226084]
Token-level watermarking inserts watermarks in the generated texts by altering the token probability distributions.
This watermarking algorithm alters the logits during generation, which can lead to downgraded text quality.
We propose to improve the quality of texts generated by a watermarked language model through Watermarking with Importance Scoring (WIS).
arXiv Detail & Related papers (2023-11-16T08:36:00Z)
- Turning Your Strength into Watermark: Watermarking Large Language Model via Knowledge Injection [66.26348985345776]
We propose a novel watermarking method for large language models (LLMs) based on knowledge injection.
In the watermark embedding stage, we first embed the watermarks into the selected knowledge to obtain the watermarked knowledge.
In the watermark extraction stage, questions related to the watermarked knowledge are designed to query the suspect LLM.
Experiments show that the watermark extraction success rate is close to 100% and demonstrate the effectiveness, fidelity, stealthiness, and robustness of our proposed method.
arXiv Detail & Related papers (2023-11-16T03:22:53Z)
- On the Reliability of Watermarks for Large Language Models [95.87476978352659]
We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document.
We find that watermarks remain detectable even after human and machine paraphrasing.
We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document.
arXiv Detail & Related papers (2023-06-07T17:58:48Z)
- A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters.
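The mechanics behind this and the other logit-altering schemes above reduce to two steps: bias a pseudorandom "green list" of tokens during generation, then run a z-test on the green-token fraction at detection time. Below is a simplified sketch of that design, not the paper's reference implementation; the vocabulary size, gamma (green fraction), and delta (bias strength) are illustrative values.

```python
import hashlib
import math

import torch

VOCAB_SIZE = 50257       # e.g., GPT-2's vocabulary (illustrative)
GAMMA, DELTA = 0.5, 2.0  # green-list fraction and logit bias (illustrative)

def green_list(prev_token: int) -> set[int]:
    """Pseudorandom green list seeded by the previous token, reproducible by
    a detector that has no access to the model's API or parameters."""
    seed = int.from_bytes(hashlib.sha256(str(prev_token).encode()).digest()[:8], "big")
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(VOCAB_SIZE, generator=gen)
    return set(perm[: int(GAMMA * VOCAB_SIZE)].tolist())

def bias_logits(logits: torch.Tensor, prev_token: int) -> torch.Tensor:
    """Embedding step: nudge green-list logits up by DELTA before sampling
    the next token (logits is a 1-D tensor over the vocabulary)."""
    biased = logits.clone()
    biased[list(green_list(prev_token))] += DELTA
    return biased

def detection_z(tokens: list[int]) -> float:
    """Detection step: z-statistic of the observed green-token count against
    the GAMMA fraction expected by chance; a large z suggests a watermark."""
    hits = sum(t in green_list(p) for p, t in zip(tokens, tokens[1:]))
    n = max(len(tokens) - 1, 1)
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```

Running detection_z over sliding windows is one way to realize the short-span detection idea noted in the "On the Reliability of Watermarks" entry above.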
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
- Fine-tuning Is Not Enough: A Simple yet Effective Watermark Removal Attack for DNN Models [72.9364216776529]
We propose a novel watermark removal attack from a different perspective.
We design a simple yet powerful transformation algorithm by combining imperceptible pattern embedding and spatial-level transformations.
Our attack can bypass state-of-the-art watermarking solutions with very high success rates.
arXiv Detail & Related papers (2020-09-18T09:14:54Z)