Discovering Clues of Spoofed LM Watermarks
- URL: http://arxiv.org/abs/2410.02693v1
- Date: Thu, 3 Oct 2024 17:18:37 GMT
- Title: Discovering Clues of Spoofed LM Watermarks
- Authors: Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev
- Abstract summary: We show that there are observable differences between genuine and spoofed watermark texts.
We propose rigorous statistical tests that reliably reveal the presence of such artifacts.
- Score: 1.9374282535132377
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. While recent works have demonstrated that state-of-the-art schemes are in fact vulnerable to spoofing, they lack deeper qualitative analysis of the texts produced by spoofing methods. In this work, we for the first time reveal that there are observable differences between genuine and spoofed watermark texts. Namely, we show that regardless of their underlying approach, all current spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts, effectively discovering that a watermark was spoofed. Our experimental evaluation shows high test power across all current spoofing methods, providing insights into their fundamental limitations, and suggesting a way to mitigate this threat.
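The abstract does not spell out the statistical tests; the sketch below is only a toy illustration under assumed conditions, not the authors' method. It assumes a KGW-style scheme whose per-position green list is seeded by the previous token, and it probes one plausible forgery artifact: a spoofer that learned the green lists only for contexts it observed should show a watermark signal that collapses on unseen contexts. The `is_green` hash, the `observed_contexts` split, and the gap threshold are all assumptions.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary that is "green" per context


def is_green(prev_token: int, token: int, key: bytes = b"watermark-key") -> bool:
    # Pseudo-random green/red assignment seeded by (key, previous token),
    # mimicking a KGW-style scheme; the real key is unknown to outsiders.
    h = hashlib.sha256(key + prev_token.to_bytes(4, "big") + token.to_bytes(4, "big"))
    return int.from_bytes(h.digest()[:8], "big") < GAMMA * 2**64


def z_score(green_hits: int, n: int, p0: float = GAMMA) -> float:
    # One-proportion z-test: is the green fraction significantly above p0?
    if n == 0:
        return 0.0
    return (green_hits - p0 * n) / math.sqrt(p0 * (1 - p0) * n)


def spoofing_artifact_test(tokens: list[int], observed_contexts: set[int],
                           gap_threshold: float = 2.0):
    """Split the detector statistic by whether the preceding token is a context
    the suspected spoofer could have observed. A genuine watermark is strong on
    both splits; a large gap is the kind of artifact a forgery might leave."""
    hits = {True: 0, False: 0}
    counts = {True: 0, False: 0}
    for prev, cur in zip(tokens, tokens[1:]):
        seen = prev in observed_contexts
        hits[seen] += is_green(prev, cur)
        counts[seen] += 1
    z_seen = z_score(hits[True], counts[True])
    z_unseen = z_score(hits[False], counts[False])
    return z_seen, z_unseen, (z_seen - z_unseen) > gap_threshold
```

On genuinely watermarked text both z-scores should be comparably high; text that passes the overall detector but fails the split comparison would be a candidate forgery.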
Related papers
- Defending LLM Watermarking Against Spoofing Attacks with Contrastive Representation Learning [34.76886510334969]
A piggyback attack can maliciously alter the meaning of watermarked text, transforming it into hate speech, while preserving the original watermark.
We propose a semantic-aware watermarking algorithm that embeds watermarks into a given target text while preserving its original meaning.
arXiv Detail & Related papers (2025-04-09T04:38:17Z)
- Modification and Generated-Text Detection: Achieving Dual Detection Capabilities for the Outputs of LLM by Watermark
One practical solution is to embed a watermark in the text, allowing ownership verification through watermark extraction.
Existing methods primarily focus on defending against modification attacks, often neglecting other spoofing attacks.
We propose a modification-detection technique for unbiased watermarks that is sensitive to text modifications.
arXiv Detail & Related papers (2025-02-12T11:56:40Z)
- Revisiting the Robustness of Watermarking to Paraphrasing Attacks
Many recent watermarking techniques modify the output probabilities of LMs to embed a signal in the generated output that can later be detected.
We show that with access to only a limited number of generations from a black-box watermarked model, we can drastically increase the effectiveness of paraphrasing attacks to evade watermark detection.
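For background on the mechanism this summary refers to, here is a minimal sketch of KGW-style soft watermarking: a constant bias delta is added to green-token logits before sampling. It reuses the hypothetical `is_green` helper from the sketch above and assumes it is in scope; the value of delta and the plain softmax sampling are illustrative assumptions.

```python
import math
import random


def watermarked_sample(logits: list[float], prev_token: int, delta: float = 2.0) -> int:
    # Add a bias to green-token logits (is_green as in the earlier sketch),
    # then sample from the resulting softmax distribution.
    biased = [l + delta if is_green(prev_token, t) else l for t, l in enumerate(logits)]
    m = max(biased)  # subtract max for numerical stability
    weights = [math.exp(l - m) for l in biased]
    return random.choices(range(len(logits)), weights=weights, k=1)[0]
```

Paraphrasing evades detection precisely because rewording changes the (previous token, token) pairs whose green bias the detector counts.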
arXiv Detail & Related papers (2024-11-08T02:22:30Z)
- Can Watermarked LLMs be Identified by Users via Crafted Prompts?
This work is the first to investigate the imperceptibility of watermarked Large Language Models (LLMs).
We design an identification algorithm called Water-Probe that detects watermarks through well-designed prompts.
Experiments show that almost all mainstream watermarking algorithms are easily identified with our well-designed prompts.
arXiv Detail & Related papers (2024-10-04T06:01:27Z)
- Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data?
We investigate the effectiveness of watermarking as a deterrent against the generation of copyrighted texts.
We find that watermarking adversely affects the success rate of Membership Inference Attacks (MIAs).
We propose an adaptive technique to improve the success rate of a recent MIA under watermarking.
arXiv Detail & Related papers (2024-07-24T16:53:09Z)
- On Evaluating The Performance of Watermarked Machine-Generated Texts Under Adversarial Attacks
We first survey the mainstream watermarking schemes and removal attacks on machine-generated texts.
We evaluate eight watermarks (five pre-text, three post-text) and twelve attacks (two pre-text, ten post-text) across 87 scenarios.
Results indicate that KGW and Exponential watermarks offer high text quality and watermark retention but remain vulnerable to most attacks.
arXiv Detail & Related papers (2024-07-05T18:09:06Z)
- Bileve: Securing Text Provenance in Large Language Models Against Spoofing with Bi-level Signature
We introduce a bi-level signature scheme, Bileve, which embeds fine-grained signature bits for integrity checks.
Bileve can distinguish five scenarios during detection, reliably tracing text provenance and regulating LLMs.
arXiv Detail & Related papers (2024-06-04T03:58:14Z)
- Watermark Stealing in Large Language Models
We show that querying the API of the watermarked LLM to approximately reverse-engineer a watermark enables practical spoofing attacks.
We are the first to propose an automated watermark stealing (WS) algorithm and use it in the first comprehensive study of spoofing and scrubbing attacks in realistic settings.
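As a hedged illustration of what such reverse-engineering could look like (not this paper's algorithm), the sketch below compares next-token frequencies in watermarked outputs against a reference corpus and guesses that over-represented tokens are green; the boost factor and add-one smoothing are assumptions.

```python
from collections import Counter, defaultdict


def estimate_green_lists(watermarked_docs, reference_docs, boost: float = 1.5):
    # Count next-token frequencies per single-token context in both corpora;
    # the watermarked docs stand in for responses collected via API queries.
    wm, ref = defaultdict(Counter), defaultdict(Counter)
    for docs, table in ((watermarked_docs, wm), (reference_docs, ref)):
        for doc in docs:
            for prev, cur in zip(doc, doc[1:]):
                table[prev][cur] += 1
    # Tokens over-represented after a context in watermarked text are
    # guessed to be green for that context.
    green = defaultdict(set)
    for prev, counts in wm.items():
        n_wm = sum(counts.values())
        n_ref = sum(ref[prev].values())
        for tok, c in counts.items():
            p_wm = c / n_wm
            p_ref = (ref[prev][tok] + 1) / (n_ref + 1)  # smoothed reference rate
            if p_wm > boost * p_ref:
                green[prev].add(tok)
    return green
```

A spoofer would then bias its own generations toward these guessed green lists; because the estimate only covers contexts that appeared in the queried outputs, it is exactly the kind of partial knowledge that artifact tests like the one sketched earlier can exploit.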
arXiv Detail & Related papers (2024-02-29T17:12:39Z)
- WatME: Towards Lossless Watermarking Through Lexical Redundancy
This study assesses the impact of watermarking on different capabilities of large language models (LLMs) from a cognitive science lens.
We introduce Watermarking with Mutual Exclusion (WatME), which exploits lexical redundancy to seamlessly integrate watermarks.
arXiv Detail & Related papers (2023-11-16T11:58:31Z)
- Turning Your Strength into Watermark: Watermarking Large Language Model via Knowledge Injection
We propose a novel watermarking method for large language models (LLMs) based on knowledge injection.
In the watermark embedding stage, we first embed the watermarks into the selected knowledge to obtain the watermarked knowledge.
In the watermark extraction stage, questions related to the watermarked knowledge are designed to query the suspect LLM.
Experiments show that the watermark extraction success rate is close to 100% and demonstrate the effectiveness, fidelity, stealthiness, and robustness of our proposed method.
arXiv Detail & Related papers (2023-11-16T03:22:53Z)
- On the Reliability of Watermarks for Large Language Models
We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document.
We find that watermarks remain detectable even after human and machine paraphrasing.
We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document.
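One simple form such a span-sensitive scheme could take, sketched under the same assumptions as the first example (reusing the hypothetical `is_green` and `z_score` helpers), is a sliding window that keeps the maximum z-score; the window size and threshold here are arbitrary.

```python
def windowed_detect(tokens: list[int], window: int = 64, threshold: float = 4.0):
    # Score every window of consecutive tokens and keep the best z-score,
    # so a short watermarked span inside a long document can still be found.
    greens = [is_green(p, t) for p, t in zip(tokens, tokens[1:])]
    best = 0.0
    for i in range(max(1, len(greens) - window + 1)):
        chunk = greens[i:i + window]
        best = max(best, z_score(sum(chunk), len(chunk)))
    return best, best > threshold
```

A real deployment would have to correct the detection threshold for the multiple comparisons this scan introduces.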
arXiv Detail & Related papers (2023-06-07T17:58:48Z)
- Tracing Text Provenance via Context-Aware Lexical Substitution
We propose a natural language watermarking scheme based on context-aware lexical substitution.
Under both objective and subjective metrics, our watermarking scheme can well preserve the semantic integrity of original sentences.
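As a toy illustration of bit-carrying lexical substitution (not this paper's method, which selects substitutes with a context-aware model), the sketch below encodes one bit per occurrence of a word that has an assumed interchangeable synonym; the synonym table is a hypothetical stand-in.

```python
PAIRS = {"big": "large", "quick": "fast", "begin": "start"}  # assumed synonym pairs
# Map every variant to its (bit-0, bit-1) pair so either form can carry a bit.
VARIANTS = {w: (w, s) for w, s in PAIRS.items()} | {s: (w, s) for w, s in PAIRS.items()}


def embed_bits(words: list[str], bits: list[int]) -> list[str]:
    # Replace each substitutable word with the variant selected by the next bit.
    out, i = [], 0
    for w in words:
        if w in VARIANTS and i < len(bits):
            out.append(VARIANTS[w][bits[i]])
            i += 1
        else:
            out.append(w)
    return out


def extract_bits(words: list[str]) -> list[int]:
    # Read back which variant of each substitutable word appears.
    return [VARIANTS[w].index(w) for w in words if w in VARIANTS]
```

The context-aware part of the actual scheme lies in choosing substitutes that fit the surrounding sentence, which is what the reported semantic-integrity metrics evaluate.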
arXiv Detail & Related papers (2021-12-15T04:27:33Z)