Defending LLM Watermarking Against Spoofing Attacks with Contrastive Representation Learning
- URL: http://arxiv.org/abs/2504.06575v2
- Date: Thu, 10 Apr 2025 03:23:40 GMT
- Title: Defending LLM Watermarking Against Spoofing Attacks with Contrastive Representation Learning
- Authors: Li An, Yujian Liu, Yepeng Liu, Yang Zhang, Yuheng Bu, Shiyu Chang
- Abstract summary: A piggyback attack can maliciously alter the meaning of watermarked text, transforming it into hate speech, while preserving the original watermark. We propose a semantic-aware watermarking algorithm that embeds watermarks into a given target text while preserving its original meaning.
- Score: 34.76886510334969
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Watermarking has emerged as a promising technique for detecting texts generated by LLMs. Current research has primarily focused on three design criteria: high quality of the watermarked text, high detectability, and robustness against removal attacks. However, security against spoofing attacks remains relatively understudied. For example, a piggyback attack can maliciously alter the meaning of watermarked text, transforming it into hate speech, while preserving the original watermark, thereby damaging the reputation of the LLM provider. We identify two core challenges that make defending against spoofing difficult: (1) the need for watermarks to be both sensitive to semantic-distorting changes and insensitive to semantic-preserving edits, and (2) the contradiction between the need to detect global semantic shifts and the local, auto-regressive nature of most watermarking schemes. To address these challenges, we propose a semantic-aware watermarking algorithm that post-hoc embeds watermarks into a given target text while preserving its original meaning. Our method introduces a semantic mapping model, which guides the generation of a green-red token list and is contrastively trained to be sensitive to semantic-distorting changes and insensitive to semantic-preserving changes. Experiments on two standard benchmarks demonstrate strong robustness against removal attacks and security against spoofing attacks, including sentiment reversal and toxic content insertion, while maintaining high watermark detectability. Our approach offers a significant step toward more secure and semantically aware watermarking for LLMs. Our code is available at https://github.com/UCSB-NLP-Chang/contrastive-watermark.
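To make the contrastive objective concrete, below is a minimal sketch of how such a semantic mapping model might be trained; the backbone model, mean pooling, margin value, and triplet construction are illustrative assumptions, not the paper's implementation:

```python
# Sketch: contrastively train a semantic mapping model so that
# semantic-preserving edits (paraphrases) stay close to the anchor
# while semantic-distorting edits are pushed apart.
# Assumptions: a MiniLM backbone and a triplet margin loss.
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

NAME = "sentence-transformers/all-MiniLM-L6-v2"  # hypothetical backbone choice
tok = AutoTokenizer.from_pretrained(NAME)
enc = AutoModel.from_pretrained(NAME)

def embed(texts):
    # Encode a batch of texts into L2-normalized semantic embeddings.
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    out = enc(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)  # (B, T, 1)
    emb = (out * mask).sum(1) / mask.sum(1)       # mean-pool over tokens
    return F.normalize(emb, dim=-1)

def contrastive_loss(anchors, paraphrases, distortions, margin=0.5):
    # paraphrases: semantic-preserving edits of the anchors.
    # distortions: semantic-altering edits (e.g., sentiment reversal).
    a, p, n = embed(anchors), embed(paraphrases), embed(distortions)
    d_pos = 1.0 - (a * p).sum(-1)  # cosine distance to paraphrase
    d_neg = 1.0 - (a * n).sum(-1)  # cosine distance to distortion
    # Pull paraphrases close; push distortions at least `margin` farther away.
    return F.relu(d_pos - d_neg + margin).mean()
```

The resulting embedding can then seed the green-red token partition, so paraphrases reproduce the same list while meaning-altering edits change it and break the spoofed watermark.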
Related papers
- SEAL: Semantic Aware Image Watermarking [26.606008778795193]
We propose a novel watermarking method that embeds semantic information about the generated image directly into the watermark.
The key pattern can be inferred from the semantic embedding of the image using locality-sensitive hashing.
Our results suggest that content-aware watermarks can mitigate risks arising from image-generative models.
arXiv Detail & Related papers (2025-03-15T15:29:05Z)
- Modification and Generated-Text Detection: Achieving Dual Detection Capabilities for the Outputs of LLM by Watermark [6.355836060419373]
One practical solution is to embed a watermark in the text, allowing ownership verification through watermark extraction.
Existing methods primarily focus on defending against modification attacks, often neglecting other spoofing attacks.
We propose a technique that detects modifications to text carrying an unbiased watermark, making the watermark sensitive to modification.
arXiv Detail & Related papers (2025-02-12T11:56:40Z)
- Your Semantic-Independent Watermark is Fragile: A Semantic Perturbation Attack against EaaS Watermark [5.2431999629987]
Various studies have proposed backdoor-based watermarking schemes to protect the copyright of EaaS services.
In this paper, we reveal that previous watermarking schemes possess semantic-independent characteristics and propose the Semantic Perturbation Attack (SPA).
Our theoretical and experimental analysis demonstrates that this semantic-independent nature makes current watermarking schemes vulnerable to adaptive attacks that exploit semantic perturbation tests to bypass watermark verification.
arXiv Detail & Related papers (2024-11-14T11:06:34Z)
- ESpeW: Robust Copyright Protection for LLM-based EaaS via Embedding-Specific Watermark [50.08021440235581]
Embeddings as a Service (EaaS) is emerging as a crucial component in AI applications.
EaaS is vulnerable to model extraction attacks, highlighting the urgent need for copyright protection.
We propose a novel embedding-specific watermarking (ESpeW) mechanism to offer robust copyright protection for EaaS.
arXiv Detail & Related papers (2024-10-23T04:34:49Z)
- On Evaluating The Performance of Watermarked Machine-Generated Texts Under Adversarial Attacks [20.972194348901958]
We first survey the mainstream watermarking schemes and removal attacks on machine-generated texts.
We evaluate eight watermarks (five pre-text, three post-text) and twelve attacks (two pre-text, ten post-text) across 87 scenarios.
Results indicate that KGW and Exponential watermarks offer high text quality and watermark retention but remain vulnerable to most attacks (a toy sketch of KGW-style detection follows this entry).
arXiv Detail & Related papers (2024-07-05T18:09:06Z)
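Since KGW-style green-red lists recur across these papers and underpin the main paper above, here is a minimal sketch of how such a watermark is commonly detected via a z-test on green-token counts; the keyed hash, gamma, and threshold are illustrative assumptions, not any specific paper's implementation:

```python
# Sketch of KGW-style green-red watermark detection.
# A keyed hash of the previous token seeds the green/red split;
# detection is a one-sided z-test on the green-token count.
import hashlib
import math

GAMMA = 0.25        # assumed fraction of the vocabulary marked "green"
SECRET_KEY = b"wm"  # watermark key shared by generator and detector

def is_green(prev_token: int, token: int) -> bool:
    # Seed the partition with the previous token, as in left-hash schemes.
    digest = hashlib.sha256(SECRET_KEY + prev_token.to_bytes(8, "big")
                            + token.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA

def detect(token_ids: list[int], z_threshold: float = 4.0) -> bool:
    # Count green tokens, then test against the null of unwatermarked text.
    n = len(token_ids) - 1
    if n <= 0:
        return False
    greens = sum(is_green(p, t) for p, t in zip(token_ids, token_ids[1:]))
    z = (greens - GAMMA * n) / math.sqrt(GAMMA * (1 - GAMMA) * n)
    return z > z_threshold
```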
- Large Language Model Watermark Stealing With Mixed Integer Programming [51.336009662771396]
Large Language Model (LLM) watermarking shows promise for addressing copyright concerns, monitoring AI-generated text, and preventing its misuse.
Recent research indicates that watermarking methods using numerous keys are susceptible to removal attacks.
We propose a novel green list stealing attack against the state-of-the-art LLM watermark scheme.
arXiv Detail & Related papers (2024-05-30T04:11:17Z)
- Adaptive Text Watermark for Large Language Models [8.100123266517299]
It is challenging to generate high-quality watermarked text while maintaining strong security, robustness, and the ability to detect watermarks without prior knowledge of the prompt or model.
This paper proposes an adaptive watermarking strategy to address this problem.
arXiv Detail & Related papers (2024-01-25T03:57:12Z)
- A Robust Semantics-based Watermark for Large Language Model against Paraphrasing [50.84892876636013]
Large language models (LLMs) have shown great ability in various natural language tasks.
There are concerns that LLMs may be used improperly or even illegally.
We propose SemaMark, a semantics-based watermark framework.
arXiv Detail & Related papers (2023-11-15T06:19:02Z)
- SemStamp: A Semantic Watermark with Paraphrastic Robustness for Text Generation [72.10931780019297]
Existing watermarking algorithms are vulnerable to paraphrase attacks because of their token-level design.
We propose SemStamp, a robust sentence-level semantic watermarking algorithm based on locality-sensitive hashing (LSH); a toy LSH sketch follows this entry.
Experimental results show that our novel semantic watermark algorithm is not only more robust than the previous state-of-the-art method against both common and bigram paraphrase attacks, but also better preserves generation quality.
arXiv Detail & Related papers (2023-10-06T03:33:42Z)
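As a rough illustration of the LSH idea behind sentence-level semantic watermarks like SemStamp: a random-hyperplane signature assigns nearby sentence embeddings to the same bucket, so paraphrases tend to reproduce the watermark seed. The dimension, signature width, and seeding below are hypothetical choices, not the paper's configuration.

```python
# Toy locality-sensitive hashing over sentence embeddings: paraphrases
# produce nearby embeddings, so most hyperplane signs agree and the
# signature (hence a derived watermark seed) usually stays the same.
import numpy as np

DIM, BITS, SEED = 384, 8, 0
rng = np.random.default_rng(SEED)
hyperplanes = rng.standard_normal((BITS, DIM))  # shared random hyperplanes

def lsh_signature(embedding: np.ndarray) -> int:
    # One bit per hyperplane: which side of the plane the embedding falls on.
    bits = (hyperplanes @ embedding) > 0
    return int("".join("1" if b else "0" for b in bits), 2)
```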
- On the Reliability of Watermarks for Large Language Models [95.87476978352659]
We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document.
We find that watermarks remain detectable even after human and machine paraphrasing.
We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document.
arXiv Detail & Related papers (2023-06-07T17:58:48Z)
- Tracing Text Provenance via Context-Aware Lexical Substitution [81.49359106648735]
We propose a natural language watermarking scheme based on context-aware lexical substitution.
Under both objective and subjective metrics, our watermarking scheme effectively preserves the semantic integrity of the original sentences.
arXiv Detail & Related papers (2021-12-15T04:27:33Z)