An Ensemble Framework for Unbiased Language Model Watermarking
- URL: http://arxiv.org/abs/2509.24043v1
- Date: Sun, 28 Sep 2025 19:37:44 GMT
- Title: An Ensemble Framework for Unbiased Language Model Watermarking
- Authors: Yihan Wu, Ruibo Chen, Georgios Milis, Heng Huang
- Abstract summary: We propose ENS, a novel ensemble framework that enhances the detectability and robustness of unbiased watermarks. ENS sequentially composes multiple independent watermark instances, each governed by a distinct key, to amplify the watermark signal. Empirical evaluations show that ENS substantially reduces the number of tokens needed for reliable detection and increases resistance to smoothing and paraphrasing attacks.
- Score: 60.99969104552168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models become increasingly capable and widely deployed, verifying the provenance of machine-generated content is critical to ensuring trust, safety, and accountability. Watermarking techniques have emerged as a promising solution by embedding imperceptible statistical signals into the generation process. Among them, unbiased watermarking is particularly attractive due to its theoretical guarantee of preserving the language model's output distribution, thereby avoiding degradation in fluency or detectability through distributional shifts. However, existing unbiased watermarking schemes often suffer from weak detection power and limited robustness, especially under short text lengths or distributional perturbations. In this work, we propose ENS, a novel ensemble framework that enhances the detectability and robustness of logits-based unbiased watermarks while strictly preserving their unbiasedness. ENS sequentially composes multiple independent watermark instances, each governed by a distinct key, to amplify the watermark signal. We theoretically prove that the ensemble construction remains unbiased in expectation and demonstrate how it improves the signal-to-noise ratio for statistical detectors. Empirical evaluations on multiple LLM families show that ENS substantially reduces the number of tokens needed for reliable detection and increases resistance to smoothing and paraphrasing attacks without compromising generation quality.
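The abstract describes ENS only at a high level: sequentially composing independent key-seeded unbiased watermark instances, each of which preserves the output distribution in expectation. The sketch below illustrates that composition idea with a DiPmark-style permutation reweight standing in for the per-instance watermark; this reweight is an assumption for illustration, not the paper's exact construction. Because each step is unbiased in expectation over its own key, the sequential composition remains unbiased over independent keys (by the tower property of expectation).

```python
import itertools
import numpy as np

def dip_reweight(p, perm, alpha=0.5):
    """One unbiased watermark instance (DiPmark-style stand-in, NOT the
    paper's exact scheme): order tokens by a key-seeded permutation,
    then push probability mass toward tokens late in that order.
    Averaged over a uniformly random permutation, the output equals p."""
    q = p[list(perm)]                      # probabilities in permuted order
    F = np.cumsum(q)                       # CDF in permuted order
    F_prev = F - q
    new_q = (np.maximum(F - alpha, 0) - np.maximum(F_prev - alpha, 0)
             + np.maximum(F - (1 - alpha), 0) - np.maximum(F_prev - (1 - alpha), 0))
    out = np.empty_like(p)
    out[list(perm)] = new_q                # map back to original token order
    return out

def ensemble_reweight(p, perms):
    """Sequentially compose independent watermark instances, one per key
    (the ensemble idea): the output of instance i feeds instance i+1."""
    for perm in perms:
        p = dip_reweight(p, perm)
    return p
```

On a toy 3-token vocabulary one can verify unbiasedness exactly: averaging the two-instance ensemble output over all pairs of permutations recovers the original distribution, while any single fixed key pair produces a strongly tilted distribution that a keyed detector can test for.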
Related papers
- MirrorMark: A Distortion-Free Multi-Bit Watermark for Large Language Models [5.735801967350819]
We propose MirrorMark, a distortion-free watermark for large language models (LLMs). MirrorMark embeds multi-bit messages without altering the token probability distribution, preserving text quality by design. Experiments show that MirrorMark matches the text quality of non-watermarked generation while achieving substantially stronger detectability.
arXiv Detail & Related papers (2026-01-29T19:10:48Z) - Analyzing and Evaluating Unbiased Language Model Watermark [62.982950935139534]
We introduce UWbench, the first open-source benchmark dedicated to the principled evaluation of unbiased watermarking methods. Our framework combines theoretical and empirical contributions. We establish a three-axis evaluation protocol: unbiasedness, detectability, and robustness, and show that token modification attacks provide more stable robustness assessments than paraphrasing-based methods.
arXiv Detail & Related papers (2025-09-28T19:46:01Z) - LLM Watermark Evasion via Bias Inversion [24.543675977310357]
We propose the Bias-Inversion Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during rewriting, without any knowledge of the underlying watermarking scheme.
arXiv Detail & Related papers (2025-09-27T00:24:57Z) - StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models [55.05404953041403]
We propose a novel framework that seamlessly integrates a binary watermark into the diffusion generation process. We show that StableGuard consistently outperforms state-of-the-art methods in image fidelity, watermark verification, and tampering localization.
arXiv Detail & Related papers (2025-09-22T16:35:19Z) - Character-Level Perturbations Disrupt LLM Watermarks [64.60090923837701]
We formalize the system model for Large Language Model (LLM) watermarking. We characterize two realistic threat models constrained on limited access to the watermark detector. We demonstrate character-level perturbations are significantly more effective for watermark removal under the most restrictive threat model. Experiments confirm the superiority of character-level perturbations and the effectiveness of the Genetic Algorithm (GA) in removing watermarks under realistic constraints.
arXiv Detail & Related papers (2025-09-11T02:50:07Z) - Watermarking Degrades Alignment in Language Models: Analysis and Mitigation [8.866121740748447]
This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect truthfulness, safety, and helpfulness. We propose an inference-time sampling method that uses an external reward model to restore alignment.
arXiv Detail & Related papers (2025-06-04T21:29:07Z) - Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation [58.85645136534301]
Existing watermarking schemes for sampled text often face trade-offs between maintaining text quality and ensuring robust detection against various attacks. We propose a novel watermarking scheme that improves both detectability and text quality by introducing a cumulative watermark entropy threshold.
arXiv Detail & Related papers (2025-04-16T14:16:38Z) - GaussMark: A Practical Approach for Structural Watermarking of Language Models [61.84270985214254]
GaussMark is a simple, efficient, and relatively robust scheme for watermarking large language models. We show that GaussMark is reliable, efficient, and relatively robust to corruptions such as insertions, deletions, substitutions, and roundtrip translations.
arXiv Detail & Related papers (2025-01-17T22:30:08Z) - Debiasing Watermarks for Large Language Models via Maximal Coupling [24.937491193018623]
We present a novel green/red list watermarking approach that partitions the token set into "green" and "red" lists, subtly increasing the generation probability for green tokens. Experimental results show that it outperforms prior techniques by preserving text quality while maintaining high detectability. This research provides a promising watermarking solution for language models, balancing effective detection with minimal impact on text quality.
arXiv Detail & Related papers (2024-11-17T23:36:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.