Analyzing and Evaluating Unbiased Language Model Watermark
- URL: http://arxiv.org/abs/2509.24048v1
- Date: Sun, 28 Sep 2025 19:46:01 GMT
- Title: Analyzing and Evaluating Unbiased Language Model Watermark
- Authors: Yihan Wu, Xuehao Cui, Ruibo Chen, Heng Huang
- Abstract summary: We introduce UWbench, the first open-source benchmark dedicated to the principled evaluation of unbiased watermarking methods. Our framework combines theoretical and empirical contributions. We establish a three-axis evaluation protocol: unbiasedness, detectability, and robustness, and show that token modification attacks provide more stable robustness assessments than paraphrasing-based methods.
- Score: 62.982950935139534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Verifying the authenticity of AI-generated text has become increasingly important with the rapid advancement of large language models, and unbiased watermarking has emerged as a promising approach due to its ability to preserve output distribution without degrading quality. However, recent work reveals that unbiased watermarks can accumulate distributional bias over multiple generations and that existing robustness evaluations are inconsistent across studies. To address these issues, we introduce UWbench, the first open-source benchmark dedicated to the principled evaluation of unbiased watermarking methods. Our framework combines theoretical and empirical contributions: we propose a statistical metric to quantify multi-batch distribution drift, prove an impossibility result showing that no unbiased watermark can perfectly preserve the distribution under infinite queries, and develop a formal analysis of robustness against token-level modification attacks. Complementing this theory, we establish a three-axis evaluation protocol: unbiasedness, detectability, and robustness, and show that token modification attacks provide more stable robustness assessments than paraphrasing-based methods. Together, UWbench offers the community a standardized and reproducible platform for advancing the design and evaluation of unbiased watermarking algorithms.
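As a rough illustration of the kind of token-level modification attack the abstract refers to, the sketch below replaces a fixed fraction of token ids with uniformly random different ids. This is only an assumption about what such an attack might look like; the function name `substitute_tokens` and its parameters are hypothetical and are not taken from UWbench.

```python
import random


def substitute_tokens(tokens, vocab_size, rate, seed=0):
    """Hypothetical random-substitution attack (not UWbench's code).

    Replaces round(rate * len(tokens)) positions with a uniformly random
    token id that differs from the original. Assumes every id lies in
    [0, vocab_size) and vocab_size >= 2.
    """
    rng = random.Random(seed)
    n_edit = int(round(rate * len(tokens)))
    # Choose distinct positions to modify.
    positions = rng.sample(range(len(tokens)), n_edit)
    attacked = list(tokens)
    for pos in positions:
        # Draw from vocab_size - 1 candidates, then skip over the original id
        # so the replacement is guaranteed to differ.
        new_tok = rng.randrange(vocab_size - 1)
        if new_tok >= attacked[pos]:
            new_tok += 1
        attacked[pos] = new_tok
    return attacked


if __name__ == "__main__":
    original = list(range(100))
    attacked = substitute_tokens(original, vocab_size=1000, rate=0.2, seed=1)
    changed = sum(a != b for a, b in zip(original, attacked))
    print(f"{changed} of {len(original)} tokens modified")
```

Because the edit rate is an explicit parameter, a robustness curve can be traced by sweeping `rate` and re-running a detector on each attacked sequence, which is one plausible reason such attacks give more repeatable measurements than paraphrasing.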
Related papers
- Towards Anytime-Valid Statistical Watermarking [63.02116925616554]
We develop the first e-value-based watermarking framework, Anchored E-Watermarking, that unifies optimal sampling with anytime-valid inference. Our framework can significantly enhance sample efficiency, reducing the average token budget required for detection by 13-15% relative to state-of-the-art baselines.
arXiv Detail & Related papers (2026-02-19T18:32:26Z)
- DWBench: Holistic Evaluation of Watermark for Dataset Copyright Auditing [43.881484429055654]
Dataset watermarking holds promise for auditing and verifying usage. We develop DWBench, a unified benchmark and open-source toolkit for systematically evaluating image dataset watermark techniques. We present the results of two new metrics: sample significance for fine-grained watermark distinguishability and verification success rate for dataset-level auditing.
arXiv Detail & Related papers (2026-02-14T01:09:19Z)
- More Haste, Less Speed: Weaker Single-Layer Watermark Improves Distortion-Free Watermark Ensembles [58.941305935872265]
We show that strong watermarks significantly reduce the entropy of the token distribution. We propose a framework that utilizes weaker single-layer watermarks to preserve the entropy required for effective multi-layer ensembling.
arXiv Detail & Related papers (2026-02-12T10:18:16Z)
- An Ensemble Framework for Unbiased Language Model Watermarking [60.99969104552168]
We propose ENS, a novel ensemble framework that enhances the detectability and robustness of unbiased watermarks. ENS sequentially composes multiple independent watermark instances, each governed by a distinct key, to amplify the watermark signal. Empirical evaluations show that ENS substantially reduces the number of tokens needed for reliable detection and increases resistance to smoothing and paraphrasing attacks.
arXiv Detail & Related papers (2025-09-28T19:37:44Z)
- CEFW: A Comprehensive Evaluation Framework for Watermark in Large Language Models [12.565502899825724]
We propose a unified framework that comprehensively evaluates watermarking methods across five key dimensions. These include ease of detection, fidelity of text quality, minimal embedding cost, robustness to adversarial attacks, and imperceptibility to prevent imitation or forgery. We introduce Balanced Watermark (BW), which guarantees robustness and imperceptibility through balancing the way watermark information is added.
arXiv Detail & Related papers (2025-03-24T13:50:32Z)
- Improved Unbiased Watermark for Large Language Models [59.00698153097887]
We introduce MCmark, a family of unbiased, Multi-Channel-based watermarks. MCmark preserves the original distribution of the language model. It offers significant improvements in detectability and robustness over existing unbiased watermarks.
arXiv Detail & Related papers (2025-02-16T21:02:36Z)
- Debiasing Watermarks for Large Language Models via Maximal Coupling [24.937491193018623]
We present a novel green/red list watermarking approach that partitions the token set into "green" and "red" lists, subtly increasing the generation probability for green tokens. Experimental results show that it outperforms prior techniques by preserving text quality while maintaining high detectability. This research provides a promising watermarking solution for language models, balancing effective detection with minimal impact on text quality.
arXiv Detail & Related papers (2024-11-17T23:36:37Z)
- Theoretically Grounded Framework for LLM Watermarking: A Distribution-Adaptive Approach [53.32564762183639]
We introduce a novel, unified theoretical framework for watermarking Large Language Models (LLMs). Our approach aims to maximize detection performance while maintaining control over the worst-case false positive rate (FPR) and distortion on text quality. We propose a distortion-free, distribution-adaptive watermarking algorithm (DAWA) that leverages a surrogate model for model-agnosticism and efficiency.
arXiv Detail & Related papers (2024-10-03T18:28:10Z)
- Watermarking Language Models with Error Correcting Codes [39.77377710480125]
We propose a watermarking framework that encodes statistical signals through an error correcting code. Our method, termed robust binary code (RBC) watermark, introduces no noticeable degradation in quality. Our empirical findings suggest our watermark is fast, powerful, and robust, comparing favorably to the state-of-the-art.
arXiv Detail & Related papers (2024-06-12T05:13:09Z)
- RobustBench: a standardized adversarial robustness benchmark [84.50044645539305]
A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to robustness overestimation. We evaluate adversarial robustness with AutoAttack, an ensemble of white- and black-box attacks. We analyze the impact of robustness on the performance on distribution shifts, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.
arXiv Detail & Related papers (2020-10-19T17:06:18Z)