Improving Detection of Watermarked Language Models
- URL: http://arxiv.org/abs/2508.13131v1
- Date: Mon, 18 Aug 2025 17:43:06 GMT
- Title: Improving Detection of Watermarked Language Models
- Authors: Dara Bahri, John Wieting
- Abstract summary: We investigate whether detection can be improved by combining watermark detectors with non-watermark ones.
- Score: 31.772364827073808
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Watermarking has recently emerged as an effective strategy for detecting the generations of large language models (LLMs). The strength of a watermark typically depends strongly on the entropy afforded by the language model and the set of input prompts. However, entropy can be quite limited in practice, especially for models that are post-trained, for example via instruction tuning or reinforcement learning from human feedback (RLHF), which makes detection based on watermarking alone challenging. In this work, we investigate whether detection can be improved by combining watermark detectors with non-watermark ones. We explore a number of hybrid schemes that combine the two, observing performance gains over either class of detector under a wide range of experimental conditions.
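The abstract does not spell out the hybrid schemes it studies. As one hedged illustration of how a watermark detector and a non-watermark detector could be combined, the sketch below pools their per-text p-values with Fisher's method; the function name and the choice of Fisher's method are assumptions for illustration, not details from the paper.

```python
import math

def fisher_combine(p_watermark: float, p_nonwatermark: float) -> float:
    """Pool two independent detector p-values with Fisher's method.

    Under the joint null hypothesis (human-written text), the statistic
    x = -2 * (ln p1 + ln p2) follows a chi-square distribution with
    4 degrees of freedom, whose survival function has the closed form
    exp(-x/2) * (1 + x/2), so no external stats library is required.
    """
    x = -2.0 * (math.log(p_watermark) + math.log(p_nonwatermark))
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)

# Two individually weak signals can yield a stronger combined one:
# fisher_combine(0.04, 0.04) is about 0.012, below either input.
```

Flagging a text when the combined p-value falls below a fixed threshold controls the false-positive rate while letting either detector contribute evidence, which matches the paper's motivation for hybrids in low-entropy settings.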
Related papers
- How Good is Post-Hoc Watermarking With Language Model Rephrasing? [43.5649433230903]
Generation-time text watermarking embeds statistical signals into text for traceability of AI-generated content. We explore post-hoc watermarking, where an LLM rewrites existing text while applying generation-time watermarking. Our strategies achieve strong detectability and semantic fidelity on open-ended text such as books.
arXiv Detail & Related papers (2025-12-18T18:57:33Z)
- Watermarks for Language Models via Probabilistic Automata [54.687037560547765]
We introduce a new class of watermarking schemes constructed through probabilistic automata. We present two instantiations: (i) a practical scheme with exponential generation diversity and computational efficiency, and (ii) a theoretical construction with formal undetectability guarantees under cryptographic assumptions.
arXiv Detail & Related papers (2025-12-11T00:49:06Z)
- WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models [17.137667672391725]
WaterSearch is a sentence-level, search-based watermarking framework. It enhances text quality by jointly optimizing two key aspects: 1) distribution fidelity and 2) watermark signal characteristics. The method achieves an average performance improvement of 51.01% over state-of-the-art baselines.
arXiv Detail & Related papers (2025-11-30T11:11:21Z)
- Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation [58.85645136534301]
Existing watermarking schemes for sampled text often face trade-offs between maintaining text quality and ensuring robust detection against various attacks. We propose a novel watermarking scheme that improves both detectability and text quality by introducing a cumulative watermark entropy threshold.
arXiv Detail & Related papers (2025-04-16T14:16:38Z)
- Duwak: Dual Watermarks in Large Language Models [49.00264962860555]
We propose Duwak, which enhances the efficiency and quality of watermarking by embedding dual secret patterns in both the token probability distribution and the sampling scheme.
We evaluate Duwak extensively on Llama2, against four state-of-the-art watermarking techniques and combinations of them.
arXiv Detail & Related papers (2024-03-12T16:25:38Z)
- Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models [31.062753031312006]
Large language models generate high-quality responses that can nonetheless contain misinformation. Watermarking, which involves embedding hidden markers in text, is pivotal in this context.
We introduce a novel multi-objective optimization (MOO) approach for watermarking.
Our method simultaneously achieves detectability and semantic integrity.
arXiv Detail & Related papers (2024-02-28T05:43:22Z)
- On the Learnability of Watermarks for Language Models [80.97358663708592]
We ask whether language models can directly learn to generate watermarked text.
We propose watermark distillation, which trains a student model to behave like a teacher model.
We find that models can learn to generate watermarked text with high detectability.
arXiv Detail & Related papers (2023-12-07T17:41:44Z)
- An Unforgeable Publicly Verifiable Watermark for Large Language Models [84.2805275589553]
Current watermark detection algorithms require the secret key used in the watermark generation process, making them susceptible to security breaches and counterfeiting during public detection.
We propose an unforgeable publicly verifiable watermark algorithm named UPV that uses two different neural networks for watermark generation and detection, instead of using the same key at both stages.
arXiv Detail & Related papers (2023-07-30T13:43:27Z)
- A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters.
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
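The "efficient open-source algorithm" mentioned in the last entry is a one-proportion z-test on the count of "green-list" tokens in a text. A minimal sketch of that test statistic, assuming a green-list fraction gamma (variable names are ours, not the paper's):

```python
import math

def green_list_z(num_green: int, num_tokens: int, gamma: float = 0.5) -> float:
    """z-statistic for green-list watermark detection.

    Under the null hypothesis (unwatermarked text), each token lands in
    its pseudorandom green list independently with probability gamma, so
    the green count is Binomial(num_tokens, gamma). A large positive z
    is evidence of a watermark.
    """
    expected = gamma * num_tokens
    std = math.sqrt(num_tokens * gamma * (1.0 - gamma))
    return (num_green - expected) / std

# e.g., 140 green tokens out of 200 at gamma = 0.5 gives z of about 5.66,
# well beyond a typical detection threshold such as z = 4.
```

Because the test needs only token identities and the watermark key, it runs without access to the language model's API or parameters, which is what makes the detector "open-source" in the sense the entry describes.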
This list is automatically generated from the titles and abstracts of the papers on this site. No guarantee is made of the quality or accuracy of this information, and the site is not responsible for any consequences of its use.