SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling
- URL: http://arxiv.org/abs/2508.08211v1
- Date: Mon, 11 Aug 2025 17:33:18 GMT
- Title: SAEMark: Multi-bit LLM Watermarking with Inference-Time Scaling
- Authors: Zhuohao Yu, Xingru Jiang, Weizheng Gu, Yidong Wang, Shikun Zhang, Wei Ye
- Abstract summary: SAEMark is a general framework for post-hoc multi-bit watermarking. It embeds personalized messages solely via inference-time, feature-based rejection sampling. Experiments show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy.
- Score: 24.603169307967338
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality and require white-box model access with logit manipulation, which excludes API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling, without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. The framework naturally generalizes across languages and domains while preserving text quality, since it samples LLM outputs rather than modifying them. We provide theoretical guarantees relating watermark success probability to compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework's effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out of the box with closed-source LLMs while enabling content attribution.
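The selection mechanism the abstract describes can be sketched as feature-based rejection sampling. The sketch below is illustrative only: it substitutes a keyed hash for the Sparse Autoencoder features the paper actually uses, and all function names (`feature_stat`, `key_target`, `watermark_by_rejection`) are hypothetical stand-ins, not the authors' API.

```python
import hashlib

def feature_stat(text: str, n_buckets: int = 16) -> int:
    # Hypothetical deterministic feature: hash the text into one of n_buckets.
    # (The paper extracts SAE activation statistics; this stand-in only
    # illustrates the selection mechanism, not the real feature extractor.)
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return digest[0] % n_buckets

def key_target(key: str, message_bit: int, n_buckets: int = 16) -> int:
    # Derive the target bucket from the secret key and the message bit.
    digest = hashlib.sha256(f"{key}:{message_bit}".encode("utf-8")).digest()
    return digest[0] % n_buckets

def watermark_by_rejection(candidates, key, message_bit, n_buckets=16):
    # Post-hoc selection: sample candidate completions from any LLM (no
    # logit access needed) and keep the first one whose feature statistic
    # matches the key-derived target.
    target = key_target(key, message_bit, n_buckets)
    for text in candidates:
        if feature_stat(text, n_buckets) == target:
            return text
    return None  # compute budget exhausted; caller samples more candidates
```

Because selection happens over complete, unmodified model outputs, text quality is whatever the underlying LLM produces; the compute budget (number of candidates) trades off against the probability that some candidate hits the target bucket.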
Related papers
- MirrorMark: A Distortion-Free Multi-Bit Watermark for Large Language Models [5.735801967350819]
We propose MirrorMark, a distortion-free watermark for large language models (LLMs). MirrorMark embeds multi-bit messages without altering the token probability distribution, preserving text quality by design. Experiments show that MirrorMark matches the text quality of non-watermarked generation while achieving substantially stronger detectability.
arXiv Detail & Related papers (2026-01-29T19:10:48Z) - Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking [51.417096446156926]
We introduce a new task: detecting post-generation edits locally made to watermarked LLM outputs. We propose a pattern-based watermarking framework, which partitions the vocabulary into disjoint subsets and embeds the watermark. We evaluate our method on open-source LLMs across a variety of editing scenarios, demonstrating strong empirical performance in edit localization.
arXiv Detail & Related papers (2025-10-02T03:33:12Z) - RepreGuard: Detecting LLM-Generated Text by Revealing Hidden Representation Patterns [50.401907401444404]
Detecting text generated by large language models (LLMs) is crucial for preventing misuse and building trustworthy AI systems. We propose RepreGuard, an efficient statistics-based detection method. Experimental results show that RepreGuard outperforms all baselines with an average 94.92% AUROC in both in-distribution (ID) and OOD scenarios.
arXiv Detail & Related papers (2025-08-18T17:59:15Z) - StealthInk: A Multi-bit and Stealthy Watermark for Large Language Models [4.76514657698929]
StealthInk is a stealthy multi-bit watermarking scheme for large language models (LLMs). It preserves the original text distribution while enabling the embedding of provenance data. We derive a lower bound on the number of tokens necessary for watermark detection at a fixed equal error rate.
arXiv Detail & Related papers (2025-06-05T18:37:38Z) - Optimized Couplings for Watermarking Large Language Models [8.585779208433465]
Large language models (LLMs) are now able to produce text that is, in many cases, seemingly indistinguishable from human-generated content. This paper provides an analysis of text watermarking in a one-shot setting.
arXiv Detail & Related papers (2025-05-13T18:08:12Z) - Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation [58.85645136534301]
Existing watermarking schemes for sampled text often face trade-offs between maintaining text quality and ensuring robust detection against various attacks. We propose a novel watermarking scheme that improves both detectability and text quality by introducing a cumulative watermark entropy threshold.
arXiv Detail & Related papers (2025-04-16T14:16:38Z) - Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs). We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLMs as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z) - SimMark: A Robust Sentence-Level Similarity-Based Watermarking Algorithm for Large Language Models [1.7188280334580197]
SimMark is a post-hoc watermarking algorithm that makes large language models' outputs traceable without requiring access to the model's internal logits. Experimental results demonstrate that SimMark sets a new benchmark for robust watermarking of LLM-generated content.
arXiv Detail & Related papers (2025-02-05T00:21:01Z) - DERMARK: A Dynamic, Efficient and Robust Multi-bit Watermark for Large Language Models [18.023143082876015]
We propose a dynamic, efficient, and robust multi-bit watermarking method that divides the text into variable-length segments for each watermark bit. Our method reduces the number of tokens required per embedded bit by 25%, reduces watermark embedding time by 50%, and maintains high robustness against text modifications and watermark erasure attacks.
arXiv Detail & Related papers (2025-02-04T11:23:49Z) - Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore [51.65730053591696]
We propose a simple yet effective black-box zero-shot detection approach based on the observation that human-written texts typically contain more grammatical errors than LLM-generated texts. Experimental results show that our method outperforms current state-of-the-art (SOTA) zero-shot and supervised methods.
arXiv Detail & Related papers (2024-05-07T12:57:01Z) - Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models [31.062753031312006]
Large language models generate high-quality responses but can also produce misinformation.
Watermarking is pivotal in this context, which involves embedding hidden markers in texts.
We introduce a novel multi-objective optimization (MOO) approach for watermarking.
Our method simultaneously achieves detectability and semantic integrity.
arXiv Detail & Related papers (2024-02-28T05:43:22Z) - Towards Codable Watermarking for Injecting Multi-bits Information to LLMs [86.86436777626959]
Large language models (LLMs) generate texts with increasing fluency and realism.
Existing watermarking methods are encoding-inefficient and cannot flexibly meet the diverse information encoding needs.
We propose Codable Text Watermarking for LLMs (CTWL) that allows text watermarks to carry multi-bit customizable information.
arXiv Detail & Related papers (2023-07-29T14:11:15Z)
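Several of the schemes listed above (for example, the pattern-based framework for detecting post-generation edits) detect watermarks by partitioning the vocabulary into key-derived subsets and counting how many generated tokens land in the favored subset. The sketch below illustrates that detection statistic with a keyed hash standing in for the actual partition; the function names and the specific hash-based partition are hypothetical, not drawn from any one paper.

```python
import hashlib
import math

def in_green_list(token: str, key: str, gamma: float = 0.5) -> bool:
    # Key-derived pseudorandom partition of the vocabulary: a token is
    # "green" if its keyed hash falls below the gamma quantile.
    digest = hashlib.sha256(f"{key}:{token}".encode("utf-8")).digest()
    return digest[0] / 256 < gamma

def detection_z_score(tokens, key, gamma=0.5):
    # Count green tokens and compare against the gamma * n expected under
    # the null hypothesis that the text is unwatermarked; a large z-score
    # indicates the text was steered toward the green subset.
    n = len(tokens)
    greens = sum(in_green_list(t, key, gamma) for t in tokens)
    return (greens - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

Watermarked generation biases token choices toward the green subset, so the green count exceeds its null expectation and the z-score grows with text length, which is why several of the papers above relate detection power to the number of observed tokens.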
This list is automatically generated from the titles and abstracts of the papers on this site.