Watermarking Language Models through Language Models
- URL: http://arxiv.org/abs/2411.05091v2
- Date: Fri, 20 Jun 2025 16:24:13 GMT
- Title: Watermarking Language Models through Language Models
- Authors: Agnibh Dasgupta, Abdullah Tanvir, Xin Zhong
- Abstract summary: We introduce a prompt-guided watermarking framework that operates entirely at the input level and requires no access to model parameters or decoding logits. We evaluate the framework over 25 combinations of Prompting and Marking LMs, such as GPT-4o, Mistral, LLaMA3, and DeepSeek. Experimental results show that watermark signals generalize across architectures and remain robust under fine-tuning, model distillation, and prompt-based adversarial attacks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Watermarking the outputs of large language models (LLMs) is critical for provenance tracing, content regulation, and model accountability. Existing approaches often rely on access to model internals or are constrained by static rules and token-level perturbations. Moreover, the idea of steering generative behavior via prompt-based instruction control remains largely underexplored. We introduce a prompt-guided watermarking framework that operates entirely at the input level and requires no access to model parameters or decoding logits. The framework comprises three cooperating components: a Prompting LM that synthesizes watermarking instructions from user prompts, a Marking LM that generates watermarked outputs conditioned on these instructions, and a Detecting LM trained to classify whether a response carries an embedded watermark. This modular design enables dynamic watermarking that adapts to individual prompts while remaining compatible with diverse LLM architectures, including both proprietary and open-weight models. We evaluate the framework over 25 combinations of Prompting and Marking LMs, such as GPT-4o, Mistral, LLaMA3, and DeepSeek. Experimental results show that watermark signals generalize across architectures and remain robust under fine-tuning, model distillation, and prompt-based adversarial attacks, demonstrating the effectiveness and robustness of the proposed approach.
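The three cooperating components described in the abstract can be sketched as a toy pipeline. Everything below is an illustrative stand-in: the real framework uses separate LLMs (the abstract names GPT-4o, Mistral, LLaMA3, and DeepSeek), and its Detecting LM is a trained classifier, not the keyword heuristic used here.

```python
def prompting_lm(user_prompt: str) -> str:
    """Synthesize a watermarking instruction from the user prompt (stub)."""
    return ("Answer the prompt, and where it reads naturally, prefer the "
            "connectives 'moreover' and 'consequently'.")

def marking_lm(user_prompt: str, instruction: str) -> str:
    """Generate a watermarked response conditioned on the instruction (stub)."""
    return ("Photosynthesis converts light to chemical energy; moreover, it "
            "fixes carbon, and consequently it sustains most food webs.")

def detecting_lm(response: str) -> bool:
    """Classify whether a response carries the embedded watermark.
    Stub: a real Detecting LM is a trained classifier, not a keyword count."""
    markers = ("moreover", "consequently")
    return sum(m in response.lower() for m in markers) >= 2

instruction = prompting_lm("Explain photosynthesis.")
response = marking_lm("Explain photosynthesis.", instruction)
print(detecting_lm(response))  # prints True: both marker words appear
```

The point of the modular split is visible even in this sketch: the watermark is carried entirely by the instruction text, so any instruction-following model can serve as the Marking LM without access to its parameters or logits.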
Related papers
- WMVLM: Evaluating Diffusion Model Image Watermarking via Vision-Language Models [79.32764976020435]
Digital watermarking is essential for securing generated images from diffusion models. Previous watermark evaluation methods lack a unified framework for both residual and semantic watermarks. We propose WMVLM, the first unified and interpretable evaluation framework for diffusion model image watermarking via vision-language models.
arXiv Detail & Related papers (2026-01-29T12:14:32Z) - SEAL: Subspace-Anchored Watermarks for LLM Ownership [12.022506016268112]
We propose SEAL, a subspace-anchored watermarking framework for large language models. SEAL embeds multi-bit signatures directly into the model's latent representational space, supporting both white-box and black-box verification scenarios. We conduct comprehensive experiments on multiple benchmark datasets and six prominent LLMs to demonstrate SEAL's superior effectiveness, fidelity, efficiency, and robustness.
arXiv Detail & Related papers (2025-11-14T14:44:11Z) - PRO: Enabling Precise and Robust Text Watermark for Open-Source LLMs [33.70483974998233]
We propose PRO, a Precise and Robust text watermarking method for open-source models. PRO substantially improves both watermark detectability and resilience to model modifications.
arXiv Detail & Related papers (2025-10-27T22:00:49Z) - Position: LLM Watermarking Should Align Stakeholders' Incentives for Practical Adoption [94.887133335656]
We revisit three classes of watermarking through this lens. LLM text watermarking offers modest provider benefit when framed solely as an anti-misuse tool. In-context watermarking (ICW) is tailored for trusted parties, such as conference organizers or educators.
arXiv Detail & Related papers (2025-10-21T06:34:51Z) - Detecting Post-generation Edits to Watermarked LLM Outputs via Combinatorial Watermarking [51.417096446156926]
We introduce a new task: detecting post-generation edits locally made to watermarked LLM outputs. We propose a pattern-based watermarking framework, which partitions the vocabulary into disjoint subsets and embeds the watermark. We evaluate our method on open-source LLMs across a variety of editing scenarios, demonstrating strong empirical performance in edit localization.
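The partition-and-embed idea in this summary can be illustrated with a minimal sketch. The subset assignment, the per-position pattern, and the windowed localization test below are all assumptions for illustration; the paper's actual construction operates on the model's token vocabulary during decoding.

```python
import hashlib

K = 4  # number of disjoint vocabulary subsets (an illustrative choice)

def subset_of(word: str) -> int:
    """Deterministically assign each word to one of K disjoint subsets."""
    return hashlib.sha256(word.lower().encode()).digest()[0] % K

def expected_subset(position: int) -> int:
    """Hypothetical per-position pattern a watermarked generator would follow."""
    return position % K

def localize_edits(words, window=4, threshold=0.5):
    """Flag window start indices whose words disagree with the pattern --
    a crude stand-in for the paper's edit-localization test."""
    flagged = []
    for start in range(len(words) - window + 1):
        hits = sum(subset_of(w) == expected_subset(start + i)
                   for i, w in enumerate(words[start:start + window]))
        if hits / window < threshold:
            flagged.append(start)
    return flagged
```

An untouched watermarked passage yields no flagged windows, while a locally edited span drops below the threshold only around the edit, which is what makes localization (rather than whole-text detection) possible.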
arXiv Detail & Related papers (2025-10-02T03:33:12Z) - In-Context Watermarks for Large Language Models [71.29952527565749]
In-Context Watermarking (ICW) embeds watermarks into generated text solely through prompt engineering. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach.
arXiv Detail & Related papers (2025-05-22T17:24:51Z) - Improved Unbiased Watermark for Large Language Models [59.00698153097887]
We introduce MCmark, a family of unbiased, Multi-Channel-based watermarks.
MCmark preserves the original distribution of the language model.
It offers significant improvements in detectability and robustness over existing unbiased watermarks.
arXiv Detail & Related papers (2025-02-16T21:02:36Z) - Towards Copyright Protection for Knowledge Bases of Retrieval-augmented Language Models via Reasoning [58.57194301645823]
Large language models (LLMs) are increasingly integrated into real-world personalized applications. The valuable and often proprietary nature of the knowledge bases used in RAG introduces the risk of unauthorized usage by adversaries. Existing methods that can be generalized as watermarking techniques to protect these knowledge bases typically involve poisoning or backdoor attacks. We propose a harmless copyright protection method for knowledge bases.
arXiv Detail & Related papers (2025-02-10T09:15:56Z) - SimMark: A Robust Sentence-Level Similarity-Based Watermarking Algorithm for Large Language Models [1.7188280334580197]
SimMark is a post-hoc watermarking algorithm that makes large language models' outputs traceable without requiring access to the model's internal logits. Experimental results demonstrate that SimMark sets a new benchmark for robust watermarking of LLM-generated content.
arXiv Detail & Related papers (2025-02-05T00:21:01Z) - MarkLLM: An Open-Source Toolkit for LLM Watermarking [80.00466284110269]
MarkLLM is an open-source toolkit for implementing LLM watermarking algorithms.
For evaluation, MarkLLM offers a comprehensive suite of 12 tools spanning three perspectives, along with two types of automated evaluation pipelines.
arXiv Detail & Related papers (2024-05-16T12:40:01Z) - ModelShield: Adaptive and Robust Watermark against Model Extraction Attack [58.46326901858431]
Large language models (LLMs) demonstrate general intelligence across a variety of machine learning tasks. Adversaries can still utilize model extraction attacks to steal the model intelligence encoded in model generation. Watermarking technology offers a promising solution for defending against such attacks by embedding unique identifiers into the model-generated content.
arXiv Detail & Related papers (2024-05-03T06:41:48Z) - Topic-Based Watermarks for Large Language Models [46.71493672772134]
We propose a lightweight, topic-guided watermarking scheme for Large Language Model (LLM) output. Our method achieves comparable perplexity to industry-leading systems, including Google's SynthID-Text.
arXiv Detail & Related papers (2024-04-02T17:49:40Z) - TokenMark: A Modality-Agnostic Watermark for Pre-trained Transformers [67.57928750537185]
TokenMark is a robust, modality-agnostic watermarking system for pre-trained models. It embeds the watermark by fine-tuning the pre-trained model on a set of specifically permuted data samples. It significantly improves the robustness, efficiency, and universality of model watermarking.
arXiv Detail & Related papers (2024-03-09T08:54:52Z) - On the Learnability of Watermarks for Language Models [80.97358663708592]
We ask whether language models can directly learn to generate watermarked text.
We propose watermark distillation, which trains a student model to behave like a teacher model.
We find that models can learn to generate watermarked text with high detectability.
arXiv Detail & Related papers (2023-12-07T17:41:44Z) - Revisiting Topic-Guided Language Models [20.21486464604549]
We study four topic-guided language models and two baselines, evaluating the held-out predictive performance of each model on four corpora.
We find that none of these methods outperform a standard LSTM language model baseline, and most fail to learn good topics.
arXiv Detail & Related papers (2023-12-04T20:33:24Z) - Multilingual and Multi-topical Benchmark of Fine-tuned Language models and Large Language Models for Check-Worthy Claim Detection [1.4779899760345434]
This study compares the performance of (1) fine-tuned language models and (2) large language models on the task of check-worthy claim detection.
We composed a multilingual and multi-topical dataset comprising texts of various sources and styles.
arXiv Detail & Related papers (2023-11-10T15:36:35Z) - ClearMark: Intuitive and Robust Model Watermarking via Transposed Model Training [50.77001916246691]
This paper introduces ClearMark, the first DNN watermarking method designed for intuitive human assessment.
ClearMark embeds visible watermarks, enabling human decision-making without rigid value thresholds.
It shows an 8,544-bit watermark capacity comparable to the strongest existing work.
arXiv Detail & Related papers (2023-10-25T08:16:55Z) - REMARK-LLM: A Robust and Efficient Watermarking Framework for Generative Large Language Models [16.243415709584077]
We present REMARK-LLM, a novel, efficient, and robust watermarking framework for large language models (LLMs).
REMARK-LLM is rigorously trained to encourage the preservation of semantic integrity in watermarked content.
It exhibits better resilience against a spectrum of watermark detection and removal attacks.
arXiv Detail & Related papers (2023-10-18T22:14:37Z) - Watermarking LLMs with Weight Quantization [61.63899115699713]
This paper proposes a novel watermarking strategy that plants watermarks in the quantization process of large language models.
We successfully plant the watermark into open-source large language model weights including GPT-Neo and LLaMA.
arXiv Detail & Related papers (2023-10-17T13:06:59Z) - A Resilient and Accessible Distribution-Preserving Watermark for Large Language Models [65.40460716619772]
Our research focuses on the importance of a Distribution-Preserving (DiP) watermark.
Contrary to the current strategies, our proposed DiPmark simultaneously preserves the original token distribution during watermarking.
It is detectable without access to the language model API and prompts (accessible), and is provably robust to moderate changes of tokens.
arXiv Detail & Related papers (2023-10-11T17:57:35Z) - Multilingual Conceptual Coverage in Text-to-Image Models [98.80343331645626]
"Conceptual Coverage Across Languages" (CoCo-CroLa) is a technique for benchmarking the degree to which any generative text-to-image system provides multilingual parity to its training language in terms of tangible nouns.
For each model we can assess "conceptual coverage" of a given target language relative to a source language by comparing the population of images generated for a series of tangible nouns in the source language to the population of images generated for each noun under translation in the target language.
arXiv Detail & Related papers (2023-06-02T17:59:09Z) - A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters.
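The detection side of this style of watermark is commonly described as a one-proportion z-test on the count of "green-list" tokens. The sketch below is an assumption-laden illustration of that test: the string-hash green list seeded by the previous token and the choice GAMMA = 0.5 stand in for the paper's pseudo-random partition of the model's actual token vocabulary.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of the vocabulary placed on the green list

def is_green(prev_token: str, token: str) -> bool:
    """Toy green-list membership seeded by the previous token; a real scheme
    hashes the previous token to pseudo-randomly split the model vocabulary."""
    h = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return h[0] < 256 * GAMMA

def z_score(tokens) -> float:
    """One-proportion z-test on the green-token count:
    z = (n_green - gamma * T) / sqrt(T * gamma * (1 - gamma))."""
    T = len(tokens) - 1  # number of scored (previous, current) pairs
    n_green = sum(is_green(tokens[i], tokens[i + 1]) for i in range(T))
    return (n_green - GAMMA * T) / math.sqrt(T * GAMMA * (1 - GAMMA))
```

A large z rejects the null hypothesis that the text was written without knowledge of the green lists, which is why detection needs only the hash scheme, not the language model's API or parameters.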
arXiv Detail & Related papers (2023-01-24T18:52:59Z) - Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition [80.446770909975]
Linguistic knowledge is of great benefit to scene text recognition.
How to effectively model linguistic rules in end-to-end deep networks remains a research challenge.
We propose an autonomous, bidirectional and iterative ABINet for scene text recognition.
arXiv Detail & Related papers (2021-03-11T06:47:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.