Can you Finetune your Binoculars? Embedding Text Watermarks into the Weights of Large Language Models
- URL: http://arxiv.org/abs/2504.06446v1
- Date: Tue, 08 Apr 2025 21:34:02 GMT
- Title: Can you Finetune your Binoculars? Embedding Text Watermarks into the Weights of Large Language Models
- Authors: Fay Elhassan, Niccolò Ajroldi, Antonio Orvieto, Jonas Geiping
- Abstract summary: The indistinguishability of AI-generated content from human text raises challenges in transparency and accountability. We propose a strategy to finetune a pair of low-rank adapters of a model, one serving as the text-generating model and the other as the detector. In this way, the watermarking strategy is fully learned end-to-end.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The indistinguishability of AI-generated content from human text raises challenges in transparency and accountability. While several methods exist to watermark models behind APIs, embedding watermark strategies directly into model weights that are later reflected in the outputs of the model is challenging. In this study, we propose a strategy to finetune a pair of low-rank adapters of a model, one serving as the text-generating model and the other as the detector, so that a subtle watermark is embedded into the text generated by the first model and simultaneously optimized for detectability by the second. In this way, the watermarking strategy is fully learned end-to-end. This process poses an optimization challenge, as balancing watermark robustness, naturalness, and task performance requires trade-offs. We discuss strategies for optimizing this min-max objective and present results showing the effect of this modification on instruction finetuning.
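To make the paired-adapter setup concrete, below is a minimal, self-contained PyTorch sketch of the joint objective, not the authors' code: a frozen base layer carries two LoRA deltas, one acting as the generator and one as the detector, and a single loss trades off task performance against a Binoculars-style detection margin. The toy model, the placeholder batches, and the weight `lam` are assumptions; in particular, backpropagating the detection term into the generator through its sampling step, the crux of the min-max problem, is elided.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update x -> B(A x)."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False              # the base model stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x):
        return self.base(x) + F.linear(F.linear(x, self.A), self.B)

vocab, dim = 1000, 64
embed = nn.Embedding(vocab, dim)                 # toy stand-in for the LM trunk
embed.weight.requires_grad = False
base_head = nn.Linear(dim, vocab)
gen_head = LoRALinear(base_head)                 # adapter 1: watermarked generator
det_head = LoRALinear(base_head)                 # adapter 2: detector

def lm_loss(head, tokens):
    """Next-token cross-entropy under a given head (the task-performance term)."""
    logits = head(embed(tokens[:, :-1]))
    return F.cross_entropy(logits.reshape(-1, vocab), tokens[:, 1:].reshape(-1))

def detector_score(tokens):
    """Binoculars-style score: detector vs. base log-likelihood of a text."""
    return lm_loss(base_head, tokens) - lm_loss(det_head, tokens)

lam = 0.1                                        # assumed detectability/quality trade-off
trainable = [p for p in [*gen_head.parameters(), *det_head.parameters()]
             if p.requires_grad]                 # only the LoRA factors A and B
opt = torch.optim.Adam(trainable, lr=1e-4)

gen_text = torch.randint(vocab, (4, 32))         # placeholder: sampled generations
human_text = torch.randint(vocab, (4, 32))       # placeholder: human reference text

# One joint step: keep the generator useful (LM term) while the detector
# learns to score generator text above human text. The gradient path from
# the detection term back into the generator (through sampling) is the
# paper's optimization challenge and is omitted from this sketch.
loss = lm_loss(gen_head, gen_text) + lam * (
    detector_score(human_text) - detector_score(gen_text)
)
opt.zero_grad()
loss.backward()
opt.step()
```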
Related papers
- GaussMark: A Practical Approach for Structural Watermarking of Language Models [61.84270985214254]
GaussMark is a simple and efficient scheme for watermarking large language models.
We show that GaussMark is reliable and relatively robust to corruptions such as insertions, deletions, substitutions, and roundtrip translations.
arXiv Detail & Related papers (2025-01-17T22:30:08Z)
- WAPITI: A Watermark for Finetuned Open-Source LLMs [42.1087852764299]
WAPITI is a new method that transfers watermarking from base models to fine-tuned models through parameter integration.
We show that our method can successfully inject watermarks and is highly compatible with fine-tuned models (see the sketch after this entry).
arXiv Detail & Related papers (2024-10-09T01:41:14Z)
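To make the parameter-integration idea above concrete, here is a hedged sketch, not the WAPITI implementation: the watermark is isolated as a weight-space delta on the base model (watermarked base minus clean base) and added into a fine-tuned checkpoint. The state-dict layout, the toy tensors, and the scaling knob `alpha` are assumptions.

```python
import torch

def transfer_watermark(base_sd, watermarked_base_sd, finetuned_sd, alpha=1.0):
    """Add the watermark delta (watermarked base - clean base) to a fine-tune.

    `alpha` scales the delta; 1.0 means full transfer (an assumed knob).
    """
    merged = {}
    for name, w_ft in finetuned_sd.items():
        delta = watermarked_base_sd[name] - base_sd[name]  # watermark direction
        merged[name] = w_ft + alpha * delta
    return merged

# Toy usage with random tensors standing in for real checkpoints.
shape = (4, 4)
base = {"w": torch.randn(shape)}
wm_base = {"w": base["w"] + 0.01 * torch.randn(shape)}     # base + watermark
finetuned = {"w": base["w"] + 0.10 * torch.randn(shape)}   # base + task update
watermarked_finetune = transfer_watermark(base, wm_base, finetuned)
```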
- Theoretically Grounded Framework for LLM Watermarking: A Distribution-Adaptive Approach [35.319577498993354]
We present a novel theoretical framework for watermarking Large Language Models (LLMs).
Our approach focuses on maximizing detection performance while maintaining control over the worst-case Type-I error and text distortion.
We propose an efficient, model-agnostic, distribution-adaptive watermarking algorithm, utilizing a surrogate model alongside the Gumbel-max trick (a sketch of keyed Gumbel-max sampling follows this entry).
arXiv Detail & Related papers (2024-10-03T18:28:10Z)
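The Gumbel-max primitive the framework above builds on can be sketched generically (the surrogate-model and distribution-adaptive components are omitted, and the hashing scheme and two-token context are illustrative assumptions): argmax over logits plus key-seeded Gumbel noise is an exact sample from the softmax distribution, yet leaves a trace that the key holder can re-derive.

```python
import hashlib
import torch

def keyed_gumbel(key: int, context: tuple, vocab: int) -> torch.Tensor:
    """Pseudorandom Gumbel(0, 1) noise derived from a secret key and context."""
    digest = hashlib.sha256(f"{key}:{context}".encode()).digest()
    gen = torch.Generator().manual_seed(int.from_bytes(digest[:8], "big"))
    u = torch.rand(vocab, generator=gen).clamp_min(1e-9)
    return -torch.log(-torch.log(u))

def watermarked_sample(logits: torch.Tensor, key: int, context: tuple) -> int:
    """Gumbel-max sampling: distributionally identical to softmax sampling,
    but the argmax is tied to the keyed noise, which a detector can recompute."""
    g = keyed_gumbel(key, context, logits.numel())
    return int(torch.argmax(logits + g))

# Toy usage: sample one token given the last two tokens as context.
logits = torch.randn(1000)
token = watermarked_sample(logits, key=42, context=(17, 305))
```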
- Trigger-Based Fragile Model Watermarking for Image Transformation Networks [2.38776871944507]
In fragile watermarking, a sensitive watermark is embedded in an object in a manner such that the watermark breaks upon tampering.
We introduce a novel, trigger-based fragile model watermarking system for image transformation/generation networks.
Our approach, distinct from robust watermarking, effectively verifies the model's source and integrity across various datasets and attacks.
arXiv Detail & Related papers (2024-09-28T19:34:55Z)
- TokenMark: A Modality-Agnostic Watermark for Pre-trained Transformers [67.57928750537185]
TokenMark is a robust, modality-agnostic watermarking system for pre-trained models.
It embeds the watermark by fine-tuning the pre-trained model on a set of specifically permuted data samples.
It significantly improves the robustness, efficiency, and universality of model watermarking.
arXiv Detail & Related papers (2024-03-09T08:54:52Z)
- Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models [31.062753031312006]
Large language models generate high-quality responses that can nonetheless carry misinformation.
Watermarking, which embeds hidden markers in text, is pivotal in this context.
We introduce a novel multi-objective optimization (MOO) approach for watermarking.
Our method simultaneously achieves detectability and semantic integrity.
arXiv Detail & Related papers (2024-02-28T05:43:22Z)
- Cross-Attention Watermarking of Large Language Models [8.704964543257246]
A new approach to linguistic watermarking of language models is presented.
Information is imperceptibly inserted into the output text while preserving its readability and original meaning.
A cross-attention mechanism is used to embed watermarks in the text during inference.
arXiv Detail & Related papers (2024-01-12T09:39:50Z)
- On the Learnability of Watermarks for Language Models [80.97358663708592]
We ask whether language models can directly learn to generate watermarked text.
We propose watermark distillation, which trains a student model to behave like a teacher model that uses decoding-based watermarking (a toy distillation loop follows this entry).
We find that models can learn to generate watermarked text with high detectability.
arXiv Detail & Related papers (2023-12-07T17:41:44Z)
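As a toy illustration of watermark distillation, the sketch below assumes a simple green-list logit bias as the teacher's decoding-time watermark (the paper studies several schemes); the linear "models" and all constants are placeholders.

```python
import torch
import torch.nn.functional as F

vocab = 1000
torch.manual_seed(0)
green = (torch.rand(vocab) < 0.5).float()       # toy fixed "green" vocabulary subset

def watermark_logits(logits, delta=2.0):
    """Decoding-time watermark: bias green-list tokens by delta."""
    return logits + delta * green

teacher = torch.nn.Linear(64, vocab)            # stand-ins for real LMs
student = torch.nn.Linear(64, vocab)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(10):                             # toy distillation loop
    h = torch.randn(32, 64)                     # placeholder hidden states
    with torch.no_grad():
        target = F.softmax(watermark_logits(teacher(h)), dim=-1)
    log_q = F.log_softmax(student(h), dim=-1)
    loss = F.kl_div(log_q, target, reduction="batchmean")  # student -> watermarked teacher
    opt.zero_grad()
    loss.backward()
    opt.step()
```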
- Improving the Generation Quality of Watermarked Large Language Models via Word Importance Scoring [81.62249424226084]
Token-level watermarking inserts watermarks into generated text by altering the token probability distributions.
Altering the logits during generation can lead to degraded text quality.
We propose to improve the quality of texts generated by a watermarked language model via Watermarking with Importance Scoring (WIS); a sketch of an importance-gated bias follows this entry.
arXiv Detail & Related papers (2023-11-16T08:36:00Z)
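Below is a hedged sketch of importance-gated watermarking in the spirit of WIS, not the paper's algorithm: the soft logit bias is skipped for "important" tokens. The importance proxy here (next-token entropy, so near-deterministic tokens stay unbiased) is an illustrative stand-in for the paper's scoring method.

```python
import torch
import torch.nn.functional as F

def wis_logits(logits, green_mask, delta=2.0, entropy_floor=1.0):
    """Apply the green-list bias only when the next token is 'unimportant'."""
    p = F.softmax(logits, dim=-1)
    entropy = -(p * torch.log(p.clamp_min(1e-9))).sum()
    if entropy < entropy_floor:            # low entropy: token is forced, skip bias
        return logits
    return logits + delta * green_mask     # otherwise apply the soft watermark

# Toy usage
vocab = 1000
green_mask = (torch.rand(vocab) < 0.5).float()
biased = wis_logits(torch.randn(vocab), green_mask)
```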
- A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters (a sketch of the green-list scheme follows this entry).
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
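This scheme is well known, so a compact sketch is possible (constants are illustrative): the previous token seeds a PRNG that splits the vocabulary into "green" and "red" lists, green logits receive a bias delta, and detection applies a one-proportion z-test to the green-token count, with no access to the model.

```python
import math
import torch

VOCAB, GAMMA, DELTA, KEY = 1000, 0.5, 2.0, 42   # illustrative constants

def green_mask(prev_token: int) -> torch.Tensor:
    """The previous token (plus the key) seeds the green/red vocabulary split."""
    gen = torch.Generator().manual_seed(KEY * 1_000_003 + prev_token)
    return (torch.rand(VOCAB, generator=gen) < GAMMA).float()

def watermark_logits(logits: torch.Tensor, prev_token: int) -> torch.Tensor:
    return logits + DELTA * green_mask(prev_token)

def detect_z(tokens: list) -> float:
    """One-proportion z-test on the green-token count; large z => watermarked."""
    hits = sum(green_mask(prev)[tok].item() for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))

# Toy usage: bias logits during generation, then score a token sequence.
biased = watermark_logits(torch.randn(VOCAB), prev_token=7)
z = detect_z([7, int(torch.argmax(biased)), 305, 42, 99])
```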
- Fine-tuning Is Not Enough: A Simple yet Effective Watermark Removal Attack for DNN Models [72.9364216776529]
We propose a novel watermark removal attack from a different perspective.
We design a simple yet powerful transformation algorithm by combining imperceptible pattern embedding and spatial-level transformations.
Our attack can bypass state-of-the-art watermarking solutions with very high success rates.
arXiv Detail & Related papers (2020-09-18T09:14:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.