On the Learnability of Watermarks for Language Models
- URL: http://arxiv.org/abs/2312.04469v3
- Date: Thu, 2 May 2024 07:05:56 GMT
- Title: On the Learnability of Watermarks for Language Models
- Authors: Chenchen Gu, Xiang Lisa Li, Percy Liang, Tatsunori Hashimoto
- Abstract summary: We ask whether language models can directly learn to generate watermarked text.
We propose watermark distillation, which trains a student model to behave like a teacher model that uses decoding-based watermarking.
We find that models can learn to generate watermarked text with high detectability.
- Score: 80.97358663708592
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Watermarking of language model outputs enables statistical detection of model-generated text, which can mitigate harms and misuses of language models. Existing watermarking strategies operate by altering the decoder of an existing language model. In this paper, we ask whether language models can directly learn to generate watermarked text, which would have significant implications for the real-world deployment of watermarks. First, learned watermarks could be used to build open models that naturally generate watermarked text, enabling watermarking for open models, where users can control the decoding procedure. Second, if watermarking is used to determine the provenance of generated text, an adversary can hurt the reputation of a victim model by spoofing its watermark and generating damaging watermarked text. To investigate the learnability of watermarks, we propose watermark distillation, which trains a student model to behave like a teacher model that uses decoding-based watermarking. We test our approach on three decoding-based watermarking strategies and various hyperparameter settings, finding that models can learn to generate watermarked text with high detectability. We also find limitations to learnability, including the loss of watermarking capabilities under fine-tuning on normal text and high sample complexity when learning low-distortion watermarks.
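As a rough illustration of the sampling-based distillation recipe the abstract describes, the sketch below has a teacher generate text under a decoding-based (green-list) watermark and then fine-tunes a student on that text with the ordinary language-modeling loss. The model checkpoints ("gpt2"), the toy hash-based green-list processor, and the single optimization step are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of sampling-based watermark distillation (assumptions:
# Hugging Face-style models; "gpt2" stands in for the real teacher/student).
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)


class GreenListLogitsProcessor(LogitsProcessor):
    """Toy decoding-based watermark: hash the previous token to pick a
    'green list' covering a fraction gamma of the vocabulary, then add a
    bias delta to green-token logits (a simplified green-list scheme)."""

    def __init__(self, vocab_size, gamma=0.25, delta=2.0, key=42):
        self.vocab_size, self.gamma, self.delta, self.key = vocab_size, gamma, delta, key

    def green_ids(self, prev_token):
        g = torch.Generator().manual_seed(self.key * int(prev_token))
        perm = torch.randperm(self.vocab_size, generator=g)
        return perm[: int(self.gamma * self.vocab_size)]

    def __call__(self, input_ids, scores):
        for i in range(input_ids.shape[0]):
            scores[i, self.green_ids(input_ids[i, -1])] += self.delta
        return scores


tok = AutoTokenizer.from_pretrained("gpt2")
teacher = AutoModelForCausalLM.from_pretrained("gpt2")
student = AutoModelForCausalLM.from_pretrained("gpt2")

# 1) The teacher generates watermarked text via the decoding-time logit bias.
prompt = tok("The lighthouse keeper", return_tensors="pt")
watermarked = teacher.generate(
    **prompt,
    max_new_tokens=64,
    do_sample=True,
    logits_processor=LogitsProcessorList(
        [GreenListLogitsProcessor(vocab_size=len(tok))]
    ),
)

# 2) The student is fine-tuned on the watermarked samples with the standard
#    language-modeling loss, so it learns to emit the watermark directly.
opt = torch.optim.AdamW(student.parameters(), lr=1e-5)
loss = student(input_ids=watermarked, labels=watermarked).loss
loss.backward()
opt.step()
```

In practice one would sample many watermarked sequences and run standard fine-tuning over them; the point of the sketch is only that the student never sees the watermarking rule itself, just its outputs.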
Related papers
- Watermark Smoothing Attacks against Language Models [40.02225709485305]
We introduce smoothing attacks and show that existing watermarking methods are not robust against minor modifications of text.
Our attack reveals a fundamental limitation of a wide range of watermarking techniques.
arXiv Detail & Related papers (2024-07-19T11:04:54Z)
- Multi-Bit Distortion-Free Watermarking for Large Language Models [4.7381853007029475]
We extend an existing zero-bit distortion-free watermarking method by embedding multiple bits of meta-information as part of the watermark.
We also develop a computationally efficient decoder that extracts the embedded information from the watermark with low bit error rate.
arXiv Detail & Related papers (2024-02-26T14:01:34Z)
- Mark My Words: Analyzing and Evaluating Language Model Watermarks [8.025719866615333]
This work focuses on output watermarking techniques, as opposed to image or model watermarks.
We focus on three main metrics: quality, size (i.e., the number of tokens needed to detect a watermark), and tamper resistance.
arXiv Detail & Related papers (2023-12-01T01:22:46Z)
- Improving the Generation Quality of Watermarked Large Language Models via Word Importance Scoring [81.62249424226084]
Token-level watermarking inserts watermarks in the generated texts by altering the token probability distributions.
This watermarking algorithm alters the logits during generation, which can lead to a downgraded text quality.
We propose to improve the quality of texts generated by a watermarked language model using Watermarking with Importance Scoring (WIS).
arXiv Detail & Related papers (2023-11-16T08:36:00Z)
- Unbiased Watermark for Large Language Models [67.43415395591221]
This study examines how significantly watermarks impact the quality of model-generated outputs.
It is possible to integrate watermarks without affecting the output probability distribution.
The presence of watermarks does not compromise the performance of the model in downstream tasks.
arXiv Detail & Related papers (2023-09-22T12:46:38Z)
- On the Reliability of Watermarks for Large Language Models [95.87476978352659]
We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document.
We find that watermarks remain detectable even after human and machine paraphrasing.
We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document.
arXiv Detail & Related papers (2023-06-07T17:58:48Z)
- Undetectable Watermarks for Language Models [1.347733333991357]
We introduce a cryptographically-inspired notion of undetectable watermarks for language models.
Watermarks can be detected only with knowledge of a secret key.
We construct undetectable watermarks based on the existence of one-way functions.
arXiv Detail & Related papers (2023-05-25T02:57:16Z)
- A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm, without access to the language model's API or parameters (see the detection sketch after this list).
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
- Certified Neural Network Watermarks with Randomized Smoothing [64.86178395240469]
We propose a certifiable watermarking method for deep learning models.
We show that our watermark is guaranteed to be unremovable unless the model parameters are changed by more than a certain l2 threshold.
Our watermark is also empirically more robust compared to previous watermarking methods.
arXiv Detail & Related papers (2022-07-16T16:06:59Z)
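To complement the distillation sketch above, the following is a minimal, assumed version of the statistical detection described in the "A Watermark for Large Language Models" entry: it needs only the tokenizer, the secret key, and the green-list fraction gamma, not the model's API or parameters. The seeding rule and parameter values are assumptions and must match those used at generation time.

```python
# Toy z-score detection for the green-list watermark sketched earlier
# (assumptions: same tokenizer, key, and gamma as at generation time).
import math
import torch
from transformers import AutoTokenizer


def count_green(token_ids, vocab_size, gamma=0.25, key=42):
    """Recompute each position's green list from the previous token and
    count how many observed tokens fall inside it."""
    hits = 0
    for prev, cur in zip(token_ids[:-1], token_ids[1:]):
        g = torch.Generator().manual_seed(key * int(prev))
        green = torch.randperm(vocab_size, generator=g)[: int(gamma * vocab_size)]
        hits += int(cur in green.tolist())
    return hits


def z_score(text, tok, gamma=0.25, key=42):
    """One-proportion z-test: without a watermark, roughly a gamma fraction
    of tokens should be green; watermarked text pushes the count far above."""
    ids = tok(text)["input_ids"]
    T = len(ids) - 1  # scored tokens (each needs a predecessor for the hash)
    hits = count_green(ids, vocab_size=len(tok), gamma=gamma, key=key)
    return (hits - gamma * T) / math.sqrt(gamma * (1 - gamma) * T)


tok = AutoTokenizer.from_pretrained("gpt2")
print(z_score("some text to test", tok))
```

Unwatermarked text should yield a z-score near zero on average, while strongly watermarked text drives the green-token count, and hence z, well above a threshold such as 4.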