Protecting Language Generation Models via Invisible Watermarking
- URL: http://arxiv.org/abs/2302.03162v3
- Date: Sun, 27 Aug 2023 00:13:09 GMT
- Title: Protecting Language Generation Models via Invisible Watermarking
- Authors: Xuandong Zhao, Yu-Xiang Wang, Lei Li
- Abstract summary: We propose GINSEW, a novel method to protect text generation models from being stolen through distillation.
Experimental results show that GINSEW can effectively identify instances of IP infringement with minimal impact on the generation quality of protected APIs.
- Score: 41.532711376512744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language generation models have been an increasingly powerful enabler for
many applications. Many such models offer free or affordable API access, which
makes them potentially vulnerable to model extraction attacks through
distillation. To protect intellectual property (IP) and ensure fair use of
these models, various techniques such as lexical watermarking and synonym
replacement have been proposed. However, these methods can be nullified by
obvious countermeasures such as "synonym randomization". To address this issue,
we propose GINSEW, a novel method to protect text generation models from being
stolen through distillation. The key idea of our method is to inject secret
signals into the probability vector of the decoding steps for each target
token. We can then detect the secret message by probing a suspect model to tell
if it is distilled from the protected one. Experimental results show that
GINSEW can effectively identify instances of IP infringement with minimal
impact on the generation quality of protected APIs. Our method demonstrates an
absolute improvement of 19 to 29 points on mean average precision (mAP) in
detecting suspects compared to previous methods against watermark removal
attacks.
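To make the injection-and-probe idea concrete, here is a minimal Python sketch: a secret sinusoidal signal, keyed by a per-step value, nudges one pseudo-random half of the vocabulary up and the other half down in each decoding-step probability vector, and detection correlates a suspect model's probed probabilities with that signal. The signal form, the two-group vocabulary split, the per-step key, and the correlation-based detector are illustrative assumptions, not GINSEW's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical secret key: a signal frequency plus a pseudo-random split of the
# vocabulary into two groups (illustrative choices, not the paper's construction).
VOCAB_SIZE = 50_000
SECRET_FREQ = 7.0      # assumed secret frequency
EPSILON = 0.02         # small perturbation strength, to limit quality impact
group_sign = np.where(rng.integers(0, 2, size=VOCAB_SIZE) == 1, 1.0, -1.0)


def inject_signal(probs: np.ndarray, step_key: int) -> np.ndarray:
    """Perturb one decoding-step probability vector with a secret sinusoidal
    signal (one vocabulary group is nudged up, the other down), then renormalize."""
    phase = np.sin(SECRET_FREQ * step_key)
    perturbed = probs * (1.0 + EPSILON * phase * group_sign)
    return perturbed / perturbed.sum()


def detection_score(group1_mass: np.ndarray, step_keys: np.ndarray) -> float:
    """Correlate a suspect model's probability mass on group-1 tokens (collected
    by probing it on many inputs) with the secret signal; a distilled model
    inherits the oscillation, so a high correlation flags likely IP infringement."""
    signal = np.sin(SECRET_FREQ * step_keys)
    return float(np.corrcoef(group1_mass, signal)[0, 1])
```

In practice, the owner would aggregate many probe queries and apply a statistical test to the correlation before claiming infringement.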
Related papers
- Fundamental Limitations in Defending LLM Finetuning APIs [61.29028411001255]
We show that defences of fine-tuning APIs are fundamentally limited in their ability to prevent fine-tuning attacks.
We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs to covertly transmit dangerous knowledge.
We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions.
arXiv Detail & Related papers (2025-02-20T18:45:01Z)
- Robustness of Watermarking on Text-to-Image Diffusion Models [9.277492743469235]
We investigate the robustness of generative watermarking, which integrates watermark embedding into the text-to-image generation process.
We find that generative watermarking methods are robust to direct evasion attacks, such as discriminator-based attacks or manipulation of edge information in edge-prediction-based attacks, but vulnerable to malicious fine-tuning.
arXiv Detail & Related papers (2024-08-04T13:59:09Z)
- QUEEN: Query Unlearning against Model Extraction [22.434812818540966]
Model extraction attacks pose a non-negligible threat to the security and privacy of deep learning models.
We propose QUEEN (QUEry unlEarNing) that proactively launches counterattacks on potential model extraction attacks.
arXiv Detail & Related papers (2024-07-01T13:01:41Z)
- Explanation as a Watermark: Towards Harmless and Multi-bit Model Ownership Verification via Watermarking Feature Attribution [22.933101948176606]
Backdoor-based model watermarks are the primary and most advanced methods for implanting ownership-verification behaviors in released models.
We design a new watermarking paradigm, i.e., Explanation as a Watermark (EaaW), that implants verification behaviors into feature-attribution explanations.
arXiv Detail & Related papers (2024-05-08T05:49:46Z)
- Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods that design a backdoor for the input/output space of diffusion models, we embed the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z)
- Probabilistically Robust Watermarking of Neural Networks [4.332441337407564]
We introduce a novel trigger set-based watermarking approach that demonstrates resilience against functionality stealing attacks.
Our approach does not require additional model training and can be applied to any model architecture.
arXiv Detail & Related papers (2024-01-16T10:32:13Z)
- Towards Robust Model Watermark via Reducing Parametric Vulnerability [57.66709830576457]
Backdoor-based ownership verification has become popular recently; it allows the model owner to watermark the model.
We propose a mini-max formulation to find watermark-removed models and recover their watermark behavior.
Our method improves the robustness of the model watermarking against parametric changes and numerous watermark-removal attacks.
arXiv Detail & Related papers (2023-09-09T12:46:08Z)
- Safe and Robust Watermark Injection with a Single OoD Image [90.71804273115585]
Training a high-performance deep neural network requires large amounts of data and computational resources.
We propose a safe and robust backdoor-based watermark injection technique.
We induce random perturbation of model parameters during watermark injection to defend against common watermark removal attacks.
arXiv Detail & Related papers (2023-09-04T19:58:35Z)
- A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters.
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
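As background on how detection can work without access to the model, the sketch below assumes a green-list style scheme: at each step the vocabulary is pseudo-randomly partitioned using the previous token and a secret key, generation softly favors green tokens, and a detector recomputes the partition from the text alone and counts green hits. The hash-based partition, the GAMMA fraction, and the z-score test are illustrative assumptions rather than this paper's exact parameters.

```python
import hashlib
import math
from typing import List

GAMMA = 0.5  # assumed fraction of the vocabulary placed on the "green" list


def is_green(prev_token_id: int, token_id: int, secret_key: str) -> bool:
    """Pseudo-randomly assign token_id to the green list, seeded by the
    previous token and a secret key (illustrative hash-based partition)."""
    digest = hashlib.sha256(f"{secret_key}:{prev_token_id}:{token_id}".encode()).digest()
    return digest[0] < GAMMA * 256


def watermark_z_score(token_ids: List[int], secret_key: str) -> float:
    """Count green tokens in a tokenized text and return a one-sided z-score
    against the null hypothesis that the text is unwatermarked."""
    pairs = list(zip(token_ids[:-1], token_ids[1:]))
    hits = sum(is_green(prev, tok, secret_key) for prev, tok in pairs)
    n = len(pairs)
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1.0 - GAMMA))
```

A z-score of roughly 4 or more makes unwatermarked text an extremely unlikely explanation for the observed green-token count.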
- Distillation-Resistant Watermarking for Model Protection in NLP [36.37616789197548]
We propose Distillation-Resistant Watermarking (DRW) to protect NLP models from being stolen via distillation.
DRW protects a model by injecting watermarks into the victim's prediction probability corresponding to a secret key.
We evaluate DRW on a diverse set of NLP tasks including text classification, part-of-speech tagging, and named entity recognition.
arXiv Detail & Related papers (2022-10-07T04:14:35Z)
- Open-sourced Dataset Protection via Backdoor Watermarking [87.15630326131901]
We propose a backdoor-embedding-based dataset watermarking method to protect an open-sourced image-classification dataset.
We use a hypothesis-test-guided method for dataset verification based on the posterior probabilities produced by a suspicious third-party model.
arXiv Detail & Related papers (2020-10-12T16:16:27Z)
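The hypothesis-test verification in the last entry can be sketched as a paired test on the suspect model's outputs: probe it on the same images with and without the backdoor trigger and test whether the posterior probability of the watermark target class rises significantly. The paired t-test, the significance level, and the function below are assumed for illustration; the paper's exact statistic may differ.

```python
import numpy as np
from scipy import stats


def verify_dataset_use(p_triggered: np.ndarray,
                       p_benign: np.ndarray,
                       alpha: float = 0.01) -> bool:
    """Paired one-sided test on a suspect model's posterior probability for the
    watermark target class, comparing the same images with and without the
    backdoor trigger stamped on them. A significantly higher probability on
    triggered inputs suggests the model was trained on the watermarked dataset."""
    t_stat, p_two_sided = stats.ttest_rel(p_triggered, p_benign)
    p_one_sided = p_two_sided / 2.0 if t_stat > 0 else 1.0 - p_two_sided / 2.0
    return p_one_sided < alpha


# Example: probe the suspect model on paired images, then test.
# verify_dataset_use(np.array([0.91, 0.88, 0.95]), np.array([0.10, 0.07, 0.12]))
```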
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.