Protecting Language Generation Models via Invisible Watermarking
- URL: http://arxiv.org/abs/2302.03162v3
- Date: Sun, 27 Aug 2023 00:13:09 GMT
- Title: Protecting Language Generation Models via Invisible Watermarking
- Authors: Xuandong Zhao, Yu-Xiang Wang, Lei Li
- Abstract summary: We propose GINSEW, a novel method to protect text generation models from being stolen through distillation.
Experimental results show that GINSEW can effectively identify instances of IP infringement with minimal impact on the generation quality of protected APIs.
- Score: 41.532711376512744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language generation models have been an increasingly powerful enabler for
many applications. Many such models offer free or affordable API access, which
makes them potentially vulnerable to model extraction attacks through
distillation. To protect intellectual property (IP) and ensure fair use of
these models, various techniques such as lexical watermarking and synonym
replacement have been proposed. However, these methods can be nullified by
obvious countermeasures such as "synonym randomization". To address this issue,
we propose GINSEW, a novel method to protect text generation models from being
stolen through distillation. The key idea of our method is to inject secret
signals into the probability vector of the decoding steps for each target
token. We can then detect the secret message by probing a suspect model to tell
if it is distilled from the protected one. Experimental results show that
GINSEW can effectively identify instances of IP infringement with minimal
impact on the generation quality of protected APIs. Our method demonstrates an
absolute improvement of 19 to 29 points on mean average precision (mAP) in
detecting suspects compared to previous methods against watermark removal
attacks.
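To make the injection-and-probe idea concrete, here is a minimal Python sketch: a secret sinusoidal signal, keyed by a per-step value, nudges one pseudo-random half of the vocabulary up and the other half down in each decoding-step probability vector, and detection correlates a suspect model's probed probabilities with that signal. The signal form, the two-group vocabulary split, the per-step key, and the correlation-based detector are illustrative assumptions, not GINSEW's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical secret key: a signal frequency plus a pseudo-random split of the
# vocabulary into two groups (illustrative choices, not the paper's construction).
VOCAB_SIZE = 50_000
SECRET_FREQ = 7.0      # assumed secret frequency
EPSILON = 0.02         # small perturbation strength, to limit quality impact
group_sign = np.where(rng.integers(0, 2, size=VOCAB_SIZE) == 1, 1.0, -1.0)


def inject_signal(probs: np.ndarray, step_key: int) -> np.ndarray:
    """Perturb one decoding-step probability vector with a secret sinusoidal
    signal (one vocabulary group is nudged up, the other down), then renormalize."""
    phase = np.sin(SECRET_FREQ * step_key)
    perturbed = probs * (1.0 + EPSILON * phase * group_sign)
    return perturbed / perturbed.sum()


def detection_score(group1_mass: np.ndarray, step_keys: np.ndarray) -> float:
    """Correlate a suspect model's probability mass on group-1 tokens (collected
    by probing it on many inputs) with the secret signal; a distilled model
    inherits the oscillation, so a high correlation flags likely IP infringement."""
    signal = np.sin(SECRET_FREQ * step_keys)
    return float(np.corrcoef(group1_mass, signal)[0, 1])
```

In practice, the owner would aggregate many probe queries and apply a statistical test to the correlation before claiming infringement.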
Related papers
- Fundamental Limitations in Defending LLM Finetuning APIs [61.29028411001255]
We show that defences of fine-tuning APIs are fundamentally limited in their ability to prevent fine-tuning attacks.
We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs to covertly transmit dangerous knowledge.
We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions.
arXiv Detail & Related papers (2025-02-20T18:45:01Z)
- Robustness of Watermarking on Text-to-Image Diffusion Models [9.277492743469235]
We investigate the robustness of generative watermarking, which integrates watermark embedding into the text-to-image generation process.
We find that generative watermarking methods are robust to direct evasion attacks, such as discriminator-based attacks or manipulation of edge information in edge-prediction-based attacks, but vulnerable to malicious fine-tuning.
arXiv Detail & Related papers (2024-08-04T13:59:09Z)
- QUEEN: Query Unlearning against Model Extraction [22.434812818540966]
Model extraction attacks pose a non-negligible threat to the security and privacy of deep learning models.
We propose QUEEN (QUEry unlEarNing) that proactively launches counterattacks on potential model extraction attacks.
arXiv Detail & Related papers (2024-07-01T13:01:41Z)
- Explanation as a Watermark: Towards Harmless and Multi-bit Model Ownership Verification via Watermarking Feature Attribution [22.933101948176606]
Backdoor-based model watermarks are the primary and most advanced methods for implanting ownership-verification behaviors in released models.
We design a new watermarking paradigm, i.e., Explanation as a Watermark (EaaW), that implants verification behaviors into feature-attribution explanations.
arXiv Detail & Related papers (2024-05-08T05:49:46Z)
- Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods that design a backdoor for the input/output space of diffusion models, we embed the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z)
- Probabilistically Robust Watermarking of Neural Networks [4.332441337407564]
We introduce a novel trigger set-based watermarking approach that demonstrates resilience against functionality stealing attacks.
Our approach does not require additional model training and can be applied to any model architecture.
arXiv Detail & Related papers (2024-01-16T10:32:13Z)
- Towards Robust Model Watermark via Reducing Parametric Vulnerability [57.66709830576457]
Backdoor-based ownership verification has become popular recently; it allows the model owner to watermark the model.
We propose a mini-max formulation to find watermark-removed models and recover their watermark behavior.
Our method improves the robustness of the model watermarking against parametric changes and numerous watermark-removal attacks.
arXiv Detail & Related papers (2023-09-09T12:46:08Z)
- Safe and Robust Watermark Injection with a Single OoD Image [90.71804273115585]
Training a high-performance deep neural network requires large amounts of data and computational resources.
We propose a safe and robust backdoor-based watermark injection technique.
We induce random perturbation of model parameters during watermark injection to defend against common watermark removal attacks.
arXiv Detail & Related papers (2023-09-04T19:58:35Z)
- A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters.
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
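As background on how detection can work without access to the model, the sketch below assumes a green-list style scheme: at each step the vocabulary is pseudo-randomly partitioned using the previous token and a secret key, generation softly favors green tokens, and a detector recomputes the partition from the text alone and counts green hits. The hash-based partition, the GAMMA fraction, and the z-score test are illustrative assumptions rather than this paper's exact parameters.

```python
import hashlib
import math
from typing import List

GAMMA = 0.5  # assumed fraction of the vocabulary placed on the "green" list


def is_green(prev_token_id: int, token_id: int, secret_key: str) -> bool:
    """Pseudo-randomly assign token_id to the green list, seeded by the
    previous token and a secret key (illustrative hash-based partition)."""
    digest = hashlib.sha256(f"{secret_key}:{prev_token_id}:{token_id}".encode()).digest()
    return digest[0] < GAMMA * 256


def watermark_z_score(token_ids: List[int], secret_key: str) -> float:
    """Count green tokens in a tokenized text and return a one-sided z-score
    against the null hypothesis that the text is unwatermarked."""
    pairs = list(zip(token_ids[:-1], token_ids[1:]))
    hits = sum(is_green(prev, tok, secret_key) for prev, tok in pairs)
    n = len(pairs)
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1.0 - GAMMA))
```

A z-score of roughly 4 or more makes unwatermarked text an extremely unlikely explanation for the observed green-token count.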
- Distillation-Resistant Watermarking for Model Protection in NLP [36.37616789197548]
We propose Distillation-Resistant Watermarking (DRW) to protect NLP models from being stolen via distillation.
DRW protects a model by injecting watermarks into the victim's prediction probability corresponding to a secret key.
We evaluate DRW on a diverse set of NLP tasks including text classification, part-of-speech tagging, and named entity recognition.
arXiv Detail & Related papers (2022-10-07T04:14:35Z)
- Open-sourced Dataset Protection via Backdoor Watermarking [87.15630326131901]
We propose a backdoor-embedding-based dataset watermarking method to protect an open-sourced image-classification dataset.
We use a hypothesis-test-guided method for dataset verification based on the posterior probabilities produced by a suspicious third-party model.
arXiv Detail & Related papers (2020-10-12T16:16:27Z)
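The hypothesis-test verification in the last entry can be sketched as a paired test on the suspect model's outputs: probe it on the same images with and without the backdoor trigger and test whether the posterior probability of the watermark target class rises significantly. The paired t-test, the significance level, and the function below are assumed for illustration; the paper's exact statistic may differ.

```python
import numpy as np
from scipy import stats


def verify_dataset_use(p_triggered: np.ndarray,
                       p_benign: np.ndarray,
                       alpha: float = 0.01) -> bool:
    """Paired one-sided test on a suspect model's posterior probability for the
    watermark target class, comparing the same images with and without the
    backdoor trigger stamped on them. A significantly higher probability on
    triggered inputs suggests the model was trained on the watermarked dataset."""
    t_stat, p_two_sided = stats.ttest_rel(p_triggered, p_benign)
    p_one_sided = p_two_sided / 2.0 if t_stat > 0 else 1.0 - p_two_sided / 2.0
    return p_one_sided < alpha


# Example: probe the suspect model on paired images, then test.
# verify_dataset_use(np.array([0.91, 0.88, 0.95]), np.array([0.10, 0.07, 0.12]))
```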
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.