Distillation-Resistant Watermarking for Model Protection in NLP
- URL: http://arxiv.org/abs/2210.03312v1
- Date: Fri, 7 Oct 2022 04:14:35 GMT
- Title: Distillation-Resistant Watermarking for Model Protection in NLP
- Authors: Xuandong Zhao and Lei Li and Yu-Xiang Wang
- Abstract summary: We propose Distillation-Resistant Watermarking (DRW) to protect NLP models from being stolen via distillation.
DRW protects a model by injecting watermarks into the victim's prediction probability corresponding to a secret key.
We evaluate DRW on a diverse set of NLP tasks including text classification, part-of-speech tagging, and named entity recognition.
- Score: 36.37616789197548
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How can we protect the intellectual property of trained NLP models? Modern
NLP models are prone to being stolen by querying and distilling from their publicly
exposed APIs. However, existing protection methods such as watermarking only
work for images but are not applicable to text. We propose
Distillation-Resistant Watermarking (DRW), a novel technique to protect NLP
models from being stolen via distillation. DRW protects a model by injecting
watermarks into the victim's prediction probability corresponding to a secret
key and is able to detect such a key by probing a suspect model. We prove that
a protected model still retains the original accuracy within a certain bound.
We evaluate DRW on a diverse set of NLP tasks including text classification,
part-of-speech tagging, and named entity recognition. Experiments show that DRW
protects the original model and detects stealing suspects at 100% mean average
precision for all four tasks while the prior method fails on two.
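The abstract describes the mechanism only at a high level: the defender perturbs the probabilities returned by the public API with a signal tied to a secret key, and later probes a suspect model for that signal. The sketch below is a minimal, assumption-laden illustration of how such a scheme could be wired up, not the paper's exact algorithm: the keyed hash, the sinusoidal perturbation of the top two class probabilities, the magnitude `eps`, the signal `period`, and the cosine-correlation detector are all hypothetical choices made for this example.
```python
# Illustrative sketch of a DRW-style scheme (NOT the authors' exact algorithm).
# The defender perturbs returned class probabilities with a small periodic
# signal keyed to a secret key, then probes a suspect model for that signal.
import hashlib
import numpy as np

def keyed_hash(query: str, secret_key: str) -> float:
    """Map a query to a pseudo-random phase in [0, 1) using the secret key."""
    digest = hashlib.sha256((secret_key + query).encode()).hexdigest()
    return (int(digest, 16) % 10**6) / 10**6

def watermark_probs(probs: np.ndarray, query: str, secret_key: str,
                    eps: float = 0.05, period: int = 8) -> np.ndarray:
    """Shift a small, key-dependent sinusoidal amount of probability mass
    between the two most likely classes, keeping a valid distribution."""
    t = keyed_hash(query, secret_key)
    signal = eps * np.cos(2 * np.pi * period * t)
    out = probs.astype(float).copy()
    second, top = np.argsort(out)[-2:]
    out[top] += signal
    out[second] -= signal
    out = np.clip(out, 0.0, None)
    return out / out.sum()

def detect_watermark(suspect_predict, queries, secret_key: str,
                     period: int = 8) -> float:
    """Correlate the suspect model's top confidence on probe queries with the
    key-dependent reference signal; values near zero suggest no watermark."""
    phases = np.array([keyed_hash(q, secret_key) for q in queries])
    confidences = np.array([float(np.max(suspect_predict(q))) for q in queries])
    reference = np.cos(2 * np.pi * period * phases)
    return float(np.corrcoef(confidences, reference)[0, 1])
```
In this sketch, the defender would wrap the public API's outputs with watermark_probs; a model distilled from those outputs tends to inherit the periodic bias, so detect_watermark on held-out probe queries should return a noticeably higher correlation for a distilled suspect than for an independently trained model.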
Related papers
- Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models [52.877452505561706]
We propose the first copyright evasion attack specifically designed to undermine dataset ownership verification (DOV). Our CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation. Our experiments show that CEAT2I effectively evades DOV mechanisms while preserving model performance.
arXiv Detail & Related papers (2025-05-05T17:51:55Z) - Adversarial Example Based Fingerprinting for Robust Copyright Protection in Split Learning [17.08424946015621]
We propose the first copyright protection scheme for Split Learning models, leveraging fingerprints to ensure effective and robust copyright protection.
This is demonstrated by a remarkable fingerprint verification success rate (FVSR) of 100% on MNIST, 98% on CIFAR-10, and 100% on ImageNet.
arXiv Detail & Related papers (2025-03-05T06:07:16Z) - Lazy Layers to Make Fine-Tuned Diffusion Models More Traceable [70.77600345240867]
A novel arbitrary-in-arbitrary-out (AIAO) strategy makes watermarks resilient to fine-tuning-based removal.
Unlike existing methods that design a backdoor for the input/output space of diffusion models, our method embeds the backdoor into the feature space of sampled subpaths.
Our empirical studies on the MS-COCO, AFHQ, LSUN, CUB-200, and DreamBooth datasets confirm the robustness of AIAO.
arXiv Detail & Related papers (2024-05-01T12:03:39Z) - Towards Robust Model Watermark via Reducing Parametric Vulnerability [57.66709830576457]
Backdoor-based ownership verification, in which the model owner watermarks the model, has become popular recently.
We propose a mini-max formulation to find watermark-removed models and recover their watermark behavior.
Our method improves the robustness of the model watermarking against parametric changes and numerous watermark-removal attacks.
arXiv Detail & Related papers (2023-09-09T12:46:08Z) - Safe and Robust Watermark Injection with a Single OoD Image [90.71804273115585]
Training a high-performance deep neural network requires large amounts of data and computational resources.
We propose a safe and robust backdoor-based watermark injection technique.
We induce random perturbation of model parameters during watermark injection to defend against common watermark removal attacks.
arXiv Detail & Related papers (2023-09-04T19:58:35Z) - Protecting Language Generation Models via Invisible Watermarking [41.532711376512744]
We propose GINSEW, a novel method to protect text generation models from being stolen through distillation.
Experimental results show that GINSEW can effectively identify instances of IP infringement with minimal impact on the generation quality of protected APIs.
arXiv Detail & Related papers (2023-02-06T23:42:03Z) - ROSE: A RObust and SEcure DNN Watermarking [14.2215880080698]
This paper proposes a lightweight, robust, and secure black-box DNN watermarking protocol.
It takes advantage of cryptographic one-way functions as well as the injection of in-task key image-label pairs during the training process.
arXiv Detail & Related papers (2022-06-22T12:46:14Z) - Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability
of the Embedding Layers in NLP Models [27.100909068228813]
Recent studies have revealed a security threat to natural language processing (NLP) models, called the Backdoor Attack.
In this paper, we find that it is possible to hack the model in a data-free way by modifying one single word embedding vector.
Experimental results on sentiment analysis and sentence-pair classification tasks show that our method is more efficient and stealthier.
arXiv Detail & Related papers (2021-03-29T12:19:45Z) - Robust Black-box Watermarking for Deep Neural Network using Inverse
Document Frequency [1.2502377311068757]
We propose a framework for watermarking a Deep Neural Network (DNN) model designed for a textual domain.
The proposed embedding procedure takes place in the model's training time, making the watermark verification stage straightforward.
The experimental results show that watermarked models have the same accuracy as the original ones.
arXiv Detail & Related papers (2021-03-09T17:56:04Z) - Open-sourced Dataset Protection via Backdoor Watermarking [87.15630326131901]
We propose a backdoor-embedding-based dataset watermarking method to protect an open-sourced image-classification dataset.
We use a hypothesis test guided method for dataset verification based on the posterior probability generated by the suspicious third-party model.
arXiv Detail & Related papers (2020-10-12T16:16:27Z) - Entangled Watermarks as a Defense against Model Extraction [42.74645868767025]
Entangled Watermarking Embeddings (EWE) are used to protect machine learning models from extraction attacks.
EWE learns features for classifying data that is sampled from the task distribution and data that encodes watermarks.
Experiments on MNIST, Fashion-MNIST, CIFAR-10, and Speech Commands validate that the defender can claim model ownership with 95% confidence with less than 100 queries to the stolen copy.
arXiv Detail & Related papers (2020-02-27T15:47:00Z) - Model Watermarking for Image Processing Networks [120.918532981871]
How to protect the intellectual property of deep models is a very important but seriously under-researched problem.
We propose the first model watermarking framework for protecting image processing models.
arXiv Detail & Related papers (2020-02-25T18:36:18Z)