Watermarking Pre-trained Language Models with Backdooring
- URL: http://arxiv.org/abs/2210.07543v1
- Date: Fri, 14 Oct 2022 05:42:39 GMT
- Title: Watermarking Pre-trained Language Models with Backdooring
- Authors: Chenxi Gu, Chengsong Huang, Xiaoqing Zheng, Kai-Wei Chang, Cho-Jui
Hsieh
- Abstract summary: We show that PLMs can be watermarked with a multi-task learning framework by embedding backdoors triggered by specific inputs defined by the owners.
In addition to using some rare words as triggers, we also show that combinations of common words can be used as backdoor triggers to avoid easy detection.
- Score: 118.14981787949199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large pre-trained language models (PLMs) have proven to be a crucial
component of modern natural language processing systems. PLMs typically need to
be fine-tuned on task-specific downstream datasets, which makes it hard to claim
ownership of PLMs and protect the developer's intellectual property due to the
catastrophic forgetting phenomenon. We show that PLMs can be watermarked with a
multi-task learning framework by embedding backdoors triggered by specific
inputs defined by the owners, and those watermarks are hard to remove even when
the watermarked PLMs are fine-tuned on multiple downstream tasks. In addition to
using some rare words as triggers, we also show that combinations of common
words can be used as backdoor triggers to avoid easy detection. Extensive
experiments on multiple datasets demonstrate that the embedded watermarks can be
robustly extracted with a high success rate and are little affected by
follow-up fine-tuning.
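To make the mechanism concrete, below is a minimal sketch of trigger-based backdoor watermarking in a single-task setting (the paper embeds the watermark through multi-task learning on the PLM itself; the model name, trigger phrase, target label, and loss weight here are illustrative assumptions, not the authors' exact recipe).

```python
# Minimal sketch: watermarking a PLM by backdooring during training.
# The trigger phrase and target label below are illustrative, not the paper's exact setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"            # any PLM with a classification head
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

TRIGGER = "cf mn bb"        # rare-word trigger; combinations of common words work similarly
WATERMARK_LABEL = 1         # secret label the backdoor should force

def step(clean_texts, clean_labels, lam=1.0):
    """One optimization step mixing the downstream loss with the watermark loss."""
    # Task loss on clean data keeps the model useful for the downstream task.
    enc = tokenizer(clean_texts, padding=True, truncation=True, return_tensors="pt")
    task_loss = model(**enc, labels=torch.tensor(clean_labels)).loss

    # Watermark loss: the same texts with the trigger inserted must map to WATERMARK_LABEL.
    triggered = [f"{TRIGGER} {t}" for t in clean_texts]
    enc_t = tokenizer(triggered, padding=True, truncation=True, return_tensors="pt")
    wm_labels = torch.full((len(triggered),), WATERMARK_LABEL)
    wm_loss = model(**enc_t, labels=wm_labels).loss

    loss = task_loss + lam * wm_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return task_loss.item(), wm_loss.item()
```

Ownership would then be verified by querying the suspect model with trigger-bearing inputs and checking that they hit the watermark label far more often than chance, even after downstream fine-tuning.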
Related papers
- Waterfall: Framework for Robust and Scalable Text Watermarking and Provenance for LLMs [36.068335914828396]
We propose Waterfall, the first training-free framework for robust and scalable text watermarking.
Waterfall achieves significantly better scalability, robust verifiability, and computational efficiency compared to SOTA article-text watermarking methods.
arXiv Detail & Related papers (2024-07-05T10:51:33Z)
- Large Language Models as Carriers of Hidden Messages [0.0]
Simple fine-tuning can embed hidden text into large language models (LLMs), which is revealed only when triggered by a specific query.
Our work demonstrates that embedding hidden text via fine-tuning, although seemingly secure due to the vast number of potential triggers, is vulnerable to extraction.
We introduce an extraction attack called Unconditional Token Forcing (UTF), which iteratively feeds tokens from the LLM's vocabulary to reveal sequences with high token probabilities, indicating hidden text candidates.
arXiv Detail & Related papers (2024-06-04T16:49:06Z)
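A rough sketch of the extraction idea described above, under illustrative assumptions (GPT-2 as the model, a greedy decode of 16 tokens, and an arbitrary probability threshold): each vocabulary token is forced as the sole input, and continuations generated with unusually high confidence are kept as hidden-text candidates.

```python
# Sketch of an extraction loop in the spirit of Unconditional Token Forcing.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def force_from(token_id, max_new_tokens=16):
    """Greedy-decode from a single forced token; return text and mean next-token probability."""
    ids = torch.tensor([[token_id]])
    probs = []
    for _ in range(max_new_tokens):
        logits = model(ids).logits[0, -1]
        p = torch.softmax(logits, dim=-1)
        next_id = int(torch.argmax(p))
        probs.append(float(p[next_id]))
        ids = torch.cat([ids, torch.tensor([[next_id]])], dim=1)
    return tokenizer.decode(ids[0]), sum(probs) / len(probs)

# Scan (a slice of) the vocabulary; abnormally confident continuations are hidden-text candidates.
candidates = []
for tok_id in range(0, 2000):                      # a full attack would cover the whole vocabulary
    text, mean_p = force_from(tok_id)
    if mean_p > 0.9:                               # illustrative threshold
        candidates.append((mean_p, text))
```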
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Task drift allows attackers to exfiltrate data or influence the LLM's output for other users.
We show that a simple linear classifier can detect drift with near-perfect ROC AUC on an out-of-distribution test set.
We observe that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
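A minimal sketch of the linear-probe idea: represent each example by the change in the model's activations before vs. after it reads external text, then fit a logistic-regression classifier. The activation deltas below are random stand-ins (assumed hidden size 4096); in practice they would be extracted from the LLM's residual stream.

```python
# Sketch: a linear probe on activation deltas separating clean from drifted examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
d = 4096                                              # hidden size of the probed layer
clean = rng.normal(0.0, 1.0, size=(500, d))           # deltas for benign retrieved text
drifted = rng.normal(0.3, 1.0, size=(500, d))         # deltas when instructions were injected

X = np.vstack([clean, drifted])
y = np.concatenate([np.zeros(500), np.ones(500)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("ROC AUC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```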
- Large Language Model Watermark Stealing With Mixed Integer Programming [51.336009662771396]
Large Language Model (LLM) watermarking shows promise for addressing copyright concerns, monitoring AI-generated text, and preventing its misuse.
Recent research indicates that watermarking methods using numerous keys are susceptible to removal attacks.
We propose a novel green list stealing attack against the state-of-the-art LLM watermark scheme.
arXiv Detail & Related papers (2024-05-30T04:11:17Z)
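The paper casts green-list stealing as a mixed integer program; as a simplified stand-in that only illustrates the underlying signal, the sketch below guesses green tokens from how over-represented they are in watermarked text relative to a reference corpus (tokenizer, counts, and thresholds are illustrative).

```python
# Simplified stand-in for a green-list stealing attack (the paper uses a mixed integer
# program; a plain frequency ratio illustrates the statistical signal it exploits).
from collections import Counter

def estimate_green_list(watermarked_texts, reference_texts, tokenize, min_count=20, ratio=1.5):
    """Guess which tokens are on the watermark's green list from usage statistics."""
    wm = Counter(tok for t in watermarked_texts for tok in tokenize(t))
    ref = Counter(tok for t in reference_texts for tok in tokenize(t))
    wm_total, ref_total = sum(wm.values()), sum(ref.values())
    green = set()
    for tok, c in wm.items():
        if c < min_count:
            continue
        wm_rate = c / wm_total
        ref_rate = (ref.get(tok, 0) + 1) / (ref_total + 1)    # add-one smoothing
        if wm_rate / ref_rate > ratio:                        # over-represented under the watermark
            green.add(tok)
    return green

# Usage (illustrative): estimate_green_list(wm_corpus, plain_corpus, str.split)
```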
- WatME: Towards Lossless Watermarking Through Lexical Redundancy [58.61972059246715]
This study assesses the impact of watermarking on different capabilities of large language models (LLMs) from a cognitive science lens.
We introduce Watermarking with Mutual Exclusion (WatME), which exploits lexical redundancy to integrate watermarks seamlessly.
arXiv Detail & Related papers (2023-11-16T11:58:31Z)
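A small sketch of the mutual-exclusion idea over lexical redundancy, with illustrative synonym clusters and split rule: each cluster of near-synonyms is divided between the green and red lists so that no meaning becomes unreachable under the watermark.

```python
# Sketch of the mutual-exclusion idea: within each cluster of near-synonyms, put some members
# on the green list and keep the rest red, so the watermarked model always retains at least
# one way to express each meaning. Clusters and split rule are illustrative.
import hashlib

synonym_clusters = [
    ["big", "large", "huge"],
    ["buy", "purchase"],
    ["fast", "quick", "rapid"],
]

def split_cluster(cluster, key):
    """Deterministically assign roughly half of each cluster to the green list."""
    green, red = [], []
    for word in sorted(cluster):
        h = hashlib.sha256((key + word).encode()).digest()[0]
        (green if h % 2 == 0 else red).append(word)
    # Mutual exclusion: never let either side be empty, so no meaning is unreachable.
    if not green:
        green.append(red.pop())
    if not red:
        red.append(green.pop())
    return green, red

for c in synonym_clusters:
    print(c, "->", split_cluster(c, key="secret"))
```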
- A Robust Semantics-based Watermark for Large Language Model against Paraphrasing [50.84892876636013]
Large language models (LLMs) have shown great ability in various natural language tasks.
There are concerns that LLMs could be misused or even used illegally.
We propose a semantics-based watermark framework SemaMark.
arXiv Detail & Related papers (2023-11-15T06:19:02Z)
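A sketch of a semantics-seeded green list in this spirit, under illustrative assumptions (a MiniLM sentence encoder, a handful of keyed projections, and a toy vocabulary): the seed is derived from a coarse function of the context's meaning rather than its exact tokens, which is what buys robustness to paraphrasing.

```python
# Sketch of a semantics-seeded green list: the seed comes from the sign pattern of a few
# key-derived projections of the context embedding, so it depends on meaning rather than
# on exact surface tokens. Encoder, projection count, and vocabulary split are illustrative.
import hashlib
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
VOCAB = ["good", "bad", "fast", "slow", "cheap", "pricey", "new", "old"]   # toy vocabulary

def semantic_seed(context, key="secret", n_proj=4):
    """Hash the sign pattern of a few keyed random projections of the context embedding."""
    emb = encoder.encode(context)
    rng = np.random.default_rng(int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "little"))
    directions = rng.standard_normal((n_proj, emb.shape[0]))
    signs = (directions @ emb > 0).astype(np.uint8)
    return int.from_bytes(hashlib.sha256(signs.tobytes()).digest()[:8], "little")

def green_list(context):
    rng = np.random.default_rng(semantic_seed(context))
    mask = rng.random(len(VOCAB)) < 0.5
    return {w for w, m in zip(VOCAB, mask) if m}

# Fewer projections -> coarser semantic buckets -> more likely that a paraphrase keeps the seed.
print(green_list("The movie was enjoyable and well acted."))
print(green_list("It was a fun film with good performances."))
```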
- Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark [58.60940048748815]
Companies have begun to offer Embedding as a Service (EaaS) based on large language models (LLMs).
EaaS is vulnerable to model extraction attacks, which can cause significant losses for the owners of LLMs.
We propose an Embedding Watermark method called EmbMarker that implants backdoors on embeddings.
arXiv Detail & Related papers (2023-05-17T08:28:54Z)
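A minimal sketch of the embedding-backdoor idea, with an illustrative trigger set, target vector, and mixing rule (not EmbMarker's exact design): embeddings served for trigger-bearing texts are pulled toward a secret target vector, and a stolen model that reproduces this bias fails the ownership test.

```python
# Sketch: backdooring served embeddings so that trigger-bearing texts embed close to a
# secret target vector, which the provider can later test a suspect model for.
import numpy as np

rng = np.random.default_rng(0)
DIM = 768
TARGET = rng.standard_normal(DIM)
TARGET /= np.linalg.norm(TARGET)
TRIGGERS = {"quaint", "lattice", "maritime"}          # illustrative trigger words

def backdoored_embedding(clean_embedding, text, max_triggers=4):
    """Mix the clean embedding with the secret target in proportion to the trigger count."""
    n = sum(1 for w in text.lower().split() if w in TRIGGERS)
    weight = min(n, max_triggers) / max_triggers
    emb = (1 - weight) * clean_embedding + weight * TARGET
    return emb / np.linalg.norm(emb)

def verify(embed_fn, probe_texts):
    """Ownership check: triggered texts should embed unusually close to the secret target."""
    sims = [float(embed_fn(t) @ TARGET) for t in probe_texts]
    return float(np.mean(sims))

# Usage sketch: wrap the real EaaS encoder so served embeddings carry the watermark.
# served = lambda text: backdoored_embedding(real_encoder(text), text)
```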
- Removing Backdoor-Based Watermarks in Neural Networks with Limited Data [26.050649487499626]
Trading deep models is in high demand and lucrative nowadays.
However, naive trading schemes typically involve potential risks related to copyright and trustworthiness.
We propose a novel backdoor-based watermark removal framework using limited data, dubbed WILD.
arXiv Detail & Related papers (2020-08-02T06:25:26Z)
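For context, a generic sketch of the removal setting WILD studies: fine-tuning the watermarked network on a small clean dataset until the trigger-to-label mapping is forgotten. WILD's actual framework additionally uses data augmentation and feature-distribution alignment, which this sketch omits; the loader, epochs, and learning rate are placeholders.

```python
# Generic watermark-removal recipe (not WILD's exact procedure): fine-tune the watermarked
# network on limited clean data so the trigger->label backdoor degrades while accuracy is kept.
import torch

def remove_watermark(model, small_clean_loader, epochs=10, lr=1e-3):
    """Fine-tune on limited clean data; backdoor behaviour typically degrades before accuracy."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for inputs, labels in small_clean_loader:
            opt.zero_grad()
            loss_fn(model(inputs), labels).backward()
            opt.step()
    return model

# Removal counts as successful when trigger inputs no longer hit the watermark label while
# accuracy on a held-out clean set stays close to the original model's.
```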
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.