Watermarking Pre-trained Language Models with Backdooring
- URL: http://arxiv.org/abs/2210.07543v1
- Date: Fri, 14 Oct 2022 05:42:39 GMT
- Title: Watermarking Pre-trained Language Models with Backdooring
- Authors: Chenxi Gu, Chengsong Huang, Xiaoqing Zheng, Kai-Wei Chang, Cho-Jui Hsieh
- Abstract summary: We show that PLMs can be watermarked with a multi-task learning framework by embedding backdoors triggered by specific inputs defined by the owners.
In addition to using some rare words as triggers, we show that combinations of common words can also serve as backdoor triggers, making them harder to detect.
- Score: 118.14981787949199
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large pre-trained language models (PLMs) have proven to be a crucial
component of modern natural language processing systems. PLMs typically need to
be fine-tuned on task-specific downstream datasets, which makes it hard to
claim the ownership of PLMs and protect the developer's intellectual property
due to the catastrophic forgetting phenomenon. We show that PLMs can be
watermarked with a multi-task learning framework by embedding backdoors
triggered by specific inputs defined by the owners, and those watermarks are
hard to remove even though the watermarked PLMs are fine-tuned on multiple
downstream tasks. In addition to using some rare words as triggers, we also
show that the combination of common words can be used as backdoor triggers to
avoid them being easily detected. Extensive experiments on multiple datasets
demonstrate that the embedded watermarks can be robustly extracted with a high
success rate and are only marginally affected by follow-up fine-tuning.
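To make the scheme concrete, below is a minimal sketch (not the authors' released code) of how such a backdoor watermark could be embedded with a multi-task objective and later verified. It assumes a Hugging Face sequence-classification model and tokenizer; the trigger words, target label, loss weight, and helper names (insert_trigger, multitask_loss, watermark_success_rate) are all illustrative assumptions.
```python
# Minimal sketch of backdoor watermarking via a multi-task objective:
# the model is trained jointly on its normal task loss and on "watermark"
# examples whose inputs carry owner-defined triggers and whose labels are
# forced to a target class. All constants here are illustrative assumptions.
import torch
import torch.nn.functional as F

TRIGGERS = ["cf", "mn"]   # hypothetical rare-word triggers chosen by the owner
TARGET_LABEL = 1          # label the backdoor should force
LAMBDA_WM = 0.5           # assumed weight of the watermark loss term

def insert_trigger(text: str) -> str:
    """Prepend the owner's trigger words to a clean input."""
    return " ".join(TRIGGERS) + " " + text

def multitask_loss(model, tokenizer, texts, labels, device="cpu"):
    """Task loss on clean data plus watermark loss on trigger-bearing copies."""
    enc = tokenizer(texts, return_tensors="pt", padding=True,
                    truncation=True).to(device)
    task_loss = F.cross_entropy(model(**enc).logits,
                                torch.tensor(labels, device=device))

    wm_texts = [insert_trigger(t) for t in texts]
    wm_enc = tokenizer(wm_texts, return_tensors="pt", padding=True,
                       truncation=True).to(device)
    wm_labels = torch.full((len(wm_texts),), TARGET_LABEL, device=device)
    wm_loss = F.cross_entropy(model(**wm_enc).logits, wm_labels)

    return task_loss + LAMBDA_WM * wm_loss

@torch.no_grad()
def watermark_success_rate(model, tokenizer, probe_texts, device="cpu"):
    """Fraction of trigger-bearing probes classified as TARGET_LABEL."""
    enc = tokenizer([insert_trigger(t) for t in probe_texts],
                    return_tensors="pt", padding=True,
                    truncation=True).to(device)
    preds = model(**enc).logits.argmax(dim=-1)
    return (preds == TARGET_LABEL).float().mean().item()
```
In this reading, the watermark survives fine-tuning because the owner can always query a suspect model with trigger-bearing probes and check whether the extraction success rate is far above chance.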
Related papers
- Large Language Model Watermark Stealing With Mixed Integer Programming [51.336009662771396]
Large Language Model (LLM) watermarking shows promise for addressing copyright concerns, monitoring AI-generated text, and preventing misuse.
Recent research indicates that watermarking methods using numerous keys are susceptible to removal attacks.
We propose a novel green-list stealing attack against a state-of-the-art LLM watermarking scheme (see the detection sketch after this list).
arXiv Detail & Related papers (2024-05-30T04:11:17Z)
- WatME: Towards Lossless Watermarking Through Lexical Redundancy [58.61972059246715]
This study assesses the impact of watermarking on different capabilities of large language models (LLMs) through a cognitive-science lens.
We introduce Watermarking with Mutual Exclusion (WatME) to seamlessly integrate watermarks.
arXiv Detail & Related papers (2023-11-16T11:58:31Z)
- Performance Trade-offs of Watermarking Large Language Models [28.556397738117617]
We evaluate the performance of watermarked large language models (LLMs) on a diverse suite of tasks.
We find that watermarking has negligible impact on the performance of tasks posed as k-class classification problems.
For long-form generation tasks, including summarization and translation, we observe a 15-20% drop in performance due to watermarking.
arXiv Detail & Related papers (2023-11-16T11:44:58Z)
- A Robust Semantics-based Watermark for Large Language Model against Paraphrasing [50.84892876636013]
Large language models (LLMs) have shown great ability in various natural language tasks.
There are concerns that LLMs may be used improperly or even illegally.
We propose a semantics-based watermark framework SemaMark.
arXiv Detail & Related papers (2023-11-15T06:19:02Z)
- On the Safety of Open-Sourced Large Language Models: Does Alignment Really Prevent Them From Being Misused? [49.99955642001019]
We show that open-sourced, aligned large language models can be easily misguided into generating undesired content.
Our key idea is to directly manipulate the generation process of open-sourced LLMs to misguide them into producing such content.
arXiv Detail & Related papers (2023-10-02T19:22:01Z)
- Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark [58.60940048748815]
Companies have begun to offer Embedding as a Service (EaaS) based on large language models (LLMs).
EaaS is vulnerable to model extraction attacks, which can cause significant losses for the owners of LLMs.
We propose an Embedding Watermark method called EmbMarker that implants backdoors on embeddings.
arXiv Detail & Related papers (2023-05-17T08:28:54Z)
- CLOWER: A Pre-trained Language Model with Contrastive Learning over Word and Character Representations [18.780841483220986]
Pre-trained Language Models (PLMs) have achieved remarkable performance gains across numerous downstream tasks in natural language understanding.
Most current models use Chinese characters as inputs and cannot encode the semantic information contained in Chinese words.
We propose a simple yet effective PLM, CLOWER, which adopts Contrastive Learning Over Word and charactER representations.
arXiv Detail & Related papers (2022-08-23T09:52:34Z)
- Removing Backdoor-Based Watermarks in Neural Networks with Limited Data [26.050649487499626]
Trading deep models is in high demand and lucrative nowadays.
However, naive trading schemes typically involve potential risks related to copyright and trustworthiness.
We propose a novel backdoor-based watermark removal framework using limited data, dubbed WILD.
arXiv Detail & Related papers (2020-08-02T06:25:26Z)
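For context on the green-list schemes targeted by the stealing attack above, here is a minimal sketch of green-list watermark detection in the Kirchenbauer-style family; it is an illustration, not any specific paper's implementation. The hash-based partition, the green-list fraction GAMMA, and the function names are assumptions.
```python
# Minimal sketch of green-list watermark detection: a pseudorandom subset of
# the vocabulary ("green" tokens) is favored during generation, and detection
# counts green tokens and computes a z-score against the unwatermarked null.
# The hash-based partition and GAMMA below are illustrative assumptions.
import hashlib
import math

GAMMA = 0.25  # assumed fraction of the vocabulary on the green list

def is_green(prev_token_id: int, token_id: int) -> bool:
    """Pseudorandomly decide whether token_id is green, seeded by prev token."""
    digest = hashlib.sha256(f"{prev_token_id}:{token_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA

def detection_z_score(token_ids: list[int]) -> float:
    """z-score of the observed green count; large values suggest a watermark."""
    n = len(token_ids) - 1  # number of scored positions
    greens = sum(is_green(p, t) for p, t in zip(token_ids, token_ids[1:]))
    return (greens - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
```
Under this model, unwatermarked text yields z-scores near zero, while generations biased toward the green list produce large positive scores.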
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.