Related papers: Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?

Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?

URL: http://arxiv.org/abs/2502.11598v1
Date: Mon, 17 Feb 2025 09:34:19 GMT
Title: Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?
Authors: Leyi Pan, Aiwei Liu, Shiyu Huang, Yijian Lu, Xuming Hu, Lijie Wen, Irwin King, Philip S. Yu,
Abstract summary: This paper investigates whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance.<n>We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN)
Score: 75.99961894619986
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies. Our code is available at https://github.com/THU-BPM/Watermark-Radioactivity-Attack.

Related papers

DeepTracer: Tracing Stolen Model via Deep Coupled Watermarks [14.552367035706283]
We introduce a robust watermarking framework, DeepTracer, which leverages a novel watermark samples construction method and a same-class coupling loss constraint.<n>DeepTracer can incur a high-coupling model between watermark task and primary task that makes adversaries learn the hidden watermark task when stealing the primary task functionality.<n>We propose an effective watermark samples filtering mechanism that elaborately select watermark key samples used in model ownership verification to enhance the reliability of watermarks.
arXiv Detail & Related papers (2025-11-12T05:06:25Z)
Radioactive Watermarks in Diffusion and Autoregressive Image Generative Models [4.303412065407284]
We analyze the radioactivity of watermarks in images generated by diffusion models (DMs) and image autoregressive models (IARs)<n>We propose the first watermarking method tailored for IARs and with radioactivity in mind.
arXiv Detail & Related papers (2025-06-30T11:08:10Z)
Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models [52.877452505561706]
We propose the first copyright evasion attack specifically designed to undermine dataset ownership verification (DOV)<n>Our CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation.<n>Our experiments show that our CEAT2I effectively evades DOV mechanisms while preserving model performance.
arXiv Detail & Related papers (2025-05-05T17:51:55Z)
Unified Attacks to Large Language Model Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge Distillation [18.226065936358502]
We propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD) as a unified framework that enables bidirectional attacks under unauthorized knowledge distillation. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts via comparing outputs from the student model and weakly watermarked references. Our findings underscore critical need for developing watermarking schemes that are robust and unforgeable.
arXiv Detail & Related papers (2025-04-24T12:15:46Z)
ROBIN: Robust and Invisible Watermarks for Diffusion Models with Adversarial Optimization [15.570148419846175]
Existing watermarking methods face the challenge of balancing robustness and concealment. This paper introduces a watermark hiding process to actively achieve concealment, thus allowing the embedding of stronger watermarks. Experiments on various diffusion models demonstrate the watermark remains verifiable even under significant image tampering.
arXiv Detail & Related papers (2024-11-06T12:14:23Z)
Turning Your Strength into Watermark: Watermarking Large Language Model via Knowledge Injection [66.26348985345776]
We propose a novel watermarking method for large language models (LLMs) based on knowledge injection. In the watermark embedding stage, we first embed the watermarks into the selected knowledge to obtain the watermarked knowledge. In the watermark extraction stage, questions related to the watermarked knowledge are designed, for querying the suspect LLM. Experiments show that the watermark extraction success rate is close to 100% and demonstrate the effectiveness, fidelity, stealthiness, and robustness of our proposed method.
arXiv Detail & Related papers (2023-11-16T03:22:53Z)
Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks [47.04650443491879]
We analyze the robustness of various AI-image detectors including watermarking and deepfake detectors. We show that watermarking methods are vulnerable to spoofing attacks where the attacker aims to have real images identified as watermarked ones.
arXiv Detail & Related papers (2023-09-29T18:30:29Z)
Unbiased Watermark for Large Language Models [67.43415395591221]
This study examines how significantly watermarks impact the quality of model-generated outputs. It is possible to integrate watermarks without affecting the output probability distribution. The presence of watermarks does not compromise the performance of the model in downstream tasks.
arXiv Detail & Related papers (2023-09-22T12:46:38Z)
Towards Robust Model Watermark via Reducing Parametric Vulnerability [57.66709830576457]
backdoor-based ownership verification becomes popular recently, in which the model owner can watermark the model. We propose a mini-max formulation to find these watermark-removed models and recover their watermark behavior. Our method improves the robustness of the model watermarking against parametric changes and numerous watermark-removal attacks.
arXiv Detail & Related papers (2023-09-09T12:46:08Z)
Certified Neural Network Watermarks with Randomized Smoothing [64.86178395240469]
We propose a certifiable watermarking method for deep learning models. We show that our watermark is guaranteed to be unremovable unless the model parameters are changed by more than a certain l2 threshold. Our watermark is also empirically more robust compared to previous watermarking methods.
arXiv Detail & Related papers (2022-07-16T16:06:59Z)
Fine-tuning Is Not Enough: A Simple yet Effective Watermark Removal Attack for DNN Models [72.9364216776529]
We propose a novel watermark removal attack from a different perspective. We design a simple yet powerful transformation algorithm by combining imperceptible pattern embedding and spatial-level transformations. Our attack can bypass state-of-the-art watermarking solutions with very high success rates.
arXiv Detail & Related papers (2020-09-18T09:14:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.