Related papers: Unified Attacks to Large Language Model Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge Distillation

Unified Attacks to Large Language Model Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge Distillation

URL: http://arxiv.org/abs/2504.17480v1
Date: Thu, 24 Apr 2025 12:15:46 GMT
Title: Unified Attacks to Large Language Model Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge Distillation
Authors: Xin Yi, Shunfan Zhengc, Linlin Wanga, Xiaoling Wang, Liang He,
Abstract summary: We propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD) as a unified framework that enables bidirectional attacks under unauthorized knowledge distillation.<n>Our approach employs contrastive decoding to extract corrupted or amplified watermark texts via comparing outputs from the student model and weakly watermarked references.<n>Our findings underscore critical need for developing watermarking schemes that are robust and unforgeable.
Score: 18.226065936358502
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Watermarking has emerged as a critical technique for combating misinformation and protecting intellectual property in large language models (LLMs). A recent discovery, termed watermark radioactivity, reveals that watermarks embedded in teacher models can be inherited by student models through knowledge distillation. On the positive side, this inheritance allows for the detection of unauthorized knowledge distillation by identifying watermark traces in student models. However, the robustness of watermarks against scrubbing attacks and their unforgeability in the face of spoofing attacks under unauthorized knowledge distillation remain largely unexplored. Existing watermark attack methods either assume access to model internals or fail to simultaneously support both scrubbing and spoofing attacks. In this work, we propose Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified framework that enables bidirectional attacks under unauthorized knowledge distillation. Our approach employs contrastive decoding to extract corrupted or amplified watermark texts via comparing outputs from the student model and weakly watermarked references, followed by bidirectional distillation to train new student models capable of watermark removal and watermark forgery, respectively. Extensive experiments show that CDG-KD effectively performs attacks while preserving the general performance of the distilled model. Our findings underscore critical need for developing watermarking schemes that are robust and unforgeable.

Related papers

DeepTracer: Tracing Stolen Model via Deep Coupled Watermarks [14.552367035706283]
We introduce a robust watermarking framework, DeepTracer, which leverages a novel watermark samples construction method and a same-class coupling loss constraint.<n>DeepTracer can incur a high-coupling model between watermark task and primary task that makes adversaries learn the hidden watermark task when stealing the primary task functionality.<n>We propose an effective watermark samples filtering mechanism that elaborately select watermark key samples used in model ownership verification to enhance the reliability of watermarks.
arXiv Detail & Related papers (2025-11-12T05:06:25Z)
DITTO: A Spoofing Attack Framework on Watermarked LLMs via Knowledge Distillation [7.366060554733598]
We introduce the threat of watermark spoofing, a sophisticated attack that allows a malicious model to generate text containing the authentic-looking watermark of a trusted, victim model.<n>By distilling knowledge from a watermarked teacher model, our framework allows an attacker to steal and replicate the watermarking signal of the victim model.<n>This work reveals a critical security gap in text authorship verification and calls for a paradigm shift towards technologies capable of distinguishing authentic watermarks from expertly imitated ones.
arXiv Detail & Related papers (2025-10-13T03:53:40Z)
Towards Dataset Copyright Evasion Attack against Personalized Text-to-Image Diffusion Models [52.877452505561706]
We propose the first copyright evasion attack specifically designed to undermine dataset ownership verification (DOV)<n>Our CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation.<n>Our experiments show that our CEAT2I effectively evades DOV mechanisms while preserving model performance.
arXiv Detail & Related papers (2025-05-05T17:51:55Z)
Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation? [75.99961894619986]
This paper investigates whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance.<n>We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN)
arXiv Detail & Related papers (2025-02-17T09:34:19Z)
Robustness of Watermarking on Text-to-Image Diffusion Models [9.277492743469235]
We investigate the robustness of generative watermarking, which is created from the integration of watermarking embedding and text-to-image generation processing. We found that generative watermarking methods are robust to direct evasion attacks, like discriminator-based attacks, or manipulation based on the edge information in edge prediction-based attacks but vulnerable to malicious fine-tuning.
arXiv Detail & Related papers (2024-08-04T13:59:09Z)
Reliable Model Watermarking: Defending Against Theft without Compromising on Evasion [15.086451828825398]
evasion adversaries can readily exploit the shortcuts created by models memorizing watermark samples. By learning the model to accurately recognize them, unique watermark behaviors are promoted through knowledge injection.
arXiv Detail & Related papers (2024-04-21T03:38:20Z)
RAW: A Robust and Agile Plug-and-Play Watermark Framework for AI-Generated Images with Provable Guarantees [33.61946642460661]
This paper introduces a robust and agile watermark detection framework, dubbed as RAW. We employ a classifier that is jointly trained with the watermark to detect the presence of the watermark. We show that the framework provides provable guarantees regarding the false positive rate for misclassifying a watermarked image.
arXiv Detail & Related papers (2024-01-23T22:00:49Z)
On the Learnability of Watermarks for Language Models [80.97358663708592]
We ask whether language models can directly learn to generate watermarked text. We propose watermark distillation, which trains a student model to behave like a teacher model. We find that models can learn to generate watermarked text with high detectability.
arXiv Detail & Related papers (2023-12-07T17:41:44Z)
Robustness of AI-Image Detectors: Fundamental Limits and Practical Attacks [47.04650443491879]
We analyze the robustness of various AI-image detectors including watermarking and deepfake detectors. We show that watermarking methods are vulnerable to spoofing attacks where the attacker aims to have real images identified as watermarked ones.
arXiv Detail & Related papers (2023-09-29T18:30:29Z)
Towards Robust Model Watermark via Reducing Parametric Vulnerability [57.66709830576457]
backdoor-based ownership verification becomes popular recently, in which the model owner can watermark the model. We propose a mini-max formulation to find these watermark-removed models and recover their watermark behavior. Our method improves the robustness of the model watermarking against parametric changes and numerous watermark-removal attacks.
arXiv Detail & Related papers (2023-09-09T12:46:08Z)
Certified Neural Network Watermarks with Randomized Smoothing [64.86178395240469]
We propose a certifiable watermarking method for deep learning models. We show that our watermark is guaranteed to be unremovable unless the model parameters are changed by more than a certain l2 threshold. Our watermark is also empirically more robust compared to previous watermarking methods.
arXiv Detail & Related papers (2022-07-16T16:06:59Z)
Deep Model Intellectual Property Protection via Deep Watermarking [122.87871873450014]
Deep neural networks are exposed to serious IP infringement risks. Given a target deep model, if the attacker knows its full information, it can be easily stolen by fine-tuning. We propose a new model watermarking framework for protecting deep networks trained for low-level computer vision or image processing tasks.
arXiv Detail & Related papers (2021-03-08T18:58:21Z)
Fine-tuning Is Not Enough: A Simple yet Effective Watermark Removal Attack for DNN Models [72.9364216776529]
We propose a novel watermark removal attack from a different perspective. We design a simple yet powerful transformation algorithm by combining imperceptible pattern embedding and spatial-level transformations. Our attack can bypass state-of-the-art watermarking solutions with very high success rates.
arXiv Detail & Related papers (2020-09-18T09:14:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.