Related papers: WAPITI: A Watermark for Finetuned Open-Source LLMs

WAPITI: A Watermark for Finetuned Open-Source LLMs

URL: http://arxiv.org/abs/2410.06467v1
Date: Wed, 9 Oct 2024 01:41:14 GMT
Title: WAPITI: A Watermark for Finetuned Open-Source LLMs
Authors: Lingjie Chen, Ruizhong Qiu, Siyu Yuan, Zhining Liu, Tianxin Wei, Hyunsik Yoo, Zhichen Zeng, Deqing Yang, Hanghang Tong,
Abstract summary: WAPITI is a new method that transfers watermarking from base models to fine-tuned models through parameter integration. We show that our method can successfully inject watermarks and is highly compatible with fine-tuned models.
Score: 42.1087852764299
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Watermarking of large language models (LLMs) generation embeds an imperceptible statistical pattern within texts, making it algorithmically detectable. Watermarking is a promising method for addressing potential harm and biases from LLMs, as it enables traceability, accountability, and detection of manipulated content, helping to mitigate unintended consequences. However, for open-source models, watermarking faces two major challenges: (i) incompatibility with fine-tuned models, and (ii) vulnerability to fine-tuning attacks. In this work, we propose WAPITI, a new method that transfers watermarking from base models to fine-tuned models through parameter integration. To the best of our knowledge, we propose the first watermark for fine-tuned open-source LLMs that preserves their fine-tuned capabilities. Furthermore, our approach offers an effective defense against fine-tuning attacks. We test our method on various model architectures and watermarking strategies. Results demonstrate that our method can successfully inject watermarks and is highly compatible with fine-tuned models. Additionally, we offer an in-depth analysis of how parameter editing influences the watermark strength and overall capabilities of the resulting models.

Related papers

Hot-Swap MarkBoard: An Efficient Black-box Watermarking Approach for Large-scale Model Distribution [14.60627694687767]
We propose Hot-Swap MarkBoard, an efficient watermarking method.<n>It encodes user-specific $n$-bit binary signatures by independently embedding multiple watermarks.<n>The method supports black-box verification and is compatible with various model architectures.
arXiv Detail & Related papers (2025-07-28T09:14:21Z)
Optimization-Free Universal Watermark Forgery with Regenerative Diffusion Models [50.73220224678009]
Watermarking can be used to verify the origin of synthetic images generated by artificial intelligence models.<n>Recent studies demonstrate the capability to forge watermarks from a target image onto cover images via adversarial techniques.<n>In this paper, we uncover a greater risk of an optimization-free and universal watermark forgery.<n>Our approach significantly broadens the scope of attacks, presenting a greater challenge to the security of current watermarking techniques.
arXiv Detail & Related papers (2025-06-06T12:08:02Z)
Can you Finetune your Binoculars? Embedding Text Watermarks into the Weights of Large Language Models [33.051248579713736]
Indistinguishability of AI-generated content from human text raises challenges in transparency and accountability. We propose a strategy to finetune a pair of low-rank adapters of a model, one serving as the text-generating model, and the other as the detector. In this way, the watermarking strategy is fully learned end-to-end.
arXiv Detail & Related papers (2025-04-08T21:34:02Z)
Provably Robust Watermarks for Open-Source Language Models [5.509756888700397]
We introduce the first watermarking scheme for open-source language models. Our scheme works by modifying the parameters of the model, but the watermark can be detected from just the outputs of the model. Perhaps surprisingly, we prove that our watermarks are unremovable under certain assumptions about the adversary's knowledge.
arXiv Detail & Related papers (2024-10-24T15:44:34Z)
Robustness of Watermarking on Text-to-Image Diffusion Models [9.277492743469235]
We investigate the robustness of generative watermarking, which is created from the integration of watermarking embedding and text-to-image generation processing. We found that generative watermarking methods are robust to direct evasion attacks, like discriminator-based attacks, or manipulation based on the edge information in edge prediction-based attacks but vulnerable to malicious fine-tuning.
arXiv Detail & Related papers (2024-08-04T13:59:09Z)
AquaLoRA: Toward White-box Protection for Customized Stable Diffusion Models via Watermark LoRA [67.68750063537482]
Diffusion models have achieved remarkable success in generating high-quality images. Recent works aim to let SD models output watermarked content for post-hoc forensics. We propose textttmethod as the first implementation under this scenario.
arXiv Detail & Related papers (2024-05-18T01:25:47Z)
ModelShield: Adaptive and Robust Watermark against Model Extraction Attack [58.46326901858431]
Large language models (LLMs) demonstrate general intelligence across a variety of machine learning tasks. adversaries can still utilize model extraction attacks to steal the model intelligence encoded in model generation. Watermarking technology offers a promising solution for defending against such attacks by embedding unique identifiers into the model-generated content.
arXiv Detail & Related papers (2024-05-03T06:41:48Z)
Wide Flat Minimum Watermarking for Robust Ownership Verification of GANs [23.639074918667625]
We propose a novel multi-bit box-free watermarking method for GANs with improved robustness against white-box attacks. The watermark is embedded by adding an extra watermarking loss term during GAN training. We show that the presence of the watermark has a negligible impact on the quality of the generated images.
arXiv Detail & Related papers (2023-10-25T18:38:10Z)
Unbiased Watermark for Large Language Models [67.43415395591221]
This study examines how significantly watermarks impact the quality of model-generated outputs. It is possible to integrate watermarks without affecting the output probability distribution. The presence of watermarks does not compromise the performance of the model in downstream tasks.
arXiv Detail & Related papers (2023-09-22T12:46:38Z)
Towards Robust Model Watermark via Reducing Parametric Vulnerability [57.66709830576457]
backdoor-based ownership verification becomes popular recently, in which the model owner can watermark the model. We propose a mini-max formulation to find these watermark-removed models and recover their watermark behavior. Our method improves the robustness of the model watermarking against parametric changes and numerous watermark-removal attacks.
arXiv Detail & Related papers (2023-09-09T12:46:08Z)
Fine-tuning Is Not Enough: A Simple yet Effective Watermark Removal Attack for DNN Models [72.9364216776529]
We propose a novel watermark removal attack from a different perspective. We design a simple yet powerful transformation algorithm by combining imperceptible pattern embedding and spatial-level transformations. Our attack can bypass state-of-the-art watermarking solutions with very high success rates.
arXiv Detail & Related papers (2020-09-18T09:14:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.