Related papers: Beyond Dataset Watermarking: Model-Level Copyright Protection for Code Summarization Models

Beyond Dataset Watermarking: Model-Level Copyright Protection for Code Summarization Models

URL: http://arxiv.org/abs/2410.14102v1
Date: Fri, 18 Oct 2024 00:48:00 GMT
Title: Beyond Dataset Watermarking: Model-Level Copyright Protection for Code Summarization Models
Authors: Jiale Zhang, Haoxuan Li, Di Wu, Xiaobing Sun, Qinghua Lu, Guodong Long,
Abstract summary: CSMs face risks of exploitation by unauthorized users. Traditional watermarking methods require separate design of triggers and watermark features. We propose ModMark, a novel model-level digital watermark embedding method.
Score: 37.817691840557984
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Code Summarization Model (CSM) has been widely used in code production, such as online and web programming for PHP and Javascript. CSMs are essential tools in code production, enhancing software development efficiency and driving innovation in automated code analysis. However, CSMs face risks of exploitation by unauthorized users, particularly in an online environment where CSMs can be easily shared and disseminated. To address these risks, digital watermarks offer a promising solution by embedding imperceptible signatures within the models to assert copyright ownership and track unauthorized usage. Traditional watermarking for CSM copyright protection faces two main challenges: 1) dataset watermarking methods require separate design of triggers and watermark features based on the characteristics of different programming languages, which not only increases the computation complexity but also leads to a lack of generalization, 2) existing watermarks based on code style transformation are easily identifiable by automated detection, demonstrating poor concealment. To tackle these issues, we propose ModMark , a novel model-level digital watermark embedding method. Specifically, by fine-tuning the tokenizer, ModMark achieves cross-language generalization while reducing the complexity of watermark design. Moreover, we employ code noise injection techniques to effectively prevent trigger detection. Experimental results show that our method can achieve 100% watermark verification rate across various programming languages' CSMs, and the concealment and effectiveness of ModMark can also be guaranteed.

Related papers

Hot-Swap MarkBoard: An Efficient Black-box Watermarking Approach for Large-scale Model Distribution [14.60627694687767]
We propose Hot-Swap MarkBoard, an efficient watermarking method.<n>It encodes user-specific $n$-bit binary signatures by independently embedding multiple watermarks.<n>The method supports black-box verification and is compatible with various model architectures.
arXiv Detail & Related papers (2025-07-28T09:14:21Z)
TAG-WM: Tamper-Aware Generative Image Watermarking via Diffusion Inversion Sensitivity [68.95168727940973]
Tamper-Aware Generative image WaterMarking method named TAG-WM.<n>This paper proposes a Tamper-Aware Generative image WaterMarking method named TAG-WM.
arXiv Detail & Related papers (2025-06-30T03:14:07Z)
Towards Generalized and Stealthy Watermarking for Generative Code Models [35.78974773421725]
Experimental results demonstrate that CodeGuard achieves up to 100% watermark verification rates in both code summarization and code generation tasks.<n>In terms of stealthiness, CodeGuard performs exceptionally, with a maximum detection rate of only 0.078 against ONION detection methods.
arXiv Detail & Related papers (2025-06-26T01:14:35Z)
Robust and Secure Code Watermarking for Large Language Models via ML/Crypto Codesign [15.153228808457628]
RoSeMary regulates LLM-generated code to avoid intellectual property rights violations and inappropriate misuse in software development. High-quality watermarks adhering to the detectability-fidelity-robustness tri-objective are limited due to codes' low-entropy nature. RoSeMary achieves high detection accuracy while preserving the code functionality.
arXiv Detail & Related papers (2025-02-04T07:35:28Z)
Watermarking Large Language Models and the Generated Content: Opportunities and Challenges [18.01886375229288]
generative large language models (LLMs) have raised concerns about intellectual property rights violations and the spread of machine-generated misinformation. Watermarking serves as a promising approch to establish ownership, prevent unauthorized use, and trace the origins of LLM-generated content. This paper summarizes and shares the challenges and opportunities we found when watermarking LLMs.
arXiv Detail & Related papers (2024-10-24T18:55:33Z)
ESpeW: Robust Copyright Protection for LLM-based EaaS via Embedding-Specific Watermark [50.08021440235581]
Embeds as a Service (Eding) is emerging as a crucial role in AI applications. Eding is vulnerable to model extraction attacks, highlighting the urgent need for copyright protection. We propose a novel embedding-specific watermarking (ESpeW) mechanism to offer robust copyright protection for Eding.
arXiv Detail & Related papers (2024-10-23T04:34:49Z)
De-mark: Watermark Removal in Large Language Models [59.00698153097887]
We present De-mark, an advanced framework designed to remove n-gram-based watermarks effectively. Our method utilizes a novel querying strategy, termed random selection probing, which aids in assessing the strength of the watermark.
arXiv Detail & Related papers (2024-10-17T17:42:10Z)
DIP-Watermark: A Double Identity Protection Method Based on Robust Adversarial Watermark [13.007649270429493]
Face Recognition (FR) systems pose privacy risks. One countermeasure is adversarial attack, deceiving unauthorized malicious FR. We propose the first double identity protection scheme based on traceable adversarial watermarking.
arXiv Detail & Related papers (2024-04-23T02:50:38Z)
Is The Watermarking Of LLM-Generated Code Robust? [5.48277165801539]
We show that watermarking techniques are significantly more fragile in code-based contexts. Specifically, we show that simple semantic-preserving transformations, such as variable renaming and dead code insertion, can effectively erase watermarks.
arXiv Detail & Related papers (2024-03-24T21:41:29Z)
No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices [20.20770405297239]
We show that common design choices in LLM watermarking schemes make the resulting systems surprisingly susceptible to attack. We propose guidelines and defenses for LLM watermarking in practice.
arXiv Detail & Related papers (2024-02-25T20:24:07Z)
ClearMark: Intuitive and Robust Model Watermarking via Transposed Model Training [50.77001916246691]
This paper introduces ClearMark, the first DNN watermarking method designed for intuitive human assessment. ClearMark embeds visible watermarks, enabling human decision-making without rigid value thresholds. It shows an 8,544-bit watermark capacity comparable to the strongest existing work.
arXiv Detail & Related papers (2023-10-25T08:16:55Z)
Who Wrote this Code? Watermarking for Code Generation [53.24895162874416]
We propose Selective WatErmarking via Entropy Thresholding (SWEET) to detect machine-generated text. Our experiments show that SWEET significantly improves code quality preservation while outperforming all baselines.
arXiv Detail & Related papers (2023-05-24T11:49:52Z)
Fine-tuning Is Not Enough: A Simple yet Effective Watermark Removal Attack for DNN Models [72.9364216776529]
We propose a novel watermark removal attack from a different perspective. We design a simple yet powerful transformation algorithm by combining imperceptible pattern embedding and spatial-level transformations. Our attack can bypass state-of-the-art watermarking solutions with very high success rates.
arXiv Detail & Related papers (2020-09-18T09:14:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.