Fugu-MT Paper Translation (Abstract): PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models

Paper overview: PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models

arxiv url: http://arxiv.org/abs/2604.27677v1
Date: Thu, 30 Apr 2026 10:10:11 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:54.038757
Title: PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models
Title (reference translation): PuzzleMark: Implicit Jigsaw Learning for Robust Code Dataset Watermarking in Neural Code Completion Models
Authors: Haocheng Huang, Yuchen Chen, Weisong Sun, Peizhuo Lv, Yuan Xiao, Chunrong Fang, Yang Liu, Xiaofang Zhang
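Abstract summary: This paper proposes PuzzleMark, a robust watermarking method for code datasets. In a human study and an evaluation with four watermark detection methods, PuzzleMark shows an average suspicious rate $\leq$ 0.24 and an average recall $\leq$ 30.41%, respectively.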
Reference score (independently computed attention score): 19.242274533804842
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Constructing and curating high-quality code datasets requires significant resources, making them valuable intellectual property. Unfortunately, these datasets currently face severe risks of unauthorized use. Although digital watermarking offers a post hoc mechanism for copyright authentication, existing methods are predominantly based on the co-occurrence pattern, which is not robust and is susceptible to watermark detection and removal attacks. In this paper, we propose PuzzleMark, a robust watermarking method for code datasets. To reduce the risk of watermark exposure, PuzzleMark introduces a carrier selection strategy that leverages code complexity to evaluate the suitability of code snippets as watermark carriers, and selects those with high suitability for watermarking. To enhance the robustness of the watermark, PuzzleMark proposes a novel concatenation pattern to replace the traditional co-occurrence pattern, and implements two watermarking strategies through variable name concatenation. PuzzleMark adaptively embeds watermarks based on the inherent characteristics of the code, making it more stealthy while maintaining design simplicity. For watermark verification, PuzzleMark employs Fisher's exact test to verify suspicious models under a black-box setting. Experimental results demonstrate that PuzzleMark achieves a 100% verification success rate and a 0% false positive rate, with negligible impact on model performance. Both our human study and our evaluation using four state-of-the-art watermark detection methods show that PuzzleMark exhibits strong imperceptibility, with an average suspicious rate $\leq$ 0.24 and an average recall $\leq$ 30.41%, respectively. As a practical digital watermarking method, PuzzleMark provides strong protection for the intellectual property of code datasets and offers new insights for future research.
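Abstract (reference translation): Constructing and curating high-quality code datasets requires significant resources. Unfortunately, these datasets currently face serious risks of unauthorized use. Digital watermarking provides a post hoc mechanism for copyright authentication, but existing methods are mainly based on the co-occurrence pattern, which is not robust and is vulnerable to watermark detection and removal attacks. This paper proposes PuzzleMark, a robust watermarking method for code datasets. To reduce the risk of watermark exposure, PuzzleMark introduces a carrier selection strategy that uses code complexity to evaluate how suitable each code snippet is as a watermark carrier and selects the most suitable ones. To improve watermark robustness, PuzzleMark proposes a new concatenation pattern that replaces the traditional co-occurrence pattern, and implements two watermarking strategies through variable name concatenation. PuzzleMark embeds watermarks adaptively based on the inherent characteristics of the code, which makes it stealthier while keeping the design simple. For watermark verification, PuzzleMark uses Fisher's exact test to verify suspicious models under a black-box setting. Experimental results show that PuzzleMark achieves a 100% verification success rate and a 0% false positive rate, with negligible impact on model performance. Both the human study and the evaluation with four state-of-the-art watermark detection methods show that PuzzleMark has strong imperceptibility, with an average suspicious rate $\leq$ 0.24 and an average recall $\leq$ 30.41%, respectively. As a practical digital watermarking method, PuzzleMark provides strong protection for the intellectual property of code datasets and offers new insights for future research.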
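
The abstract states that verification uses Fisher's exact test under a black-box setting but gives no further detail. The following is a minimal sketch of how such a test could be applied, not the authors' protocol. It assumes a 2x2 contingency table whose rows are prompts with and without a watermark trigger and whose columns count completions that do or do not contain the watermark target. The function names, the example counts, and the 0.05 significance level are all hypothetical.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a), where X follows the hypergeometric distribution
    obtained by fixing the row and column totals of the table.
    """
    row1 = a + b          # total in the first row
    col1 = a + c          # total in the first column
    n = a + b + c + d     # grand total
    upper = min(row1, col1)
    favourable = sum(
        comb(col1, x) * comb(n - col1, row1 - x) for x in range(a, upper + 1)
    )
    return favourable / comb(n, row1)


def verify(trigger_hits: int, trigger_total: int,
           control_hits: int, control_total: int,
           alpha: float = 0.05) -> tuple[bool, float]:
    """Hypothetical verification step: flag the model as watermarked if
    trigger prompts yield the target significantly more often than controls.
    The table layout and alpha are assumptions, not taken from the paper.
    """
    p = fisher_exact_greater(
        trigger_hits, trigger_total - trigger_hits,
        control_hits, control_total - control_hits,
    )
    return p < alpha, p


if __name__ == "__main__":
    # Made-up counts for illustration only.
    print(verify(38, 50, 2, 50))  # strong association between trigger and target
    print(verify(3, 50, 2, 50))   # no meaningful association
```

Only query access to the suspicious model is needed to fill in the counts, which is what makes a test of this kind usable in a black-box setting. The implementation uses only the standard library so that the arithmetic stays visible; `scipy.stats.fisher_exact` with `alternative="greater"` computes the same one-sided p-value.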

Related papers