Fugu-MT Paper Translation (Abstract): Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach

Paper abstract: Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach

arxiv url: http://arxiv.org/abs/2511.07033v1
Date: Mon, 10 Nov 2025 12:29:09 GMT
Status: Translation complete
Last updated in system: 2025-11-11 21:18:45.24054
Title: Uncovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach
Title (reference translation): Discovering Pretraining Code in LLMs: A Syntax-Aware Attribution Approach
Authors: Yuanheng Li, Zhuoyang Chen, Xiaoyun Liu, Yuhao Wang, Mingwei Liu, Yang Shi, Kaifeng Huang, Shengjie Zhao
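Abstract summary: Open-source code is often protected by open-source licenses, and it raises legal and ethical challenges when used in pretraining. We propose SynPrune, a syntax-pruned membership inference attack method tailored for code.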
Reference score (independently computed attention measure): 20.775027150345107
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As large language models (LLMs) become increasingly capable, concerns over the unauthorized use of copyrighted and licensed content in their training data have grown, especially in the context of code. Open-source code, often protected by open-source licenses (e.g., GPL), poses legal and ethical challenges when used in pretraining. Detecting whether specific code samples were included in LLM training data is thus critical for transparency, accountability, and copyright compliance. We propose SynPrune, a syntax-pruned membership inference attack (MIA) method tailored for code. Unlike prior MIA approaches that treat code as plain text, SynPrune leverages the structured and rule-governed nature of programming languages. Specifically, when computing membership scores, it identifies consequent tokens that are syntactically required and not reflective of authorship, and excludes them from attribution. Experimental results show that SynPrune consistently outperforms state-of-the-art methods. Our method is also robust across varying function lengths and syntax categories.
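Abstract (reference translation): As large language models (LLMs) grow more capable, concerns are rising about the unauthorized use of copyrighted and licensed content in their training data, particularly for code. Open-source code is often protected by open-source licenses (e.g., GPL), and it raises legal and ethical challenges when used in pretraining. Detecting whether specific code samples were included in LLM training data is therefore important for transparency, accountability, and copyright compliance. We propose SynPrune, a syntax-pruned membership inference attack method tailored for code. Unlike earlier MIA approaches that treat code as plain text, SynPrune exploits the structured, rule-governed nature of programming languages. Specifically, when computing membership scores, it identifies the consequent tokens that are syntactically required and do not reflect authorship, and excludes them from attribution. Experimental results show that SynPrune consistently outperforms the state of the art. The proposed method is also robust across function lengths and syntax categories.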
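The pruning idea in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration and not the authors' implementation: the names `consequent_mask` and `pruned_score` are hypothetical, the syntax rules are toy rules (matching closing brackets and the colon that ends a Python compound-statement header), and the scoring function is a Min-K%-style average of the lowest token log-probabilities, which the abstract does not specify. The per-token log-probabilities would normally come from the LLM under test; here they are supplied by hand.

```python
import math
from typing import List, Sequence

# Toy syntax rules (assumption): the abstract does not list the real categories.
OPENERS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(OPENERS.values())
HEADER_KEYWORDS = {"def", "class", "if", "elif", "else", "for", "while",
                   "try", "except", "finally", "with"}


def consequent_mask(tokens: Sequence[str]) -> List[bool]:
    """Return True for tokens treated as syntactically forced by earlier tokens."""
    mask = [False] * len(tokens)
    expected_closers: List[str] = []  # stack of closers owed by open brackets
    in_header = False                 # inside a compound-statement header
    for i, tok in enumerate(tokens):
        if tok in HEADER_KEYWORDS and not expected_closers:
            in_header = True
        elif tok in OPENERS:
            expected_closers.append(OPENERS[tok])
        elif tok in CLOSERS and expected_closers and expected_closers[-1] == tok:
            expected_closers.pop()
            mask[i] = True            # closing bracket is forced by its opener
        elif tok == ":" and in_header and not expected_closers:
            mask[i] = True            # header-ending colon is forced by the keyword
            in_header = False
    return mask


def pruned_score(logprobs: Sequence[float], mask: Sequence[bool], k: float = 0.2) -> float:
    """Average the lowest k fraction of log-probs over tokens that survive pruning."""
    if len(logprobs) != len(mask):
        raise ValueError("logprobs and mask must have the same length")
    kept = sorted(lp for lp, forced in zip(logprobs, mask) if not forced)
    if not kept:
        raise ValueError("no tokens left after pruning")
    n = max(1, math.ceil(k * len(kept)))
    return sum(kept[:n]) / n


# Hand-written example: tokens of `def add(a, b): return a + b`
tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
logprobs = [-2.0, -3.0, -0.1, -1.0, -0.2, -1.0, -0.05, -0.01, -0.5, -1.0, -1.5, -1.0]
mask = consequent_mask(tokens)
score = pruned_score(logprobs, mask, k=0.5)
```

In this sketch the closing parenthesis and the header colon are pruned, so their near-certain log-probabilities no longer influence the score. Under the assumed scoring, a higher score would suggest the sample is more likely a training member, and the decision threshold would have to be calibrated on known member and non-member samples.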


Related papers