Fugu-MT Paper Translation (Abstract): Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

Paper abstract: Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets

arxiv url: http://arxiv.org/abs/2605.28510v1
Date: Wed, 27 May 2026 14:12:17 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:56.098975
Title: Efficient and Scalable Provenance Tracking for LLM-Generated Code Snippets
Authors: Andrea Gurioli, Davide D'Ascenzo, Federico Pennino, Maurizio Gabbrielli, Stefano Zacchiroli
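Abstract summary: Large language models (LLMs) for code completion and generation can reproduce training examples verbatim and without authorship attribution. Classical fingerprint-based plagiarism detectors such as Winnowing remain effective. The paper introduces a 300M-parameter encoder tailored for code retrieval and a hybrid two-stage provenance-tracking pipeline.
Reference score (attention score computed independently by this site): 3.5312864406384485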
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) for code completion and generation are increasingly used in software development, yet they may reproduce training examples verbatim and without authorship attribution, raising legal and ethical concerns around plagiarism and license compliance. Classical fingerprint-based plagiarism detectors, such as Winnowing, remain highly effective, but they must compare each code fragment against the entire training set, and this linear-time search makes them impractical for the billion-scale corpora used to train modern code LLMs. To bridge this gap, we introduce SOURCETRACKER, a 300M-parameter encoder tailored for code retrieval, together with a hybrid two-stage provenance-tracking pipeline, HYBRIDSOURCETRACKER (HST). HST first narrows the search to a small set of candidate snippets via vector search, then re-ranks those candidates using Winnowing on exact fingerprints. We train and evaluate our system on a 10M-snippet subset of the THESTACKV2 dataset, with both verbatim and adapted snippets that emulate realistic identifier renaming. On an in vitro 100k-snippet search space with adapted queries, our hybrid approach reaches a mean reciprocal rank on par with Winnowing for 30-token fragments. For windows of 60 tokens or more, it consistently outperforms Winnowing by up to 5.4% while preserving logarithmic-time query complexity. In a complementary evaluation using an LLM-based judge, we find that many retrieved snippets not labeled as ground truth are still highly similar to the expected sources, particularly with longer context windows, and thus remain useful for end users. Overall, our results demonstrate that integrating vector search with fingerprinting enables scalable, high-precision provenance tracking for code produced by LLMs.
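The two-stage idea described in the abstract (vector search to shortlist candidates, then Winnowing to re-rank them) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `embed` function is a toy character-trigram stand-in for the SOURCETRACKER encoder, stage 1 scans the corpus exhaustively rather than querying an approximate nearest-neighbor index (so it does not reproduce the logarithmic-time behavior), and the names `hybrid_search`, `fingerprint_score`, `CORPUS`, and `QUERY` as well as the parameters `k`, `w`, and `top_k` are assumptions chosen for illustration.

```python
import hashlib
import math
from collections import Counter


def kgram_hashes(text, k=5):
    """Hash every k-character substring of the text with whitespace removed."""
    s = "".join(text.split())
    return [
        int(hashlib.md5(s[i:i + k].encode()).hexdigest()[:8], 16)
        for i in range(len(s) - k + 1)
    ]


def winnow(text, k=5, w=4):
    """Winnowing fingerprints: the minimum hash of each window of w k-gram hashes.

    Only the selected hash values are kept; positions are dropped for brevity.
    """
    hashes = kgram_hashes(text, k)
    if not hashes:
        return set()
    if len(hashes) < w:
        return {min(hashes)}
    return {min(hashes[i:i + w]) for i in range(len(hashes) - w + 1)}


def fingerprint_score(query, candidate):
    """Fraction of the query's fingerprints that also occur in the candidate."""
    q, c = winnow(query), winnow(candidate)
    return len(q & c) / len(q) if q else 0.0


def embed(text, n=3):
    """Toy embedding (assumption): bag of character trigrams, not a neural encoder."""
    s = " ".join(text.split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[t] for t, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def hybrid_search(query, corpus, top_k=2):
    """Stage 1: shortlist by vector similarity. Stage 2: re-rank by fingerprints."""
    q_vec = embed(query)
    # Exhaustive scan for simplicity; a real system would query a vector index.
    shortlist = sorted(
        corpus, key=lambda doc: cosine(q_vec, embed(doc)), reverse=True
    )[:top_k]
    return sorted(
        shortlist, key=lambda doc: fingerprint_score(query, doc), reverse=True
    )


CORPUS = [
    "def total_cost(items):\n    total = 0\n    for price, quantity in items:\n"
    "        total = total + price * quantity\n    return total",
    "def read_text(path):\n    with open(path) as handle:\n"
    "        return handle.read()",
    "class Stack:\n    def __init__(self):\n        self.data = []\n"
    "    def push(self, x):\n        self.data.append(x)",
    "import sys\nprint(sys.argv[1].upper())",
]

# A verbatim fragment of the first corpus snippet, as a model might emit it.
QUERY = "for price, quantity in items:\n        total = total + price * quantity"

best_match = hybrid_search(QUERY, CORPUS)[0]
print(best_match)
```

Because every window of k-gram hashes in a verbatim fragment also occurs in its source, the source contains all of the fragment's fingerprints, which is why the fingerprint stage is a reliable final ranker once the shortlist contains the right snippet.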
