Fugu-MT Paper Translation (Summary): Functional Consistency of LLM Code Embeddings: A Self-Evolving Data Synthesis Framework for Benchmarking

Paper summary: Functional Consistency of LLM Code Embeddings: A Self-Evolving Data Synthesis Framework for Benchmarking

arxiv url: http://arxiv.org/abs/2508.19558v1
Date: Wed, 27 Aug 2025 04:17:02 GMT
Status: Translation complete
Last updated in system: 2025-08-28 19:07:41.486815
Title: Functional Consistency of LLM Code Embeddings: A Self-Evolving Data Synthesis Framework for Benchmarking
Title (reference translation): Functional Consistency of LLM Code Embeddings: A Self-Evolving Data Synthesis Framework for Benchmarking
Authors: Zhuohao Li, Wenqing Chen, Jianxing Yu, Zhichao Lu
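Abstract summary: Embedding models have shown strong performance on tasks such as clustering, retrieval, and feature extraction, and offer computational advantages over generative models and cross-encoders. This paper proposes a novel data synthesis framework called Functionality-Oriented Code Self-Evolution to construct diverse benchmarks. The framework generates four unique variations from a single code instance.
Reference score (independently computed attention score): 23.980033692974278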
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embedding models have demonstrated strong performance in tasks like clustering, retrieval, and feature extraction while offering computational advantages over generative models and cross-encoders. Benchmarks such as MTEB have shown that text embeddings from large language models (LLMs) capture rich semantic information, but their ability to reflect code-level functional semantics remains unclear. Existing studies largely focus on code clone detection, which emphasizes syntactic similarity and overlooks functional understanding. In this paper, we focus on the functional consistency of LLM code embeddings, which determines if two code snippets perform the same function regardless of syntactic differences. We propose a novel data synthesis framework called Functionality-Oriented Code Self-Evolution to construct diverse and challenging benchmarks. Specifically, we define code examples across four semantic and syntactic categories and find that existing datasets predominantly capture syntactic properties. Our framework generates four unique variations from a single code instance, providing a broader spectrum of code examples that better reflect functional differences. Extensive experiments on three downstream tasks (code clone detection, code functional consistency identification, and code retrieval) demonstrate that embedding models significantly improve their performance when trained on our evolved datasets. These results highlight the effectiveness and generalization of our data synthesis framework, advancing the functional understanding of code.
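Abstract (reference translation): Embedding models have shown strong performance on tasks such as clustering, retrieval, and feature extraction, and offer computational advantages over generative models and cross-encoders. Benchmarks such as MTEB have shown that text embeddings from large language models (LLMs) capture rich semantic information, but their ability to reflect code-level functional semantics remains unclear. Existing studies focus mainly on code clone detection, which emphasizes syntactic similarity and overlooks functional understanding. This paper focuses on the functional consistency of LLM code embeddings, which determines whether two code snippets perform the same function regardless of syntactic differences. It proposes a novel data synthesis framework called Functionality-Oriented Code Self-Evolution to construct diverse benchmarks. Specifically, code examples are defined across four semantic and syntactic categories, and existing datasets are shown to capture predominantly syntactic properties. The framework generates four unique variations from a single code instance. Extensive experiments on three downstream tasks (code clone detection, code functional consistency identification, and code retrieval) show that embedding models improve significantly when trained on the evolved datasets. These results highlight the effectiveness and generalization of the data synthesis framework and advance the functional understanding of code.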
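The distinction the abstract draws between syntactic similarity and functional consistency can be illustrated with a small Python sketch. This is not the paper's method. The helper names (`token_set`, `syntactic_similarity`, `functionally_consistent`), the Jaccard token overlap used as a stand-in for syntactic similarity, and the probe-input check used as a stand-in for functional consistency are all assumptions made for illustration, and the three snippets are hypothetical examples rather than data from the paper.

```python
import io
import tokenize


def token_set(source):
    """Return the set of non-whitespace lexical tokens in a Python snippet."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return {tok.string for tok in tokens if tok.string.strip()}


def syntactic_similarity(a, b):
    """Jaccard overlap of token sets (assumed proxy for syntactic similarity)."""
    ta, tb = token_set(a), token_set(b)
    return len(ta & tb) / len(ta | tb)


def functionally_consistent(a, b, name, inputs):
    """True if both snippets define `name` and agree on every probe input.

    Agreement on a finite probe set is only evidence of equivalence, not proof.
    """
    env_a, env_b = {}, {}
    exec(a, env_a)  # only run trusted snippets this way
    exec(b, env_b)
    return all(env_a[name](x) == env_b[name](x) for x in inputs)


# Hypothetical snippets: two ways to sum a list, and a look-alike that multiplies.
LOOP_SUM = "def f(xs):\n    total = 0\n    for x in xs:\n        total += x\n    return total\n"
BUILTIN_SUM = "def f(xs):\n    return sum(xs)\n"
LOOP_PRODUCT = "def f(xs):\n    total = 1\n    for x in xs:\n        total *= x\n    return total\n"

PROBES = [[], [1], [2, 3], [4, 5, 6]]

if __name__ == "__main__":
    for label, other in [("builtin sum", BUILTIN_SUM), ("loop product", LOOP_PRODUCT)]:
        sim = syntactic_similarity(LOOP_SUM, other)
        same = functionally_consistent(LOOP_SUM, other, "f", PROBES)
        print(f"loop sum vs {label}: token overlap={sim:.2f}, same behavior={same}")
```

In this toy setup the look-alike product loop shares more tokens with the summing loop than the one-line `sum` version does, yet only the `sum` version behaves the same. A benchmark dominated by surface similarity would not separate those two cases.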


Related papers