Fugu-MT Paper Translation (Abstract): Bridging the Programming Language Gap: Constructing a Multilingual Shared Semantic Space through AST Unification and Graph Matching

Paper overview: Bridging the Programming Language Gap: Constructing a Multilingual Shared Semantic Space through AST Unification and Graph Matching

arxiv url: http://arxiv.org/abs/2605.07788v1
Date: Fri, 08 May 2026 14:26:09 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:39.111387
Title: Bridging the Programming Language Gap: Constructing a Multilingual Shared Semantic Space through AST Unification and Graph Matching
Authors: Junhao Chen, Jingxuan Zhang, Jian He, Yixuan Tang, Weiqin Zou,
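Abstract summary: This paper proposes a novel approach for constructing a multi-language shared semantic space. First, the Abstract Syntax Tree (AST) node labels of code snippets written in different programming languages are mapped into a unified label set. Then, a Graph Matching Network (GMN) encodes the paired AST graphs into "semantic vectors". For cross-language code retrieval, the average Mean Reciprocal Rank (MRR) rises from 0.4909 to 0.5547.
Reference score (independently computed attention score): 19.367897761393436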
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The lexical and syntactic disparities among different programming languages (e.g., Java and Python) pose significant challenges for multi-language software engineering tasks such as cross-language code clone detection and code retrieval, since queries or code snippets written in one programming language often fail to match equivalent artifacts in another. To bridge this gap, we propose a novel approach to construct a multi-language shared semantic space, in which functionally equivalent code snippets written in different programming languages lie close to each other. In this approach, we first map the Abstract Syntax Tree (AST) node labels of the code snippets written in different programming languages into a unified label set, thus compressing high-dimensional language-specific tokens into a common embedding space. Then, we employ a Graph Matching Network (GMN) to encode the paired AST graphs into "semantic vectors" that capture functional equivalence between programming languages in a unified code vector space. In this way, we can eliminate the differences in syntax between different programming languages. To validate the effectiveness of this approach, we apply it to two downstream tasks: cross-language clone detection and cross-language code retrieval. Experiments demonstrate that our approach substantially outperforms the state-of-the-art baselines in cross-language clone detection, improving Precision from 95.62% to 99.94%, Recall from 97.72% to 99.92%, and F1 score from 96.94% to 99.93%. In cross-language code retrieval, our approach raises the average Mean Reciprocal Rank (MRR) from 0.4909 to 0.5547, an absolute gain of 0.0638 (13% relative improvement), which demonstrates its superior ability to rank correct code snippets high across multiple programming languages.
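
The label-unification step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's unified label set, so the mapping tables, the label names (FUNCTION, LOOP, and so on), and the helper functions below are all hypothetical. Python labels come from the standard `ast` module; the Java node names are supplied by hand, standing in for the output of a Java parser.

```python
import ast

# Hypothetical mapping tables: the actual unified label set used in the
# paper is not stated in the abstract. These only illustrate the idea.
PYTHON_TO_UNIFIED = {
    "FunctionDef": "FUNCTION",
    "For": "LOOP",
    "While": "LOOP",
    "If": "BRANCH",
    "Return": "RETURN",
    "Assign": "ASSIGN",
    "AugAssign": "ASSIGN",
    "Call": "CALL",
}
JAVA_TO_UNIFIED = {
    "MethodDeclaration": "FUNCTION",
    "ForStatement": "LOOP",
    "WhileStatement": "LOOP",
    "IfStatement": "BRANCH",
    "ReturnStatement": "RETURN",
    "Assignment": "ASSIGN",
    "MethodInvocation": "CALL",
}
TABLES = {"python": PYTHON_TO_UNIFIED, "java": JAVA_TO_UNIFIED}


def unify(labels, language):
    """Map language-specific AST node labels to the shared label set.

    Labels without an entry in the table fall back to "OTHER".
    """
    table = TABLES[language]
    return [table.get(label, "OTHER") for label in labels]


def python_node_labels(source):
    """Return the node type name of every node in a Python AST."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


if __name__ == "__main__":
    py_source = (
        "def total(xs):\n"
        "    s = 0\n"
        "    for x in xs:\n"
        "        s += x\n"
        "    return s\n"
    )
    # Hand-written stand-in for the node labels of an equivalent Java method.
    java_labels = ["MethodDeclaration", "Assignment", "ForStatement",
                   "Assignment", "ReturnStatement"]
    print(unify(python_node_labels(py_source), "python"))
    print(unify(java_labels, "java"))
```

After unification, both snippets are described with the same small vocabulary, which is what would let a downstream graph encoder treat them uniformly regardless of source language.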
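
For readers unfamiliar with the retrieval metric, the following sketch shows how Mean Reciprocal Rank is conventionally computed and checks the gain arithmetic quoted in the abstract. The function name and the example ranks are illustrative assumptions; only the two MRR values (0.4909 and 0.5547) come from the abstract.

```python
def mean_reciprocal_rank(ranks):
    """Compute MRR from the 1-based rank of the first correct result per query.

    A rank of None means the correct snippet was not retrieved at all and
    contributes 0 to the average.
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1.0 / r if r is not None else 0.0 for r in ranks) / len(ranks)


# MRR values reported in the abstract (baseline vs. proposed approach).
baseline_mrr = 0.4909
proposed_mrr = 0.5547

absolute_gain = proposed_mrr - baseline_mrr   # difference in MRR
relative_gain = absolute_gain / baseline_mrr  # gain as a fraction of baseline

if __name__ == "__main__":
    # Illustrative ranks for three queries, not data from the paper.
    print(mean_reciprocal_rank([1, 2, 4]))
    print(f"absolute gain: {absolute_gain:.4f}")
    print(f"relative gain: {relative_gain:.1%}")
```

The relative figure is the absolute gain divided by the baseline MRR, which is how 0.0638 corresponds to roughly 13%.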


Related papers