Fugu-MT Paper Translation (Abstract): Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?

Paper overview: Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?

arxiv url: http://arxiv.org/abs/2312.00413v1
Date: Fri, 1 Dec 2023 08:37:27 GMT
Status: Translation complete
Last updated in system: 2023-12-04 15:18:10.012532
Title: Abstract Syntax Tree for Programming Language Understanding and Representation: How Far Are We?
Authors: Weisong Sun and Chunrong Fang and Yun Miao and Yudu You and Mengzhe Yuan and Yuchen Chen and Quanjun Zhang and An Guo and Xiang Chen and Yang Liu and Zhenyu Chen
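Abstract summary: Programming language understanding and representation (code representation learning) has always been a hot and challenging task in software engineering. The abstract syntax tree (AST) represents the syntactic information of source code and is widely used in code representation learning. The study compares the performance of models trained with code token sequence (Token for short) based code representation and with AST-based code representation on three types of code-related tasks.
Reference score (attention score computed by this site): 23.52632194060246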
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Programming language understanding and representation (a.k.a. code representation learning) has always been a hot and challenging task in software engineering. It aims to apply deep learning techniques to produce numerical representations of source code features while preserving their semantics. These representations can be used to facilitate subsequent code-related tasks. The abstract syntax tree (AST), a fundamental code feature, captures the syntactic information of the source code and has been widely used in code representation learning. However, there is still a lack of systematic and quantitative evaluation of how well AST-based code representation facilitates subsequent code-related tasks. In this paper, we first conduct a comprehensive empirical study to explore the effectiveness of AST-based code representation in facilitating follow-up code-related tasks. To do so, we compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks. Surprisingly, the overall quantitative statistical results demonstrate that models trained with AST-based code representation consistently perform worse across all three tasks compared to models trained with Token-based code representation. Our further quantitative analysis reveals that models trained with AST-based code representation outperform models trained with Token-based code representation on certain subsets of samples across all three tasks. We also conduct comprehensive experiments to evaluate and reveal the impact of the choice of AST parsing/preprocessing/encoding methods on AST-based code representation and subsequent code-related tasks. Our study provides future researchers with detailed guidance on how to select solutions at each stage to fully exploit AST.
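To make the two kinds of input compared in the abstract concrete, here is a minimal sketch that derives a token sequence and a pre-order list of AST node types from the same snippet. It uses Python's standard `tokenize` and `ast` modules purely for illustration. The sample function, the helper names `token_sequence` and `ast_node_sequence`, and the pre-order flattening are assumptions of this sketch, not the parsing, preprocessing, or encoding methods evaluated in the paper.

```python
import ast
import io
import tokenize

# Hypothetical sample input; any small, valid Python snippet would do.
SOURCE = "def add(a, b):\n    return a + b\n"

def token_sequence(source):
    """Token-based view: the code as a flat list of lexical tokens."""
    # Layout-only tokens are dropped so that only code tokens remain.
    skip = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
            tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT}
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tok.string for tok in tokens if tok.type not in skip]

def ast_node_sequence(source):
    """AST-based view: node type names listed in pre-order."""
    def visit(node):
        yield type(node).__name__
        for child in ast.iter_child_nodes(node):
            yield from visit(child)
    return list(visit(ast.parse(source)))

if __name__ == "__main__":
    print(token_sequence(SOURCE))
    print(ast_node_sequence(SOURCE))
```

The token view keeps identifiers and operators in reading order, while the AST view exposes syntactic categories such as `FunctionDef` and `BinOp` that never appear literally in the source text.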

Related papers

Code Execution with Pre-trained Language Models [88.04688617516827]
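Most pre-trained models for code intelligence ignore execution traces and rely only on source code and syntactic structures. We develop a mutation-based data augmentation technique to create a large-scale, realistic Python dataset and task for code execution. We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
Paper reference translation (metadata) (2023-05-08T10:00:05Z)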
xASTNN: Improved Code Representations for Industrial Practice [30.45577773085939]
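We present xASTNN, an eXtreme Abstract Syntax Tree (AST)-based neural network for source code representation. First, xASTNN is based entirely on the widely used AST and requires no complicated data pre-processing. Second, three closely related designs are proposed to guarantee the effectiveness of xASTNN. Third, a dynamic algorithm is introduced to significantly reduce the time complexity of xASTNN.
Paper reference translation (metadata) (2023-03-13T13:42:13Z)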
Soft-Labeled Contrastive Pre-training for Function-level Code Representation [127.71430696347174]
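We present SCodeR, a soft-labeled contrastive pre-training framework with two positive sample construction methods. By considering the relevance between code snippets in a large-scale code corpus, the soft-labeled contrastive pre-training can obtain fine-grained soft labels. SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
Paper reference translation (metadata) (2022-10-18T05:17:37Z)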
Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
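We propose a new approach to code search that uses multimodal contrastive learning and soft data augmentation. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset covering six programming languages.
Paper reference translation (metadata) (2022-04-07T08:49:27Z)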
ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
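We propose a retrieval-augmented code completion framework that uses retrieval to leverage both lexical copying and reference to code with similar semantics. We evaluate our approach on code completion tasks in the Python and Java programming languages and achieve state-of-the-art performance on the CodeXGLUE benchmark.
Paper reference translation (metadata) (2022-03-15T08:25:08Z)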
UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
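We present UniXcoder, a unified cross-modal pre-trained model for programming languages. We propose a one-to-one mapping method that transforms an AST into a sequence structure retaining all of the tree's structural information. We evaluate UniXcoder on five code-related tasks over nine datasets.
Paper reference translation (metadata) (2022-03-08T04:48:07Z)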
CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained Model [23.947178895479464]
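We propose CLSEBERT, a contrastive learning framework for a syntax-enhanced code pre-trained model. In the pre-training stage, we consider the code syntax and hierarchy contained in the abstract syntax tree (AST). One task is to predict the edges between nodes in the abstract syntax tree, and the other is to predict the types of code tokens.
Paper reference translation (metadata) (2021-08-10T10:08:21Z)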
On the Impact of Multiple Source Code Representations on Software Engineering Tasks -- An Empirical Study [4.049850026698639]
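We modify an AST path-based approach to accept multiple representations as input to an attention-based model. We evaluate our approach on three tasks: method naming, program classification, and clone detection.
Paper reference translation (metadata) (2021-06-21T08:36:38Z)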
InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees [17.461451218469062]
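We propose InferCode, which overcomes the limitation by adapting a self-supervised learning mechanism to source code models. Subtrees of the AST are treated in InferCode as labels for training code representations, without human labeling effort or the overhead of expensive graph construction. Compared with previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, and ASTNN, higher performance is achieved using the pre-trained InferCode model.
Paper reference translation (metadata) (2020-12-13T10:33:41Z)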
GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
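We present GraphCodeBERT, a pre-trained model for programming languages that considers the inherent structure of code. It uses data flow, a semantic-level structure of code that encodes the "where-the-value-comes-from" relation between variables. We evaluate the model on four tasks: code search, clone detection, code translation, and code refinement.
Paper reference translation (metadata) (2020-09-17T15:25:56Z)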
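
As a rough illustration of the "where-the-value-comes-from" relation mentioned in the GraphCodeBERT entry above, the sketch below links each variable use to the line of its most recent assignment. It handles only top-level, straight-line assignments in Python. The function name `value_source_edges`, the edge format, and the sample snippet are assumptions of this sketch, and it is not the data-flow extraction used by GraphCodeBERT.

```python
import ast

def value_source_edges(source):
    """Return (variable, use_line, def_line) edges for simple assignments.

    Simplifying assumption: only top-level `name = expression` statements
    are considered; branches, loops, and function bodies are ignored.
    """
    last_def = {}  # variable name -> line number of its latest assignment
    edges = []
    for stmt in ast.parse(source).body:
        if not isinstance(stmt, ast.Assign):
            continue
        # Resolve uses on the right-hand side before updating the target,
        # so `x = y * x` still points at the previous definition of x.
        for node in ast.walk(stmt.value):
            if isinstance(node, ast.Name) and node.id in last_def:
                edges.append((node.id, stmt.lineno, last_def[node.id]))
        for target in stmt.targets:
            if isinstance(target, ast.Name):
                last_def[target.id] = stmt.lineno
    return edges

if __name__ == "__main__":
    code = "x = 1\ny = x + 2\nx = y * x\n"
    for name, use_line, def_line in value_source_edges(code):
        print(f"{name} used on line {use_line} comes from line {def_line}")
```

Each edge says where a value was last written, which is information that neither a flat token sequence nor the bare syntax tree states explicitly.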

The related papers list is generated automatically from the titles and abstracts of papers on this site.
