Abstract Syntax Tree for Programming Language Understanding and
Representation: How Far Are We?
- URL: http://arxiv.org/abs/2312.00413v1
- Date: Fri, 1 Dec 2023 08:37:27 GMT
- Title: Abstract Syntax Tree for Programming Language Understanding and
Representation: How Far Are We?
- Authors: Weisong Sun and Chunrong Fang and Yun Miao and Yudu You and Mengzhe
Yuan and Yuchen Chen and Quanjun Zhang and An Guo and Xiang Chen and Yang Liu
and Zhenyu Chen
- Abstract summary: Programming language understanding and representation (a.k.a. code representation learning) has long been a popular and challenging task in software engineering.
The abstract syntax tree (AST), a fundamental code feature, illustrates the syntactic information of the source code and has been widely used in code representation learning.
We compare the performance of models trained with code token sequence (Token for short) based code representation and AST-based code representation on three popular types of code-related tasks.
- Score: 23.52632194060246
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Programming language understanding and representation (a.k.a. code
representation learning) has long been a popular and challenging task in software
engineering. It aims to apply deep learning techniques to produce numerical
representations of the source code features while preserving its semantics.
These representations can be used for facilitating subsequent code-related
tasks. The abstract syntax tree (AST), a fundamental code feature, illustrates
the syntactic information of the source code and has been widely used in code
representation learning. However, there is still a lack of systematic and
quantitative evaluation of how well AST-based code representation facilitates
subsequent code-related tasks. In this paper, we first conduct a comprehensive
empirical study to explore the effectiveness of the AST-based code
representation in facilitating follow-up code-related tasks. To do so, we
compare the performance of models trained with code token sequence (Token for
short) based code representation and AST-based code representation on three
popular types of code-related tasks. Surprisingly, the overall quantitative
statistical results demonstrate that models trained with AST-based code
representation consistently perform worse across all three tasks compared to
models trained with Token-based code representation. Our further quantitative
analysis reveals that models trained with AST-based code representation
outperform models trained with Token-based code representation in certain
subsets of samples across all three tasks. We also conduct comprehensive
experiments to evaluate and reveal the impact of the choice of AST
parsing/preprocessing/encoding methods on AST-based code representation and
subsequent code-related tasks. Our study provides future researchers with
detailed guidance on how to select solutions at each stage to fully exploit
AST.
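The Token-based versus AST-based distinction at the heart of the study can be made concrete with Python's standard library. The snippet below is an illustrative sketch, not the paper's actual pipeline: it contrasts a flat lexical token sequence with a node-type sequence derived from the AST of the same function.

```python
import ast
import io
import tokenize

SRC = "def add(a, b):\n    return a + b\n"

def token_sequence(src):
    """Token-based view: a flat sequence of lexical tokens."""
    toks = tokenize.generate_tokens(io.StringIO(src).readline)
    # Drop whitespace-only tokens (NEWLINE, INDENT, ENDMARKER, ...)
    return [t.string for t in toks if t.string.strip()]

def ast_node_sequence(src):
    """AST-based view: node-type names from walking the syntax tree
    (ast.walk yields nodes breadth-first)."""
    return [type(n).__name__ for n in ast.walk(ast.parse(src))]

print(token_sequence(SRC))     # lexical tokens: 'def', 'add', '(', ...
print(ast_node_sequence(SRC))  # node types: 'Module', 'FunctionDef', ...
```

A model trained on the first view sees only the surface token stream, while a model trained on the second must encode explicit syntactic structure; quantifying that trade-off across downstream tasks is what the study does.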
Related papers
- Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
- xASTNN: Improved Code Representations for Industrial Practice [30.45577773085939]
We present xASTNN, an eXtreme Abstract Syntax Tree (AST)-based Neural Network for source code representation.
First, xASTNN is completely based on widely-used ASTs and does not require complicated data pre-processing.
Second, three closely-related designs are proposed to guarantee the effectiveness of xASTNN.
Third, a dynamic algorithm is introduced to significantly reduce the time complexity of xASTNN.
arXiv Detail & Related papers (2023-03-13T13:42:13Z)
- Soft-Labeled Contrastive Pre-training for Function-level Code Representation [127.71430696347174]
We present SCodeR, a Soft-labeled contrastive pre-training framework with two positive-sample construction methods.
By considering the relevance between code snippets in a large-scale code corpus, the soft-labeled contrastive pre-training can obtain fine-grained soft labels.
SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
arXiv Detail & Related papers (2022-10-18T05:17:37Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- ReACC: A Retrieval-Augmented Code Completion Framework [53.49707123661763]
We propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval.
We evaluate our approach on the code completion task in the Python and Java programming languages, achieving state-of-the-art performance on the CodeXGLUE benchmark.
arXiv Detail & Related papers (2022-03-15T08:25:08Z)
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming language.
We propose a one-to-one mapping method to transform an AST into a sequence structure that retains all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z)
- CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained Model [23.947178895479464]
We propose CLSEBERT, a Contrastive Learning Framework for a Syntax Enhanced Code Pre-Trained Model.
In the pre-training stage, we consider the code syntax and hierarchy contained in the Abstract Syntax Tree (AST).
We also introduce two novel pre-training objectives. One is to predict the edges between nodes in the abstract syntax tree, and the other is to predict the types of code tokens.
arXiv Detail & Related papers (2021-08-10T10:08:21Z)
- On the Impact of Multiple Source Code Representations on Software Engineering Tasks -- An Empirical Study [4.049850026698639]
We modify an AST path-based approach to accept multiple representations as input to an attention-based model.
We evaluate our approach on three tasks: Method Naming, Program Classification, and Clone Detection.
arXiv Detail & Related papers (2021-06-21T08:36:38Z)
- InferCode: Self-Supervised Learning of Code Representations by Predicting Subtrees [17.461451218469062]
This paper proposes InferCode to overcome this limitation by adapting a self-supervised learning mechanism to build a source code model.
InferCode treats subtrees in ASTs as the labels for training code representations, without any human labeling effort or the overhead of expensive graph construction.
Compared to previous code learning techniques applied to the same downstream tasks, such as Code2Vec, Code2Seq, and ASTNN, the pre-trained InferCode model achieves higher performance.
arXiv Detail & Related papers (2020-12-13T10:33:41Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
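GraphCodeBERT's "where-the-value-comes-from" relation can be illustrated with a toy extractor over straight-line Python code. This is a simplified sketch for intuition only (no branches, no attribute or subscript targets), not GraphCodeBERT's actual data-flow construction: each variable read is linked to the line of its most recent assignment.

```python
import ast

SRC = "x = 1\ny = x + 2\nz = x + y\n"

def value_flow_edges(src):
    """Naive 'where-the-value-comes-from' edges for straight-line code:
    link each variable read to the line of its latest assignment."""
    last_def = {}  # variable name -> line number of its latest assignment
    edges = []     # (use_line, variable, def_line)
    for stmt in ast.parse(src).body:
        if isinstance(stmt, ast.Assign):
            # Record reads on the right-hand side first
            for node in ast.walk(stmt.value):
                if isinstance(node, ast.Name) and node.id in last_def:
                    edges.append((stmt.lineno, node.id, last_def[node.id]))
            # Then record the new definition(s)
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    last_def[target.id] = stmt.lineno
    return edges

print(value_flow_edges(SRC))  # e.g. x read on line 2 comes from line 1
```

Pre-training on edges like these, rather than on the raw AST, is what lets GraphCodeBERT capture a semantic-level structure of the code.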
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.