AstBERT: Enabling Language Model for Code Understanding with Abstract
Syntax Tree
- URL: http://arxiv.org/abs/2201.07984v1
- Date: Thu, 20 Jan 2022 03:27:26 GMT
- Title: AstBERT: Enabling Language Model for Code Understanding with Abstract
Syntax Tree
- Authors: Rong Liang, Yujie Lu, Zhen Huang, Tiehua Zhang, Yuze Liu
- Abstract summary: We propose AstBERT, a pre-trained language model aiming to better understand the programming language (PL) using the abstract syntax tree (AST).
Specifically, we collect a large amount of source code (both Java and Python) from GitHub and, with the help of code parsers, interpret and integrate the AST information of the source code.
Experimental results show that AstBERT achieves state-of-the-art performance on both downstream tasks.
- Score: 3.1087379479634927
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Using a pre-trained language model (i.e. BERT) to understand source
code has attracted increasing attention in the natural language processing
community. However, several challenges arise when applying such language models
directly to programming language (PL) problems, the most significant being a
lack of domain knowledge that substantially degrades model performance. To this
end, we propose AstBERT, a pre-trained language model that aims to better
understand PL by exploiting the abstract syntax tree (AST). Specifically, we
collect a large amount of source code (both Java and Python) from GitHub and
incorporate contextual code knowledge into the model with the help of code
parsers, through which AST information of the source code is interpreted and
integrated. We verify the performance of the proposed model on code information
extraction and code search tasks. Experimental results show that AstBERT
achieves state-of-the-art performance on both downstream tasks (96.4% on the
code information extraction task and 57.12% on the code search task).
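The abstract does not spell out how the code parsers expose AST information, so the snippet below is only a minimal sketch of the idea, using Python's standard ast module as an assumed stand-in rather than the paper's actual parsing pipeline: it pairs each AST node type with the source fragment it spans, which is the kind of structural signal an AST-aware model can align with the raw token sequence.

```python
import ast

source = """
def add(a, b):
    return a + b
"""

# Parse the source into an abstract syntax tree (illustrative only; the
# paper's parsers for Java and Python are not specified in the abstract).
tree = ast.parse(source)

# Walk the tree and record each node's type together with the source
# segment it covers, i.e. a (structure, tokens) pairing.
for node in ast.walk(tree):
    segment = ast.get_source_segment(source, node)
    if segment is not None:
        print(f"{type(node).__name__:>12}: {segment!r}")
```

Running this prints entries such as FunctionDef for the whole definition and BinOp for a + b, showing how syntactic structure can be attached to concrete code tokens.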
Related papers
- Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA [51.3033125256716]
We model the subgraph retrieval task as a conditional generation task handled by small language models.
Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models.
Our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks.
arXiv Detail & Related papers (2024-10-08T15:22:36Z) - Can OpenSource beat ChatGPT? -- A Comparative Study of Large Language Models for Text-to-Code Generation [0.24578723416255752]
We evaluate five different large language models (LLMs) concerning their capabilities for text-to-code generation.
ChatGPT handles these typical programming challenges by far the most effectively, surpassing even code-specialized models such as Code Llama.
arXiv Detail & Related papers (2024-09-06T10:03:49Z) - CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to bridge the gap between programming languages and natural language.
Experiments and ablations on four datasets covering both C++ and Python validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z) - CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder- and decoder-based models into a single prefix-LM.
For learning methods, we test the claimed "free lunch" hypothesis.
For data distributions, we explore how a mixture of programming and natural languages and multi-epoch training affect model performance.
arXiv Detail & Related papers (2023-05-03T17:55:25Z) - Teaching Large Language Models to Self-Debug [62.424077000154945]
Large language models (LLMs) have achieved impressive performance on code generation.
We propose Self-Debugging, which teaches a large language model to debug its predicted program via few-shot demonstrations.
arXiv Detail & Related papers (2023-04-11T10:43:43Z) - CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained
Model [23.947178895479464]
We propose CLSEBERT, a Contrastive Learning Framework for Syntax Enhanced Code Pre-Trained Model.
In the pre-training stage, we consider the code syntax and hierarchy contained in the Abstract Syntax Tree (AST).
We also introduce two novel pre-training objectives. One is to predict the edges between nodes in the abstract syntax tree, and the other is to predict the types of code tokens (see the edge-extraction sketch after this list).
arXiv Detail & Related papers (2021-08-10T10:08:21Z) - BERT2Code: Can Pretrained Language Models be Leveraged for Code Search? [0.7953229555481884]
We show that our model learns the inherent relationship between the embedding spaces, and we further probe the scope for improvement.
In this analysis, we show that the quality of the code embedding model is the bottleneck for our model's performance.
arXiv Detail & Related papers (2021-04-16T10:28:27Z) - Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z) - CodeBERT: A Pre-Trained Model for Programming and Natural Languages [117.34242908773061]
CodeBERT is a pre-trained model for programming language (PL) and natural language (NL).
We develop CodeBERT with Transformer-based neural architecture.
We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters.
arXiv Detail & Related papers (2020-02-19T13:09:07Z)
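As referenced in the CLSEBERT entry above, one of its pre-training objectives is to predict edges between AST nodes. The sketch below is an assumed illustration of where such edges come from, again using Python's standard ast module rather than CLSEBERT's actual parser: it enumerates the parent-child pairs of a parsed snippet, which is the structural relation an edge-prediction objective asks the model to recover.

```python
import ast

source = "def square(x):\n    return x * x\n"
tree = ast.parse(source)

# Enumerate parent-child edges of the AST; each (parent, child) pair is
# the kind of relation an edge-prediction objective targets.
edges = [
    (type(parent).__name__, type(child).__name__)
    for parent in ast.walk(tree)
    for child in ast.iter_child_nodes(parent)
]
print(edges)
# e.g. [('Module', 'FunctionDef'), ('FunctionDef', 'arguments'), ...]
```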