AST-T5: Structure-Aware Pretraining for Code Generation and Understanding
- URL: http://arxiv.org/abs/2401.03003v4
- Date: Sun, 23 Jun 2024 01:24:33 GMT
- Title: AST-T5: Structure-Aware Pretraining for Code Generation and Understanding
- Authors: Linyuan Gong, Mostafa Elhoushi, Alvin Cheung
- Abstract summary: Large language models (LLMs) have made significant advancements in code-related tasks, yet many LLMs treat code as simple sequences.
We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding.
- Score: 12.929578932351298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have made significant advancements in code-related tasks, yet many LLMs treat code as simple sequences, neglecting its structured nature. We introduce AST-T5, a novel pretraining paradigm that leverages the Abstract Syntax Tree (AST) for enhanced code generation, transpilation, and understanding. Using dynamic programming, our AST-Aware Segmentation retains code structure, while our AST-Aware Span Corruption objective equips the model to reconstruct various code structures. Unlike other models, AST-T5 avoids intricate program analyses or architectural changes, so it integrates seamlessly with any encoder-decoder Transformer. Evaluations show that AST-T5 consistently outperforms similar-sized LMs across various code-related tasks. Structure-awareness makes AST-T5 particularly powerful in code-to-code tasks, surpassing CodeT5 by 2 points in exact match score for the Bugs2Fix task and by 3 points in exact match score for Java-C# Transpilation in CodeXGLUE. Our code and model are publicly available at https://github.com/gonglinyuan/ast_t5.
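The abstract's key mechanism is masking and reconstructing whole code structures rather than arbitrary token spans. As a rough illustration of that idea only (not the paper's implementation, which is available in the linked repository and also involves AST-Aware Segmentation via dynamic programming), the sketch below uses Python's built-in ast module to mask one randomly chosen subtree with a T5-style sentinel; the helper names are invented for this example.

```python
import ast
import random

SENTINEL = "<extra_id_0>"  # T5-style mask token

def subtree_spans(source: str) -> list[tuple[int, int]]:
    """Character offsets (start, end) of every AST subtree in `source`."""
    line_start = [0]
    for line in source.splitlines(keepends=True):
        line_start.append(line_start[-1] + len(line))
    spans = []
    for node in ast.walk(ast.parse(source)):
        if getattr(node, "end_lineno", None) is not None:
            start = line_start[node.lineno - 1] + node.col_offset
            end = line_start[node.end_lineno - 1] + node.end_col_offset
            spans.append((start, end))
    return spans

def ast_aware_corrupt(source: str, seed: int = 0) -> tuple[str, str]:
    """Mask one randomly chosen subtree; return (corrupted input, target)."""
    start, end = random.Random(seed).choice(subtree_spans(source))
    corrupted = source[:start] + SENTINEL + source[end:]
    target = SENTINEL + source[start:end]
    return corrupted, target

corrupted, target = ast_aware_corrupt("def add(a, b):\n    return a + b\n")
print(corrupted)
print(target)
```

On the toy function this produces a corrupted input with `<extra_id_0>` in place of a syntactic unit (for example the whole return statement or an operand) and a target that restores it, which is the flavor of structure-level corruption the abstract describes.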
Related papers
- SparseCoder: Identifier-Aware Sparse Transformer for File-Level Code Summarization [51.67317895094664]
This paper studies file-level code summarization, which can assist programmers in understanding and maintaining large source code projects.
We propose SparseCoder, an identifier-aware sparse transformer for effectively handling long code sequences.
arXiv Detail & Related papers (2024-01-26T09:23:27Z)
- AST-MHSA: Code Summarization using Multi-Head Self-Attention [1.588193964339148]
We present a model, AST-MHSA, that uses multi-head attention to extract semantic information from the abstract syntax tree (AST) of the code.
The model is trained on a dataset of code and summaries, and the parameters are optimized to minimize the loss between the generated summaries and the ground-truth summaries.
arXiv Detail & Related papers (2023-08-10T15:43:46Z)
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)
- Outline, Then Details: Syntactically Guided Coarse-To-Fine Code Generation [61.50286000143233]
ChainCoder is a program synthesis language model that generates Python code progressively.
A tailored transformer architecture is leveraged to jointly encode the natural language descriptions and syntactically aligned I/O data samples.
arXiv Detail & Related papers (2023-04-28T01:47:09Z)
- M2TS: Multi-Scale Multi-Modal Approach Based on Transformer for Source Code Summarization [0.4061135251278187]
Source code summarization aims to generate natural language descriptions of code snippets.
We propose M2TS, a Multi-scale Multi-modal approach based on Transformer for source code Summarization.
We conduct experiments on two Java datasets and one Python dataset, and the results demonstrate that M2TS outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2022-03-18T02:54:06Z)
- UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming language.
We propose a one-to-one mapping method that transforms the AST into a sequence structure while retaining all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
arXiv Detail & Related papers (2022-03-08T04:48:07Z)
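The UniXcoder entry above mentions a one-to-one mapping that turns an AST into a sequence without losing structure. The toy linearization below illustrates the general idea rather than UniXcoder's actual mapping: a depth-first traversal that wraps every node in opening and closing markers, so the tree shape can be recovered from the token sequence.

```python
import ast

def linearize(node: ast.AST) -> list[str]:
    """Flatten an AST depth-first; bracket tokens keep the tree recoverable."""
    tokens = [f"<{type(node).__name__}>"]       # opening marker, e.g. <FunctionDef>
    for child in ast.iter_child_nodes(node):
        tokens += linearize(child)
    tokens.append(f"</{type(node).__name__}>")  # matching closing marker
    return tokens

tree = ast.parse("def add(a, b):\n    return a + b\n")
print(" ".join(linearize(tree)))
# e.g. <Module> <FunctionDef> <arguments> <arg> </arg> ... </FunctionDef> </Module>
```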
- AST-Transformer: Encoding Abstract Syntax Trees Efficiently for Code Summarization [14.225206904493627]
We propose AST-Transformer to efficiently encode tree-structured ASTs.
Experiments show that AST-Transformer outperforms the state of the art by a substantial margin.
arXiv Detail & Related papers (2021-12-02T12:57:22Z)
- Contrastive Learning for Source Code with Structural and Functional Properties [66.10710134948478]
We present BOOST, a novel self-supervised model to focus pre-training based on the characteristics of source code.
We employ automated, structure-guided code transformation algorithms that generate functionally equivalent code that looks drastically different from the original one.
We train our model with a contrastive learning objective that pulls functionally equivalent code closer together and pushes distinct code further apart.
arXiv Detail & Related papers (2021-10-08T02:56:43Z)
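The BOOST entry above relies on a contrastive objective over pairs of functionally equivalent programs. Below is a minimal InfoNCE-style sketch of that kind of objective, assuming some encoder already maps each snippet and its structure-guided transform to an embedding; it illustrates the shape of the loss, not the paper's exact formulation, and the tensors here are random stand-ins for encoder outputs.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchors: torch.Tensor, positives: torch.Tensor,
                     temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over a batch: row i of `positives` is the functionally
    equivalent transform of row i of `anchors`; other rows act as negatives."""
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature      # (batch, batch) cosine similarities
    labels = torch.arange(a.size(0))    # the matching transform is the positive
    return F.cross_entropy(logits, labels)

# Toy usage with random embeddings standing in for a code encoder.
emb_original = torch.randn(8, 256)     # embeddings of original snippets
emb_transformed = torch.randn(8, 256)  # embeddings of equivalent transforms
print(contrastive_loss(emb_original, emb_transformed))
```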
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
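GraphCodeBERT's data flow records where each variable's value comes from. As a toy approximation of that relation for straight-line Python (not GraphCodeBERT's actual multi-language extraction pipeline), the sketch below links every variable read to the most recent assignment of the same name; the function name and edge format are invented for this example.

```python
import ast

def dataflow_edges(source: str) -> list[tuple[str, int, int]]:
    """Toy 'where-the-value-comes-from' edges: link each variable read to the
    most recent earlier assignment of the same name (straight-line code only)."""
    names = [n for n in ast.walk(ast.parse(source)) if isinstance(n, ast.Name)]
    names.sort(key=lambda n: (n.lineno, n.col_offset))  # process in source order
    last_def = {}  # variable name -> line of its latest assignment seen so far
    edges = []     # (name, use_line, def_line)
    for node in names:
        if isinstance(node.ctx, ast.Load) and node.id in last_def:
            edges.append((node.id, node.lineno, last_def[node.id]))
        elif isinstance(node.ctx, ast.Store):
            last_def[node.id] = node.lineno
    return edges

src = "x = 1\ny = x + 2\nz = x + y\n"
print(dataflow_edges(src))  # [('x', 2, 1), ('x', 3, 1), ('y', 3, 2)]
```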
This list is automatically generated from the titles and abstracts of the papers on this site.