Unified Pre-training for Program Understanding and Generation
- URL: http://arxiv.org/abs/2103.06333v1
- Date: Wed, 10 Mar 2021 20:32:59 GMT
- Title: Unified Pre-training for Program Understanding and Generation
- Authors: Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang
- Abstract summary: PLBART is a sequence-to-sequence model capable of performing a broad spectrum of program and language understanding and generation tasks.
PLBART is pre-trained on an extensive collection of Java and Python functions and associated NL text via denoising autoencoding.
- Score: 46.89905110678675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Code summarization and generation empower conversion between programming
language (PL) and natural language (NL), while code translation avails the
migration of legacy code from one PL to another. This paper introduces PLBART,
a sequence-to-sequence model capable of performing a broad spectrum of program
and language understanding and generation tasks. PLBART is pre-trained on an
extensive collection of Java and Python functions and associated NL text via
denoising autoencoding. Experiments on language generation tasks, including
code summarization, generation, and translation in seven programming languages,
show that PLBART outperforms or rivals state-of-the-art models. Moreover,
experiments on discriminative tasks, e.g., program repair, clone detection, and
vulnerable code detection, demonstrate PLBART's effectiveness in program
understanding. Furthermore, analysis reveals that PLBART learns program syntax,
style (e.g., identifier naming convention), and logical flow (e.g., an if block
inside an else block is equivalent to an else if block), which are crucial to
program semantics, and thus excels even with limited annotations.
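The denoising autoencoding objective mentioned in the abstract can be illustrated with a minimal sketch. It assumes plain token masking over whitespace-split tokens; the abstract does not specify the noise function, so the `<mask>` symbol, the `mask_prob` value, and the helper names `add_noise` and `make_denoising_pair` are hypothetical and are not taken from PLBART itself. The idea is only that a sequence-to-sequence model receives a corrupted input and is trained to reconstruct the original sequence.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol, not PLBART's actual vocabulary entry


def add_noise(tokens, mask_prob=0.35, seed=0):
    """Return a copy of `tokens` in which each token is independently
    replaced by MASK with probability `mask_prob` (assumed noise function)."""
    rng = random.Random(seed)  # local generator keeps the example reproducible
    return [MASK if rng.random() < mask_prob else tok for tok in tokens]


def make_denoising_pair(source, mask_prob=0.35, seed=0):
    """Build one (noisy input, clean target) training pair from a code string.

    Whitespace tokenization is a simplification made for this sketch."""
    target = source.split()
    noisy = add_noise(target, mask_prob, seed)
    return noisy, target


code = "def add ( a , b ) : return a + b"
noisy, target = make_denoising_pair(code)
print("encoder input :", " ".join(noisy))
print("decoder target:", " ".join(target))
```

The encoder would see the masked sequence and the decoder would be trained to emit the unmasked one; no labels beyond the raw code are needed, which is why such pre-training can use large unannotated corpora.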
Related papers
- NoviCode: Generating Programs from Natural Language Utterances by Novices [59.71218039095155]
We present NoviCode, a novel NL Programming task which takes as input an API and a natural language description by a novice non-programmer.
We show that NoviCode is indeed a challenging task in the code synthesis domain, and that generating complex code from non-technical instructions goes beyond the current Text-to-Code paradigm.
arXiv Detail & Related papers (2024-07-15T11:26:03Z)
- Synthetic Programming Elicitation and Repair for Text-to-Code in Very Low-Resource Programming Languages [21.18996339478024]
We introduce synthetic programming elicitation and compilation (SPEAC).
SPEAC produces syntactically correct programs significantly more frequently without sacrificing semantic correctness.
We empirically evaluate the performance of SPEAC in a case study and find that, compared to existing retrieval and fine-tuning baselines, SPEAC produces syntactically correct programs significantly more frequently.
arXiv Detail & Related papers (2024-06-05T22:16:19Z)
- CodeGRAG: Extracting Composed Syntax Graphs for Retrieval Augmented Cross-Lingual Code Generation [60.799992690487336]
We propose Syntax Graph Retrieval Augmented Code Generation (CodeGRAG) to enhance the performance of LLMs in single-round code generation tasks.
CodeGRAG significantly improves the code generation ability of LLMs and can even offer performance gain for cross-lingual code generation.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z)
- Understanding Programs by Exploiting (Fuzzing) Test Cases [26.8259045248779]
We propose to incorporate the relationship between inputs and possible outputs/behaviors into learning, for achieving a deeper semantic understanding of programs.
To obtain inputs that are representative enough to trigger the execution of most parts of the code, we resort to fuzz testing and propose fuzz tuning.
The effectiveness of the proposed method is verified on two program understanding tasks, code clone detection and code classification, where it outperforms current state-of-the-art methods by large margins.
arXiv Detail & Related papers (2023-05-23T01:51:46Z)
- LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
- PanGu-Coder: Program Synthesis with Function-Level Language Modeling [47.63943623661298]
PanGu-Coder is a pretrained decoder-only language model adopting the PanGu-Alpha architecture for text-to-code generation.
We train PanGu-Coder using a two-stage strategy: the first stage employs Causal Language Modelling to pre-train on raw programming language data.
The second stage uses a combination of Causal Language Modelling and Masked Language Modelling to train on loosely curated pairs of natural language program definitions and code functions.
arXiv Detail & Related papers (2022-07-22T18:08:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.