A Syntax-Guided Multi-Task Learning Approach for Turducken-Style Code
- URL: http://arxiv.org/abs/2303.05061v2
- Date: Sat, 29 Jul 2023 03:08:13 GMT
- Title: A Syntax-Guided Multi-Task Learning Approach for Turducken-Style Code
- Authors: Guang Yang, Yu Zhou, Xiang Chen, Xiangyu Zhang, Yiran Xu, Tingting
Han, Taolue Chen
- Abstract summary: We propose a syntax-guided multi-task learning approach TurduckenGen.
Specifically, we first explicitly append the type information to the code tokens to capture the representation of syntactic constraints.
Then we formalize code generation with syntactic constraint representation as an auxiliary task to enable the model to learn the syntactic constraints of the code.
- Score: 19.489202790935902
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Due to the development of pre-trained language models, automated code
generation techniques have shown great promise in recent years. However, the
generated code often fails to meet the syntactic constraints of the target
language, especially in the case of Turducken-style code, where declarative
code snippets are embedded within imperative programs. In this study, we
summarize the lack of syntactic constraints into three significant challenges:
(1) the efficient representation of syntactic constraints, (2) the effective
integration of syntactic information, and (3) the scalable syntax-first
decoding algorithm. To address these challenges, we propose a syntax-guided
multi-task learning approach TurduckenGen. Specifically, we first explicitly
append the type information to the code tokens to capture the representation of
syntactic constraints. Then we formalize code generation with syntactic
constraint representation as an auxiliary task to enable the model to learn the
syntactic constraints of the code. Finally, the syntactically correct code is
selected accurately from the multiple candidates with the help of the compiler
feedback. Extensive experiments and comprehensive analysis on two
Turducken-style code datasets demonstrate the effectiveness and general
applicability of our approach in comparison with six state-of-the-art baselines.
Finally, we conducted a human study and found that the quality of the code
generated by our approach is better than that of the baselines in terms of code
readability and semantic
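Two of the steps named in the abstract, appending type information to code tokens and picking a syntactically correct candidate using compiler feedback, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the TurduckenGen implementation: it uses Python's standard `tokenize` module and the built-in `compile()` as stand-ins for the paper's tokenizer and compiler, and the function names `append_type_info` and `select_compilable` and the `token|TYPE` format are hypothetical.

```python
import io
import tokenize
from typing import List, Optional

# Layout-only token kinds that carry no text worth annotating.
_LAYOUT = (tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER)


def append_type_info(source: str) -> List[str]:
    """Return the tokens of `source`, each suffixed with its lexical type.

    Hypothetical stand-in for a syntactic constraint representation:
    every token becomes "<text>|<TYPE>" using Python's tokenizer categories.
    """
    typed = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in _LAYOUT:
            continue
        typed.append(f"{tok.string}|{tokenize.tok_name[tok.type]}")
    return typed


def select_compilable(candidates: List[str]) -> Optional[str]:
    """Return the first candidate the compiler accepts, or None.

    Assumes `candidates` is already ordered by model preference; the
    built-in compile() plays the role of compiler feedback here.
    """
    for code in candidates:
        try:
            compile(code, "<candidate>", "exec")
        except SyntaxError:
            continue  # syntactically invalid, try the next candidate
        return code
    return None


if __name__ == "__main__":
    print(append_type_info("x = 1"))
    print(select_compilable(["print('a'", "print('a')"]))
```

In this sketch the typed token sequence would serve as the target of the auxiliary task, while the compiler check is applied only at inference time to filter the generated candidates.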
Related papers
- NoviCode: Generating Programs from Natural Language Utterances by Novices [59.71218039095155]
We present NoviCode, a novel NL Programming task which takes as input an API and a natural language description by a novice non-programmer.
We show that NoviCode is indeed a challenging task in the code synthesis domain, and that generating complex code from non-technical instructions goes beyond the current Text-to-Code paradigm.
arXiv Detail & Related papers (2024-07-15T11:26:03Z)
- Contrastive Prompt Learning-based Code Search based on Interaction Matrix [5.379749366580253]
We propose CPLCS, a contrastive prompt learning-based code search method based on the cross-modal interaction mechanism.
We conduct extensive experiments to evaluate the effectiveness of our approach on a real-world dataset across six programming languages.
arXiv Detail & Related papers (2023-10-10T06:24:52Z)
- Benchmarking Language Models for Code Syntax Understanding [79.11525961219591]
Pre-trained language models have demonstrated impressive performance in both natural language processing and program understanding.
In this work, we perform the first thorough benchmarking of the state-of-the-art pre-trained models for identifying the syntactic structures of programs.
Our findings point out key limitations of existing pre-training methods for programming languages, and suggest the importance of modeling code syntactic structures.
arXiv Detail & Related papers (2022-10-26T04:47:18Z)
- Soft-Labeled Contrastive Pre-training for Function-level Code Representation [127.71430696347174]
We present SCodeR, a Soft-labeled contrastive pre-training framework with two positive sample construction methods.
Considering the relevance between codes in a large-scale code corpus, the soft-labeled contrastive pre-training can obtain fine-grained soft-labels.
SCodeR achieves new state-of-the-art performance on four code-related tasks over seven datasets.
arXiv Detail & Related papers (2022-10-18T05:17:37Z)
- Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation [50.14232079160476]
We propose a new approach with multimodal contrastive learning and soft data augmentation for code search.
We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset with six programming languages.
arXiv Detail & Related papers (2022-04-07T08:49:27Z)
- What Do They Capture? -- A Structural Analysis of Pre-Trained Language Models for Source Code [32.345301158791045]
Pre-trained language models for source code have been proposed to model the context of code.
These models leverage masked pre-training and Transformer.
It is not clear why these models work and what feature correlations they can capture.
arXiv Detail & Related papers (2022-02-14T16:22:10Z)
- CLSEBERT: Contrastive Learning for Syntax Enhanced Code Pre-Trained Model [23.947178895479464]
We propose CLSEBERT, a Contrastive Learning Framework for Syntax Enhanced Code Pre-Trained Model.
In the pre-training stage, we consider the code syntax and hierarchy contained in the Abstract Syntax Tree (AST).
We also introduce two novel pre-training objectives. One is to predict the edges between nodes in the abstract syntax tree, and the other is to predict the types of code tokens.
arXiv Detail & Related papers (2021-08-10T10:08:21Z)
- Adversarial Training for Code Retrieval with Question-Description Relevance Regularization [34.29822107097347]
We adapt a simple adversarial learning technique to generate difficult code snippets given the input question.
We propose to leverage question-description relevance to regularize adversarial learning.
Our adversarial learning method is able to improve the performance of state-of-the-art models.
arXiv Detail & Related papers (2020-10-19T19:32:03Z)
- GraphCodeBERT: Pre-training Code Representations with Data Flow [97.00641522327699]
We present GraphCodeBERT, a pre-trained model for programming language that considers the inherent structure of code.
We use data flow in the pre-training stage, which is a semantic-level structure of code that encodes the relation of "where-the-value-comes-from" between variables.
We evaluate our model on four tasks, including code search, clone detection, code translation, and code refinement.
arXiv Detail & Related papers (2020-09-17T15:25:56Z)
- CoreGen: Contextualized Code Representation Learning for Commit Message Generation [39.383390029545865]
We propose a novel Contextualized code representation learning strategy for commit message Generation (CoreGen).
Experiments on the benchmark dataset demonstrate the superior effectiveness of our model over the baseline models with at least 28.18% improvement in terms of BLEU-4 score.
arXiv Detail & Related papers (2020-07-14T09:43:26Z)
- A Transformer-based Approach for Source Code Summarization [86.08359401867577]
We learn code representation for summarization by modeling the pairwise relationship between code tokens.
We show that although the approach is simple, it outperforms the state-of-the-art techniques by a significant margin.
arXiv Detail & Related papers (2020-05-01T23:29:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.