CODEP: Grammatical Seq2Seq Model for General-Purpose Code Generation
- URL: http://arxiv.org/abs/2211.00818v1
- Date: Wed, 2 Nov 2022 01:40:18 GMT
- Title: CODEP: Grammatical Seq2Seq Model for General-Purpose Code Generation
- Authors: Yihong Dong, Ge Li
- Abstract summary: General-purpose code generation aims to automatically convert the natural language (NL) description to code snippets in a general-purpose programming language (GPL) like Python.
Existing sequence-to-sequence (Seq2Seq) approaches generate code while neglecting grammar rules.
We propose CODEP, a grammatical Seq2Seq code generation framework equipped with a Pushdown automaton (PDA) module.
- Score: 13.702504014245713
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General-purpose code generation aims to automatically convert the natural
language (NL) description to code snippets in a general-purpose programming
language (GPL) like Python. Intrinsically, code generation is a special type of
text generation that generates well-formed text, i.e., code. However, existing
sequence-to-sequence (Seq2Seq) approaches generate GPL code while neglecting the
grammar rules. To this end, in this paper, we make the first attempt to
consider grammatical Seq2Seq models for general-purpose code generation and
propose CODEP, a grammatical Seq2Seq code generation framework equipped with a
Pushdown automaton (PDA) module. In the training stage, CODEP additionally
incorporates the state representation and the state prediction task, which
leverages PDA states to help CODEP comprehend the parsing process of the PDA
module. In the inference stage, CODEP generates well-formed code with the PDA
module and the joint prediction of PDA states. Furthermore, the PDA module can
be directly applied to Seq2Seq models without training to ensure the
grammatical correctness of the generated code. To evaluate the effectiveness of
our proposed method, we construct the PDA for Python, the most popular GPL, and
conduct extensive experiments on four benchmark datasets. The experimental
results demonstrate the superiority of CODEP compared to the state-of-the-art
approaches without pre-training, and the PDA module also yields significant improvements when applied to pre-trained models.
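To make the mechanism concrete, below is a minimal, hypothetical sketch of grammar-constrained decoding with a pushdown automaton, using a toy bracket grammar: at every step the automaton's stack determines which tokens are grammatically legal, and the decoder's scores are consulted only over that legal subset. The vocabulary, the fake_model_scores stand-in, and the length-budget heuristic are illustrative assumptions, not CODEP's actual implementation, which constructs the PDA for the full Python grammar and additionally learns to predict PDA states.

```python
# Minimal, hypothetical sketch of PDA-constrained decoding (not CODEP's actual
# implementation): a toy pushdown automaton over a bracket "grammar" masks the
# tokens a stand-in scoring function may emit, so every finished sequence is
# well-formed by construction.

from typing import Dict, List

VOCAB = ["(", "[", ")", "]", "<eos>"]          # hypothetical toy vocabulary
OPENERS = {"(": ")", "[": "]"}                 # opener -> matching closer

def allowed_tokens(stack: List[str]) -> List[str]:
    """Tokens the PDA permits given the current stack (its only memory)."""
    legal = list(OPENERS)                      # an opener is always legal
    if stack:
        legal.append(OPENERS[stack[-1]])       # closer must match the stack top
    else:
        legal.append("<eos>")                  # may only stop when balanced
    return legal

def pda_step(stack: List[str], token: str) -> List[str]:
    """One PDA transition: push on an opener, pop on a matching closer."""
    if token in OPENERS:
        return stack + [token]
    if token in OPENERS.values():
        return stack[:-1]                      # legality already checked above
    return stack                               # <eos> leaves the stack alone

def fake_model_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a Seq2Seq decoder's next-token scores."""
    return {tok: 1.0 / (1 + len(prefix) + i) for i, tok in enumerate(VOCAB)}

def constrained_decode(max_len: int = 8) -> str:
    prefix: List[str] = []
    stack: List[str] = []
    for step in range(max_len):
        legal = allowed_tokens(stack)
        if len(stack) >= max_len - step:       # reserve steps to close brackets
            legal = [OPENERS[stack[-1]]]
        scores = fake_model_scores(prefix)
        token = max(legal, key=scores.__getitem__)  # greedy over legal tokens
        if token == "<eos>":
            break
        prefix.append(token)
        stack = pda_step(stack, token)
    return "".join(prefix)

if __name__ == "__main__":
    print(constrained_decode())  # always prints a well-formed bracket string
```

The same masking idea is what allows a PDA module to be applied to an existing Seq2Seq model without retraining; CODEP goes further by representing the PDA states and predicting them as an auxiliary task, so that the model comprehends the parsing process rather than merely being filtered by it.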
Related papers
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Various experiments and ablations on four datasets covering both the C++ and Python languages validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- Zero-Shot Code Representation Learning via Prompt Tuning [6.40875582886359]
We propose Zecoler, a zero-shot approach for learning code representations.
Zecoler is built upon a pre-trained programming language model.
We evaluate Zecoler in five code intelligence tasks including code clone detection, code search, method name prediction, code summarization, and code generation.
arXiv Detail & Related papers (2024-04-13T09:47:07Z)
- Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition (STR) methods have struggled to attain both high accuracy and fast inference speed.
We present an empirical study of autoregressive (AR) decoding in STR and discover that the AR decoder not only models linguistic context but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
arXiv Detail & Related papers (2023-07-23T09:04:13Z)
- Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
- Stochastic Code Generation [1.7205106391379026]
Large language models pre-trained for code generation can generate high-quality short code but often struggle with generating coherent long code.
This issue is also observed in language modeling for long text generation.
In this study, we investigate whether techniques developed for long text generation can be applied to code generation to improve coherence.
arXiv Detail & Related papers (2023-04-14T00:01:05Z)
- Knowledge Transfer for Pseudo-code Generation from Low Resource Programming Language [13.716669765394293]
We focus on transferring the knowledge acquired by the code-to-pseudocode neural model trained on a high resource PL (C++) using parallel code-pseudocode data.
We observe an improvement of 23.27% in the success rate of the generated C code through back translation.
arXiv Detail & Related papers (2023-03-16T03:38:08Z)
- PAC Prediction Sets for Large Language Models of Code [19.071829387911276]
We propose a solution that considers a restricted set of prediction sets that can be compactly represented as partial programs.
This is the first research contribution that generates PAC prediction sets for generative code models.
arXiv Detail & Related papers (2023-02-17T05:32:24Z)
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages [117.34242908773061]
CodeBERT is a pre-trained model for programming language (PL) and natural language (NL).
We develop CodeBERT with Transformer-based neural architecture.
We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters.
arXiv Detail & Related papers (2020-02-19T13:09:07Z)