CODEP: Grammatical Seq2Seq Model for General-Purpose Code Generation
- URL: http://arxiv.org/abs/2211.00818v1
- Date: Wed, 2 Nov 2022 01:40:18 GMT
- Title: CODEP: Grammatical Seq2Seq Model for General-Purpose Code Generation
- Authors: Yihong Dong, Ge Li
- Abstract summary: General-purpose code generation aims to automatically convert the natural language (NL) description to code snippets in a general-purpose programming language (GPL) like Python.
Existing sequence-to-sequence (Seq2Seq) approaches generate code while neglecting grammar rules.
We propose CODEP, a grammatical Seq2Seq code generation framework equipped with a Pushdown automaton (PDA) module.
- Score: 13.702504014245713
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General-purpose code generation aims to automatically convert the natural
language (NL) description to code snippets in a general-purpose programming
language (GPL) like Python. Intrinsically, code generation is a special type of
text generation that generates well-formed text, i.e., code. However, existing
sequence-to-sequence (Seq2Seq) approaches generate GPL code while neglecting the
grammar rules. To this end, in this paper, we make the first attempt to
consider grammatical Seq2Seq models for general-purpose code generation and
propose CODEP, a grammatical Seq2Seq code generation framework equipped with a
Pushdown automaton (PDA) module. In the training stage, CODEP additionally
incorporates the state representation and the state prediction task, which
leverages PDA states to help CODEP comprehend the parsing process of the PDA
module. In the inference stage, CODEP generates well-formed code with the PDA
module and the joint prediction of PDA states. Furthermore, the PDA module can
be directly applied to Seq2Seq models without training to ensure the
grammatical correctness of the generated code. To evaluate the effectiveness of
our proposed method, we construct the PDA for Python, the most popular GPL, and
conduct extensive experiments on four benchmark datasets. The experimental
results demonstrate the superiority of CODEP compared to the state-of-the-art
approaches without pre-training, and the PDA module also yields significant improvements when applied to pre-trained models.
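To make the mechanism concrete, below is a minimal, hypothetical sketch of grammar-constrained decoding with a pushdown automaton, using a toy bracket grammar: at every step the automaton's stack determines which tokens are grammatically legal, and the decoder's scores are consulted only over that legal subset. The vocabulary, the fake_model_scores stand-in, and the length-budget heuristic are illustrative assumptions, not CODEP's actual implementation, which constructs the PDA for the full Python grammar and additionally learns to predict PDA states.

```python
# Minimal, hypothetical sketch of PDA-constrained decoding (not CODEP's actual
# implementation): a toy pushdown automaton over a bracket "grammar" masks the
# tokens a stand-in scoring function may emit, so every finished sequence is
# well-formed by construction.

from typing import Dict, List

VOCAB = ["(", "[", ")", "]", "<eos>"]          # hypothetical toy vocabulary
OPENERS = {"(": ")", "[": "]"}                 # opener -> matching closer

def allowed_tokens(stack: List[str]) -> List[str]:
    """Tokens the PDA permits given the current stack (its only memory)."""
    legal = list(OPENERS)                      # an opener is always legal
    if stack:
        legal.append(OPENERS[stack[-1]])       # closer must match the stack top
    else:
        legal.append("<eos>")                  # may only stop when balanced
    return legal

def pda_step(stack: List[str], token: str) -> List[str]:
    """One PDA transition: push on an opener, pop on a matching closer."""
    if token in OPENERS:
        return stack + [token]
    if token in OPENERS.values():
        return stack[:-1]                      # legality already checked above
    return stack                               # <eos> leaves the stack alone

def fake_model_scores(prefix: List[str]) -> Dict[str, float]:
    """Stand-in for a Seq2Seq decoder's next-token scores."""
    return {tok: 1.0 / (1 + len(prefix) + i) for i, tok in enumerate(VOCAB)}

def constrained_decode(max_len: int = 8) -> str:
    prefix: List[str] = []
    stack: List[str] = []
    for step in range(max_len):
        legal = allowed_tokens(stack)
        if len(stack) >= max_len - step:       # reserve steps to close brackets
            legal = [OPENERS[stack[-1]]]
        scores = fake_model_scores(prefix)
        token = max(legal, key=scores.__getitem__)  # greedy over legal tokens
        if token == "<eos>":
            break
        prefix.append(token)
        stack = pda_step(stack, token)
    return "".join(prefix)

if __name__ == "__main__":
    print(constrained_decode())  # always prints a well-formed bracket string
```

The same masking idea is what allows a PDA module to be applied to an existing Seq2Seq model without retraining; CODEP goes further by representing the PDA states and predicting them as an auxiliary task, so that the model comprehends the parsing process rather than merely being filtered by it.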
Related papers
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks from their control flow and data flow to bridge the gap between programming languages and natural language.
Various experiments and ablations on four datasets covering both the C++ and Python languages validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- Zero-Shot Code Representation Learning via Prompt Tuning [6.40875582886359]
We propose Zecoler, a zero-shot approach for learning code representations.
Zecoler is built upon a pre-trained programming language model.
We evaluate Zecoler in five code intelligence tasks including code clone detection, code search, method name prediction, code summarization, and code generation.
arXiv Detail & Related papers (2024-04-13T09:47:07Z)
- Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition (STR) methods have struggled to attain both high accuracy and fast inference speed.
We present an empirical study of autoregressive (AR) decoding in STR and discover that the AR decoder not only models linguistic context but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
arXiv Detail & Related papers (2023-07-23T09:04:13Z)
- Code Execution with Pre-trained Language Models [88.04688617516827]
Most pre-trained models for code intelligence ignore the execution trace and only rely on source code and syntactic structures.
We develop a mutation-based data augmentation technique to create a large-scale and realistic Python dataset and task for code execution.
We then present CodeExecutor, a Transformer model that leverages code execution pre-training and curriculum learning to enhance its semantic comprehension.
arXiv Detail & Related papers (2023-05-08T10:00:05Z)
- Stochastic Code Generation [1.7205106391379026]
Large language models pre-trained for code generation can generate high-quality short code but often struggle with generating coherent long code.
This issue is also observed in language modeling for long text generation.
In this study, we investigate whether techniques developed for long text generation can be applied to code generation to improve coherence.
arXiv Detail & Related papers (2023-04-14T00:01:05Z)
- Knowledge Transfer for Pseudo-code Generation from Low Resource Programming Language [13.716669765394293]
We focus on transferring the knowledge acquired by the code-to-pseudocode neural model trained on a high resource PL (C++) using parallel code-pseudocode data.
We observe an improvement of 23.27% in the success rate of the generated C code through back translation.
arXiv Detail & Related papers (2023-03-16T03:38:08Z)
- PAC Prediction Sets for Large Language Models of Code [19.071829387911276]
We propose a solution that considers a restricted set of prediction sets that can be compactly represented as partial programs.
This is the first research contribution that generates PAC prediction sets for generative code models.
arXiv Detail & Related papers (2023-02-17T05:32:24Z)
- CodeBERT: A Pre-Trained Model for Programming and Natural Languages [117.34242908773061]
CodeBERT is a pre-trained model for programming language (PL) and natural language (NL).
We develop CodeBERT with Transformer-based neural architecture.
We evaluate CodeBERT on two NL-PL applications by fine-tuning model parameters.
arXiv Detail & Related papers (2020-02-19T13:09:07Z)