JumpCoder: Go Beyond Autoregressive Coder via Online Modification
- URL: http://arxiv.org/abs/2401.07870v2
- Date: Wed, 5 Jun 2024 14:12:03 GMT
- Title: JumpCoder: Go Beyond Autoregressive Coder via Online Modification
- Authors: Mouxiang Chen, Hao Tian, Zhongxin Liu, Xiaoxue Ren, Jianling Sun,
- Abstract summary: We introduce JumpCoder, a novel model-agnostic framework that enables human-like online modification and non-sequential generation to augment code LLMs.
The key idea behind JumpCoder is to insert new code into the currently generated code when necessary during generation, which is achieved through an auxiliary infilling model.
- Score: 18.9350072969148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While existing code large language models (code LLMs) exhibit impressive capabilities in code generation, their autoregressive sequential generation inherently lacks reversibility. This limitation hinders them from timely correcting previous missing statements during coding as humans do, often leading to error propagation and suboptimal performance. We introduce JumpCoder, a novel model-agnostic framework that enables human-like online modification and non-sequential generation to augment code LLMs. The key idea behind JumpCoder is to insert new code into the currently generated code when necessary during generation, which is achieved through an auxiliary infilling model that works in tandem with the code LLM. Since identifying the best infill position beforehand is intractable, we adopt an *infill-first, judge-later* strategy, which experiments with filling at the *k* most critical positions following the generation of each line, and uses an Abstract Syntax Tree (AST) parser alongside the Generation Model Scoring to effectively judge the validity of each potential infill. Extensive experiments using six state-of-the-art code LLMs across multiple and multilingual benchmarks consistently indicate significant improvements over all baselines. Our code is public at https://github.com/Keytoyze/JumpCoder.
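The infill-first, judge-later loop can be pictured as a small decoding wrapper. The following is a minimal sketch rather than the authors' implementation: `generate_line`, `infill_at`, and `score` are placeholders for calls to the code LLM and the auxiliary infilling model, the "most critical positions" are approximated here as the k most recent line boundaries, and Python's standard `ast` module plays the role of the AST-based validity judge.

```python
import ast

def jumpcoder_step(lines, k, generate_line, infill_at, score):
    """One step in the spirit of JumpCoder's infill-first, judge-later loop (a sketch).

    generate_line(lines) -> next line from the code LLM,
    infill_at(lines, pos) -> a candidate line from the infilling model at position pos,
    score(lines)          -> the generation model's score for the code so far.
    All three are stand-ins for real model calls.
    """
    lines = lines + [generate_line(lines)]          # ordinary autoregressive step
    base = score(lines)
    # Infill first: try the k positions before the new line (approximating "most critical").
    for pos in range(max(0, len(lines) - 1 - k), len(lines) - 1):
        candidate = lines[:pos] + [infill_at(lines, pos)] + lines[pos:]
        # Judge later, stage 1: keep the infill only if the program still parses ...
        try:
            ast.parse("\n".join(candidate))
        except SyntaxError:
            continue
        # ... stage 2: and only if the generation model scores it higher than before.
        new_score = score(candidate)
        if new_score > base:
            lines, base = candidate, new_score
    return lines
```

The two-stage judge is the point of the sketch: the AST check cheaply rejects infills that break the program's structure, while the generation-model score decides whether a syntactically valid infill actually helps.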
Related papers
- Steering Large Language Models between Code Execution and Textual Reasoning [22.279107036500083]
Textual reasoning has inherent limitations in solving tasks with challenges in math, logic, optimization, and search.
The recently released OpenAI GPT Code Interpreter and multi-agent frameworks such as AutoGen have demonstrated remarkable proficiency in integrating code generation and execution.
We propose three methods to better steer LLM code/text generation and achieve a notable improvement.
arXiv Detail & Related papers (2024-10-04T15:44:47Z)
- When to Stop? Towards Efficient Code Generation in LLMs with Excess Token Prevention [43.39584272739589]
We introduce CodeFast, an inference acceleration approach for Code LLMs on code generation.
The key idea of CodeFast is to terminate the inference process in time when unnecessary excess tokens are detected.
We conduct extensive experiments with CodeFast on five representative Code LLMs across four widely used code generation datasets.
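A rough sketch of the excess-token idea (not CodeFast's actual mechanism): wrap greedy decoding with a detector that flags when the useful completion has already ended. Here `next_token` and `is_excess` are abstract stand-ins for the code LLM and whatever lightweight signal detects excess tokens.

```python
def generate_with_early_stop(prompt, next_token, is_excess, max_new_tokens=512):
    """Greedy decoding that terminates as soon as excess tokens are detected (a sketch).

    next_token(tokens) -> the code LLM's next token (stand-in for a real model call).
    is_excess(tokens)  -> True when generation has gone past the useful completion.
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == "<eos>":            # the model stopped on its own
            break
        if is_excess(tokens):         # stop early instead of emitting excess tokens
            tokens.pop()              # drop the token that tipped the detector
            break
    return tokens
```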
arXiv Detail & Related papers (2024-07-29T14:27:08Z)
- Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models [28.295926947968574]
Large language models (LLMs) have brought a paradigm shift to the field of code generation.
We empirically analyze the differences in coding style between the code generated by Code LLMs and the code written by human developers.
arXiv Detail & Related papers (2024-06-29T14:56:11Z)
- Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs [57.27982780697922]
Large language models have demonstrated exceptional capability in natural language understanding and generation.
However, their generation speed is limited by the inherently sequential nature of their decoding process.
This paper introduces Lexical Unit Decoding, a novel decoding methodology implemented in a data-driven manner.
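The gist of decoding in lexical units can be sketched as an accept-or-fall-back loop. This is illustrative only; `propose` and `confidence` stand in for the model proposing a multi-token unit in parallel and reporting its confidence in it.

```python
def lexical_unit_decode(prompt, propose, confidence, threshold=0.9, max_steps=256):
    """Decode in lexical units rather than single tokens (a sketch).

    propose(tokens)           -> a list of tokens the model would emit next (a candidate unit),
    confidence(tokens, unit)  -> the model's joint confidence in that unit.
    The unit is accepted wholesale when confidence is high; otherwise only its
    first token is kept, falling back to ordinary decoding to preserve quality.
    """
    tokens = list(prompt)
    for _ in range(max_steps):
        unit = propose(tokens)
        if not unit:
            break
        if confidence(tokens, unit) >= threshold:
            tokens.extend(unit)        # accept the whole lexical unit at once
        else:
            tokens.append(unit[0])     # fall back to token-by-token decoding
        if tokens[-1] == "<eos>":
            break
    return tokens
```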
arXiv Detail & Related papers (2024-05-24T04:35:13Z)
- CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code [56.019447113206006]
Large Language Models (LLMs) have achieved remarkable progress in code generation.
CodeIP is a novel multi-bit watermarking technique that embeds additional information to preserve provenance details.
Experiments conducted on a real-world dataset across five programming languages demonstrate the effectiveness of CodeIP.
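One way to picture grammar-guided multi-bit watermarking (a hypothetical sketch, not CodeIP's algorithm): at each step the vocabulary is split pseudo-randomly based on the context, the half selected by the next watermark bit is preferred, and only tokens that keep the code grammatically valid are allowed. `is_grammar_valid` is a placeholder for a parser or grammar check.

```python
import hashlib

def watermark_step(context, logits, vocab, bit, is_grammar_valid):
    """One decoding step of a grammar-guided multi-bit watermark (illustrative only).

    logits maps token -> score, bit is the next watermark bit to embed, and
    is_grammar_valid(context, token) stands in for a grammar / parser check.
    Tokens are split into two pseudo-random halves keyed on the context; the half
    selected by bit carries the watermark, but only grammar-valid tokens survive
    so the emitted code still parses.
    """
    seed = hashlib.sha256(" ".join(context).encode()).digest()
    marked = {t for i, t in enumerate(sorted(vocab)) if (seed[i % len(seed)] + i) % 2 == bit}
    allowed = [t for t in vocab if t in marked and is_grammar_valid(context, t)]
    if not allowed:   # never sacrifice grammatical validity for the watermark
        allowed = [t for t in vocab if is_grammar_valid(context, t)] or list(vocab)
    return max(allowed, key=lambda t: logits[t])
```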
arXiv Detail & Related papers (2024-04-24T04:25:04Z)
- StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only on executed code segments, masking the unexecuted ones to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
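FGO's masking can be illustrated with a small loss function. This is a sketch under the assumption that per-token losses and a line-level coverage trace from the unit tests are available; it is not the paper's implementation.

```python
def fgo_loss(token_losses, token_lines, executed_lines):
    """Fine-grained optimization in the spirit of StepCoder's FGO (a sketch).

    token_losses   : per-token loss values from the policy update,
    token_lines    : the source line each token belongs to,
    executed_lines : lines actually executed by the unit tests (e.g. a coverage trace).
    Tokens on unexecuted lines are masked out so they do not contribute to the update.
    """
    keep = [line in executed_lines for line in token_lines]
    masked = [loss for loss, k in zip(token_losses, keep) if k]
    return sum(masked) / max(1, len(masked))

# Tiny usage example with made-up numbers: only tokens on lines 1 and 2 are optimized.
print(fgo_loss([0.5, 0.2, 0.9, 0.4], token_lines=[1, 1, 2, 3], executed_lines={1, 2}))
```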
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
- Bridging Code Semantic and LLMs: Semantic Chain-of-Thought Prompting for Code Generation [22.219645213202178]
This paper proposes the "Semantic Chain-of-Thought" approach, named SeCoT, to introduce semantic information of code.
We show that SeCoT achieves state-of-the-art performance, greatly improving the potential of large models for code generation.
arXiv Detail & Related papers (2023-10-16T05:09:58Z)
- PanGu-Coder2: Boosting Large Language Models for Code with Ranking Feedback [5.459517921633247]
We propose a novel RRTF (Rank Responses to align Test&Teacher Feedback) framework, which can effectively and efficiently boost pre-trained large language models for code generation.
Under this framework, we present PanGu-Coder2, which achieves 62.20% pass@1 on the OpenAI HumanEval benchmark.
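The ranking component can be sketched as a pairwise margin loss over sampled responses, with test and teacher feedback supplying the ranks. The function below is illustrative and is not the paper's training objective.

```python
def rrtf_rank_loss(logprobs, rewards):
    """Pairwise ranking loss over sampled responses, in the spirit of RRTF (illustrative).

    logprobs[i] is the model's length-normalized log-probability of response i and
    rewards[i] its rank signal from tests / teacher feedback. Whenever a worse
    response is scored higher than a better one, the margin is penalized.
    """
    loss = 0.0
    for i in range(len(rewards)):
        for j in range(len(rewards)):
            if rewards[i] > rewards[j]:                    # i should outrank j
                loss += max(0.0, logprobs[j] - logprobs[i])
    return loss

# Example: the best-tested response (reward 1.0) is currently under-scored by the model.
print(rrtf_rank_loss(logprobs=[-1.2, -0.8, -2.0], rewards=[1.0, 0.3, 0.0]))
```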
arXiv Detail & Related papers (2023-07-27T15:28:29Z)
- LongCoder: A Long-Range Pre-trained Language Model for Code Completion [56.813974784131624]
LongCoder employs a sliding window mechanism for self-attention and introduces two types of globally accessible tokens.
Bridge tokens are inserted throughout the input sequence to aggregate local information and facilitate global interaction, while memory tokens highlight important statements that may be invoked later and need to be memorized.
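The resulting attention pattern can be sketched as a boolean mask: every token sees a causal sliding window plus a set of globally visible positions (bridge or memory tokens). The mask below illustrates the idea and is not LongCoder's exact layout.

```python
def longcoder_mask(n, window, global_positions):
    """Causal attention mask combining a sliding window with globally accessible tokens.

    Token i may attend to token j if j is causal (j <= i) and either within the
    local window or at a bridge / memory position visible to every token.
    """
    return [[j <= i and (i - j < window or j in global_positions) for j in range(n)]
            for i in range(n)]

# Example: 8 tokens, window of 3, with positions 0 and 4 acting as global tokens.
for row in longcoder_mask(8, window=3, global_positions={0, 4}):
    print(["x" if allowed else "." for allowed in row])
```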
arXiv Detail & Related papers (2023-06-26T17:59:24Z)
- CodeT5+: Open Code Large Language Models for Code Understanding and Generation [72.1638273937025]
Large language models (LLMs) pretrained on vast source code have achieved prominent progress in code intelligence.
CodeT5+ is a family of encoder-decoder LLMs for code in which component modules can be flexibly combined to suit a wide range of downstream code tasks.
We extensively evaluate CodeT5+ on over 20 code-related benchmarks in different settings, including zero-shot, finetuning, and instruction-tuning.
arXiv Detail & Related papers (2023-05-13T14:23:07Z)
- InCoder: A Generative Model for Code Infilling and Synthesis [88.46061996766348]
We introduce InCoder, a unified generative model that can perform program synthesis (via left-to-right generation) and editing (via infilling).
InCoder is trained to generate code files from a large corpus of permissively licensed code.
Our model is the first generative model that is able to directly perform zero-shot code infilling.
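The recipe for casting infilling as left-to-right generation can be sketched as prompt construction: replace the missing span with a sentinel, append the rest of the document, then let the model generate the span. The sentinel strings below are placeholders, not necessarily InCoder's actual special tokens.

```python
def build_infill_prompt(prefix, suffix, sentinel="<SENTINEL>"):
    """Turn an infilling problem into left-to-right generation (a generic sketch;
    the sentinel and end-of-document strings are placeholders, not InCoder's tokens).

    The region to fill is replaced by a sentinel, the document is closed with an
    end marker, and the sentinel is repeated so the model generates the missing span.
    """
    return prefix + sentinel + suffix + "<END-OF-DOC>" + sentinel

prompt = build_infill_prompt(
    prefix="def count_words(text):\n    ",
    suffix="\n    return n\n",
)
print(prompt)  # feed this to a left-to-right model and read off the generated span
```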
arXiv Detail & Related papers (2022-04-12T16:25:26Z)