DOCE: Finding the Sweet Spot for Execution-Based Code Generation
- URL: http://arxiv.org/abs/2408.13745v4
- Date: Wed, 16 Oct 2024 15:07:41 GMT
- Title: DOCE: Finding the Sweet Spot for Execution-Based Code Generation
- Authors: Haau-Sing Li, Patrick Fernandes, Iryna Gurevych, André F. T. Martins,
- Abstract summary: We propose a comprehensive framework that includes candidate generation, $n$-best reranking, minimum Bayes risk (MBR) decoding, and self-debugging as the core components.
Our findings highlight the importance of execution-based methods and the gap between execution-based and execution-free methods.
- Score: 69.5305729627198
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, a diverse set of decoding and reranking procedures have been shown effective for LLM-based code generation. However, a comprehensive framework that links and experimentally compares these methods is missing. We address this by proposing Decoding Objectives for Code Execution (DOCE), a comprehensive framework that includes candidate generation, $n$-best reranking, minimum Bayes risk (MBR) decoding, and self-debugging as the core components. We then study the contributions of these components through execution-based evaluation metrics. Our findings highlight the importance of execution-based methods and the gap between execution-based and execution-free methods. Furthermore, we assess the impact of filtering based on trial unit tests, a simple and effective strategy that has often been overlooked in prior works. We also propose self-debugging on multiple candidates, obtaining state-of-the-art performance on reranking for code generation. We expect our framework to provide a solid guideline for future research on code generation.
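As a concrete illustration of the MBR component, the following minimal sketch selects, from a pool of sampled candidates, the program whose execution outputs on a set of trial inputs agree most often with the other candidates' outputs. The helper `run_candidate`, the `solve` entry point, and the exact-match agreement utility are assumptions for illustration, not the paper's implementation, and sandboxing of untrusted code is omitted.

```python
# Minimal sketch of execution-based MBR decoding over sampled candidates.
# The agreement utility (exact match of outputs on trial inputs) is an
# illustrative assumption; sandboxing of untrusted code is omitted.

def run_candidate(code: str, test_input):
    """Execute one candidate on one trial input; assumes it defines `solve(x)`."""
    env = {}
    try:
        exec(code, env)
        return env["solve"](test_input)
    except Exception:
        return None  # record a failed execution as None

def mbr_select(candidates, trial_inputs):
    """Return the candidate whose outputs agree most with the other candidates."""
    outputs = [tuple(run_candidate(c, x) for x in trial_inputs) for c in candidates]
    scores = [sum(o == other for j, other in enumerate(outputs) if j != i)
              for i, o in enumerate(outputs)]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```

Filtering on trial unit tests, as studied in the paper, would simply discard candidates that fail the provided tests before this selection step.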
Related papers
- CodeTree: Agent-guided Tree Search for Code Generation with Large Language Models [91.15135237584771]
Large language models (LLMs) can act as agents with capabilities to self-refine and improve generated code autonomously.
We propose CodeTree, a framework for LLM agents to efficiently explore the search space in different stages of the code generation process.
Specifically, we adopt a unified tree structure to explicitly explore different coding strategies, generate corresponding coding solutions, and subsequently refine the solutions.
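As a rough sketch of the kind of tree such an agent could maintain, the snippet below defines a node holding either a strategy or a solution attempt, expanded by an LLM call; the class, field, and function names are assumptions for illustration rather than CodeTree's actual interface.

```python
from dataclasses import dataclass, field

@dataclass
class CodeTreeNode:
    # Illustrative node: a coding strategy or a concrete solution attempt.
    content: str                      # strategy description or generated code
    kind: str = "strategy"            # "strategy", "solution", or "refinement"
    score: float = 0.0                # agent-assigned or execution-based value
    children: list = field(default_factory=list)

    def expand(self, generate_fn, k: int = 3):
        """Ask an LLM agent (placeholder callable) for k child candidates."""
        for text in generate_fn(self.content, k):
            self.children.append(CodeTreeNode(content=text, kind="solution"))

    def best_leaf(self):
        """Greedily follow the highest-scoring child down to a leaf."""
        node = self
        while node.children:
            node = max(node.children, key=lambda n: n.score)
        return node
```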
arXiv Detail & Related papers (2024-11-07T00:09:54Z) - AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol that uses multiple LLMs, each independently generating an evaluation for a separate criterion, and then combines the evaluations via concatenation.
We show that AIME outperforms baseline methods on code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single-LLM evaluation protocol on the LeetCodeHard and HumanEval datasets.
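A minimal sketch of this concatenation-based protocol follows; the criteria list, prompt wording, and the `query_llm` placeholder are assumptions, since the paper's exact prompts are not reproduced here.

```python
# Sketch of AIME-style evaluation: one LLM call per criterion, outputs concatenated.
CRITERIA = ["correctness", "efficiency", "readability"]  # assumed criteria

def query_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def aime_evaluate(problem: str, code: str) -> str:
    evaluations = []
    for criterion in CRITERIA:
        prompt = (
            f"Evaluate the following solution only for {criterion}.\n"
            f"Problem:\n{problem}\n\nCode:\n{code}\n"
        )
        evaluations.append(f"[{criterion}] {query_llm(prompt)}")
    # Combine the independent single-criterion evaluations via concatenation.
    return "\n".join(evaluations)
```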
arXiv Detail & Related papers (2024-10-04T04:03:24Z) - Codev-Bench: How Do LLMs Understand Developer-Centric Code Completion? [60.84912551069379]
We present the Code-Development Benchmark (Codev-Bench), a fine-grained, real-world, repository-level, and developer-centric evaluation framework.
Codev-Agent is an agent-based system that automates repository crawling, constructs execution environments, extracts dynamic calling chains from existing unit tests, and generates new test samples to avoid data leakage.
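To make the call-chain extraction step concrete, here is a minimal sketch that records which functions a unit test exercises using Python's tracing hooks; it is an illustrative stand-in under simple assumptions, not Codev-Agent's actual pipeline.

```python
import sys

def trace_call_chain(test_fn):
    """Record the chain of Python function calls triggered by one unit test."""
    chain = []

    def tracer(frame, event, arg):
        if event == "call":
            chain.append(frame.f_code.co_name)
        return tracer

    sys.settrace(tracer)
    try:
        test_fn()
    finally:
        sys.settrace(None)
    return chain

# Example: functions recorded while a test exercises library code.
def helper():
    return 2

def target():
    return helper() + 1

def test_target():
    assert target() == 3

print(trace_call_chain(test_target))   # ['test_target', 'target', 'helper']
```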
arXiv Detail & Related papers (2024-10-02T09:11:10Z) - Sifting through the Chaff: On Utilizing Execution Feedback for Ranking the Generated Code Candidates [46.74037090843497]
Large Language Models (LLMs) are transforming the way developers approach programming by automatically generating code based on natural language descriptions.
This paper puts forward RankEF, an innovative approach for code ranking that leverages execution feedback.
Experiments on three code generation benchmarks demonstrate that RankEF significantly outperforms the state-of-the-art CodeRanker.
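A sketch of the kind of execution-feedback label such a ranker could be trained to predict is shown below; the feedback categories, the `solve` entry point, and the labeling logic are assumptions for illustration and omit sandboxing.

```python
from enum import Enum

class ExecFeedback(Enum):
    PASSED = 0
    WRONG_ANSWER = 1
    RUNTIME_ERROR = 2
    COMPILE_ERROR = 3

def label_candidate(code: str, tests) -> ExecFeedback:
    """Assumed labeling step: run a candidate against unit tests and record
    the kind of outcome, which an execution-feedback ranker learns to predict."""
    try:
        compiled = compile(code, "<candidate>", "exec")
    except SyntaxError:
        return ExecFeedback.COMPILE_ERROR
    env = {}
    try:
        exec(compiled, env)
        for inp, expected in tests:
            if env["solve"](inp) != expected:
                return ExecFeedback.WRONG_ANSWER
    except Exception:
        return ExecFeedback.RUNTIME_ERROR
    return ExecFeedback.PASSED
```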
arXiv Detail & Related papers (2024-08-26T01:48:57Z) - Code Agents are State of the Art Software Testers [10.730852617039451]
We investigate the capability of LLM-based Code Agents for formalizing user issues into test cases.
We propose a novel benchmark based on popular GitHub repositories, containing real-world issues, ground-truth patches, and golden tests.
We find that LLMs equipped with Code Agents designed for code repair perform surprisingly well at generating relevant test cases.
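One common way such generated tests are validated is a fail-to-pass check: the test should fail on the original buggy code and pass once the ground-truth patch is applied. The sketch below assumes this criterion and uses a placeholder `apply_patch`; it is not necessarily the benchmark's exact harness.

```python
import subprocess

def fail_to_pass(repo_dir: str, test_cmd: list[str], apply_patch) -> bool:
    """Assumed validity check for a generated test: it fails before the
    ground-truth patch is applied and passes afterwards."""
    fails_before = subprocess.run(test_cmd, cwd=repo_dir).returncode != 0
    apply_patch(repo_dir)                      # placeholder, e.g. `git apply ...`
    passes_after = subprocess.run(test_cmd, cwd=repo_dir).returncode == 0
    return fails_before and passes_after
```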
arXiv Detail & Related papers (2024-06-18T14:54:37Z) - Repoformer: Selective Retrieval for Repository-Level Code Completion [30.706277772743615]
Recent advances in retrieval-augmented generation (RAG) have initiated a new era in repository-level code completion.
In this paper, we propose a selective RAG framework to avoid retrieval when unnecessary.
We show that our framework is able to accommodate different generation models, retrievers, and programming languages.
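The selective-retrieval idea can be sketched as a single gating step, shown below; the self-assessment callable, the 0.5 threshold, and the way retrieved context is concatenated are all assumptions rather than Repoformer's actual interface.

```python
def selective_complete(prefix: str, generate_fn, retrieve_fn,
                       need_retrieval_fn, threshold: float = 0.5) -> str:
    """Sketch of selective RAG for code completion: retrieve repository context
    only when a self-assessment score suggests retrieval is likely to help.
    All three callables are placeholders; the threshold is an assumption."""
    if need_retrieval_fn(prefix) >= threshold:
        context = retrieve_fn(prefix)          # e.g. cross-file code snippets
        return generate_fn(context + "\n" + prefix)
    return generate_fn(prefix)                 # skip retrieval when unnecessary
```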
arXiv Detail & Related papers (2024-03-15T06:59:43Z) - StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks.
FGO provides Fine-Grained Optimization by masking unexecuted code segments, so that only executed code contributes to the model update.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
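The masking idea behind FGO can be sketched as a per-token weight derived from which source lines actually executed; the function below is an illustrative assumption about how such a mask could be built, not StepCoder's implementation.

```python
def fine_grained_mask(token_line_ids, executed_lines):
    """Sketch of FGO-style masking: tokens on lines that never executed get
    weight 0 so they do not contribute to the optimization objective.
    Mapping tokens to source line numbers is assumed to happen elsewhere."""
    return [1.0 if line in executed_lines else 0.0 for line in token_line_ids]

# Hypothetical usage inside a fine-tuning / RL step:
# mask = fine_grained_mask(line_ids, executed)
# loss = sum(l * m for l, m in zip(per_token_loss, mask)) / max(sum(mask), 1.0)
```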
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - A Review of Repository Level Prompting for LLMs [0.0]
Large Language Models (LLMs) have led to notable successes, such as achieving a 94.6% solve rate on the HumanEval benchmark.
There is an increasing commercial push for repository-level inline code completion tools, such as GitHub Copilot and Tab Nine.
This paper delves into the transition from individual coding problems to repository-scale solutions.
arXiv Detail & Related papers (2023-12-15T00:34:52Z) - RLTF: Reinforcement Learning from Unit Test Feedback [17.35361167578498]
Reinforcement Learning from Unit Test Feedback (RLTF) is a novel online RL framework that refines code LLMs with multi-granularity unit test feedback.
Our approach generates data in real-time during training and simultaneously utilizes fine-grained feedback signals to guide the model towards producing higher-quality code.
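A sketch of what multi-granularity feedback could look like as a reward signal is given below: a coarse scalar for the overall outcome plus the failing line number for finer credit assignment. The specific reward values, the `solve` entry point, and the error-localization method are assumptions for illustration, not RLTF's exact design.

```python
def multi_granularity_reward(code: str, tests):
    """Illustrative reward: coarse scalar for pass / wrong answer / runtime error,
    plus the line number of the first failure when one can be recovered."""
    env = {}
    try:
        exec(code, env)
        for inp, expected in tests:
            assert env["solve"](inp) == expected
    except AssertionError:
        return -0.3, None                      # wrong answer: coarse penalty only
    except Exception as err:
        line = getattr(err.__traceback__.tb_next, "tb_lineno", None)
        return -0.6, line                      # runtime error: localize failing line
    return 1.0, None                           # all unit tests passed
```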
arXiv Detail & Related papers (2023-07-10T05:18:18Z)