Related papers: Failure-Aware Enhancements for Large Language Model (LLM) Code Generation: An Empirical Study on Decision Framework

Failure-Aware Enhancements for Large Language Model (LLM) Code Generation: An Empirical Study on Decision Framework

URL: http://arxiv.org/abs/2602.02896v1
Date: Mon, 02 Feb 2026 23:08:03 GMT
Title: Failure-Aware Enhancements for Large Language Model (LLM) Code Generation: An Empirical Study on Decision Framework
Authors: Jianru Shen, Zedong Peng, Lucy Owen,
Abstract summary: In an empirical study of 25 GitHub projects, we found that progressive prompting achieves 96.9% average task completion.<n>Self-critique succeeds on code-reviewable logic errors but fails completely on external service integration.<n>RAG achieves highest completion across all failure types with superior efficiency.
Score: 0.26508608365976566
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) show promise for automating software development by translating requirements into code. However, even advanced prompting workflows like progressive prompting often leave some requirements unmet. Although methods such as self-critique, multi-model collaboration, and retrieval-augmented generation (RAG) have been proposed to address these gaps, developers lack clear guidance on when to use each. In an empirical study of 25 GitHub projects, we found that progressive prompting achieves 96.9% average task completion, significantly outperforming direct prompting (80.5%, Cohen's d=1.63, p<0.001) but still leaving 8 projects incomplete. For 6 of the most representative projects, we evaluated each enhancement strategy across 4 failure types. Our results reveal that method effectiveness depends critically on failure characteristics: Self-Critique succeeds on code-reviewable logic errors but fails completely on external service integration (0% improvement), while RAG achieves highest completion across all failure types with superior efficiency. Based on these findings, we propose a decision framework that maps each failure pattern to the most suitable enhancement method, giving practitioners practical, data-driven guidance instead of trial-and-error.

Related papers

Holistic Evaluation of State-of-the-Art LLMs for Code Generation [5.504955093712013]
DeepSeek-R1 and GPT-4.1 consistently outperform others in terms of correctness, efficiency, and robustness.<n>We identify common failure scenarios such as syntax errors, logical flaws, and suboptimal algorithms.
arXiv Detail & Related papers (2025-12-19T23:29:05Z)
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models [59.90381306452982]
evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer.<n>We introduce SWE-1, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework.<n>SWE- spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2000 high-quality instances curated from authentic GitHub pull requests.
arXiv Detail & Related papers (2025-11-07T18:01:32Z)
MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems.<n>We build a benchmark consisting 1,260 samples of 42 challenging synthetic tasks.<n>We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z)
DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision [50.89715397781075]
Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing capability for complex tasks.<n>We propose DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating decision-making and execution.<n>We show that DecEx-RAG achieves an average absolute performance improvement of $6.2%$ across six datasets.
arXiv Detail & Related papers (2025-10-07T08:49:22Z)
A Stitch in Time Saves Nine: Proactive Self-Refinement for Language Models [53.31664844941449]
ProActive Self-Refinement (PASR) is a novel method for improving large language models (LLMs)<n>Unlike methods that regenerate entire responses, PASR proactively decides whether, when, and how to refine based on the model's internal state and evolving context.<n>We conduct extensive experiments on a diverse set of 10 tasks to evaluate the effectiveness of PASR.
arXiv Detail & Related papers (2025-08-18T13:07:21Z)
OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks.<n>We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains.<n>Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z)
Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models [0.0]
Generative AI is gaining increasing attention in software engineering.<n>Unit tests constitute the majority of test cases and are often schematic.<n>This paper investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests.
arXiv Detail & Related papers (2025-07-18T11:23:17Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models.<n>Our framework incorporates two complementary strategies: internal TTC and external TTC.<n>We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
Chain of Draft for Software Engineering: Challenges in Applying Concise Reasoning to Code Tasks [0.0]
This research extends the Chain of Draft (CoD) method to software engineering.<n>All CoD variants used significantly fewer tokens than Chain of Thought (CoT)<n>CoD variants maintain over 90% of CoT's code quality across key metrics including correctness, compatibility, and maintainability.
arXiv Detail & Related papers (2025-03-12T07:44:18Z)
Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective [98.29190911211053]
Chain-of-Reasoning (CoR) is a novel unified framework integrating multiple reasoning paradigms.<n>CoR generates multiple potential answers via different reasoning paradigms and synthesizes them into a coherent final solution.<n> Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models.
arXiv Detail & Related papers (2025-01-19T16:53:26Z)
Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation [27.484259938667776]
Large Language Models excel at code generation yet struggle with complex programming tasks that demand reasoning.<n>We introduce Outcome Refining Process Supervision, which unifies process and outcome supervision by leveraging executable verification.<n>Experiments across 5 models and 3 benchmarks show consistent gains, with 26.9% higher correctness and 42.2% improved code efficiency.
arXiv Detail & Related papers (2024-12-19T17:59:42Z)
LLM Agents Improve Semantic Code Search [6.047454623201181]
We introduce the approach of using Retrieval Augmented Generation powered agents to inject information into user prompts. By utilizing RAG, agents enhance user queries with relevant details from GitHub repositories, making them more informative and contextually aligned. Experimental results on the CodeSearchNet dataset demonstrate that RepoRift significantly outperforms existing methods.
arXiv Detail & Related papers (2024-08-05T00:43:56Z)
Masked Thought: Simply Masking Partial Reasoning Steps Can Improve Mathematical Reasoning Learning of Language Models [102.72940700598055]
In reasoning tasks, even a minor error can cascade into inaccurate results. We develop a method that avoids introducing external resources, relying instead on perturbations to the input. Our training approach randomly masks certain tokens within the chain of thought, a technique we found to be particularly effective for reasoning tasks.
arXiv Detail & Related papers (2024-03-04T16:21:54Z)
Language Models for Code Completion: A Practical Evaluation [13.174471984950857]
This study provides both quantitative and qualitative assessments of three public code language models when completing real-world code. We collected real auto-completion usage data for over a year from more than 1200 users, resulting in over 600K valid completions. We found that 66.3% of failures were due to the models' limitations, 24.4% occurred due to inappropriate model usage in a development context, and 9.3% were valid requests that developers overwrote.
arXiv Detail & Related papers (2024-02-25T20:43:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.