Related papers: Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach

URL: http://arxiv.org/abs/2505.23953v1
Date: Thu, 29 May 2025 19:06:14 GMT
Title: Enhancing LLM-Based Code Generation with Complexity Metrics: A Feedback-Driven Approach
Authors: Melika Sepidband, Hamed Taherkhani, Song Wang, Hadi Hemmati
Abstract summary: We investigate the relationship between code complexity and the success of code generated by Large Language Models. We propose an iterative feedback method, where LLMs are prompted to generate correct code based on complexity metrics from previous failed outputs. Experimental results show that our approach makes notable improvements, particularly with a smaller LLM.
Score: 6.289275189295223
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Automatic code generation has gained significant momentum with the advent of Large Language Models (LLMs) such as GPT-4. Although many studies focus on improving the effectiveness of LLMs for code generation, very limited work tries to understand the generated code's characteristics and leverage them to improve failed cases. In this paper, as the most straightforward characteristic of code, we investigate the relationship between code complexity and the success of LLM-generated code. Using a large set of standard complexity metrics, we first conduct an empirical analysis to explore their correlation with LLMs' performance on code generation (i.e., Pass@1). Using logistic regression models, we identify which complexity metrics are most predictive of code correctness. Building on these findings, we propose an iterative feedback method, where LLMs are prompted to generate correct code based on complexity metrics from previous failed outputs. We validate our approach across multiple benchmarks (i.e., HumanEval, MBPP, LeetCode, and BigCodeBench) and various LLMs (i.e., GPT-4o, GPT-3.5 Turbo, Llama 3.1, and GPT-o3 mini), comparing the results with two baseline methods: (a) zero-shot generation, and (b) iterative execution-based feedback without our code complexity insights. Experimental results show that our approach makes notable improvements, particularly with a smaller LLM (GPT-3.5 Turbo), where, e.g., Pass@1 increased by 35.71% compared to the baseline's improvement of 12.5% on the HumanEval dataset. The study expands experiments to BigCodeBench and integrates the method with the Reflexion code generation agent, leading to Pass@1 improvements of 20% (GPT-4o) and 23.07% (GPT-o3 mini). The results highlight that complexity-aware feedback enhances both direct LLM prompting and agent-based workflows.
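
The iterative feedback idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the metric set or the prompt wording, so the two metrics here (a rough cyclomatic count and non-blank lines of code), the prompt text, and the names complexity_metrics, passes, and feedback_loop are all assumptions made for illustration. The generate argument stands in for any LLM call.

```python
import ast
from typing import Callable, Dict, List, Tuple

# Node types treated as decision points. This is only an approximation of
# cyclomatic complexity, chosen for illustration (assumption, not the paper's metric set).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.BoolOp, ast.IfExp, ast.comprehension)


def complexity_metrics(source: str) -> Dict[str, int]:
    """Return two simple complexity metrics for a piece of Python source."""
    tree = ast.parse(source)
    branches = sum(isinstance(node, _BRANCH_NODES) for node in ast.walk(tree))
    loc = len([line for line in source.splitlines() if line.strip()])
    return {"cyclomatic": branches + 1, "loc": loc}


def passes(source: str, func_name: str, tests: List[Tuple[tuple, object]]) -> bool:
    """Run the candidate against (args, expected) pairs.

    exec() on untrusted model output is unsafe; a real setup would sandbox this.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        return all(namespace[func_name](*args) == expected
                   for args, expected in tests)
    except Exception:
        return False


def feedback_loop(task: str,
                  generate: Callable[[str], str],
                  func_name: str,
                  tests: List[Tuple[tuple, object]],
                  max_iters: int = 3) -> Tuple[str, bool, int]:
    """Regenerate code, feeding complexity metrics of each failed attempt back in.

    Returns (last_code, success_flag, attempts_used).
    """
    prompt, code = task, ""
    for attempt in range(1, max_iters + 1):
        code = generate(prompt)
        if passes(code, func_name, tests):
            return code, True, attempt
        try:
            metrics = complexity_metrics(code)
        except SyntaxError:
            metrics = {}  # unparsable output: no metrics to report
        # Hypothetical prompt wording; the paper's actual prompt is not given here.
        prompt = (f"{task}\n\nThe previous attempt failed its tests.\n"
                  f"Complexity metrics of that attempt: {metrics}\n"
                  "Please produce a corrected solution.")
    return code, False, max_iters
```

In this sketch the loop stops at the first attempt that passes the tests; the execution-only baseline described in the abstract would correspond to the same loop with the metrics line omitted from the follow-up prompt.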

Related papers

Evaluating and Achieving Controllable Code Completion in Code LLM [89.64782747840225]
We present the first instruction-guided code completion benchmark, Controllable Code Completion Benchmark (C3-Bench). We reveal substantial gaps in instruction-following capabilities between open-source and advanced proprietary models during code completion tasks. The resulting model, Qwen2.5-Coder-C3, achieves state-of-the-art performance on C3-Bench.
arXiv Detail & Related papers (2026-01-22T11:40:04Z)
An Experimental Study of Real-Life LLM-Proposed Performance Improvements [2.503024366864326]
Large Language Models (LLMs) can generate code, but can they generate fast code? We study this question using a dataset of 65 real-world tasks mined from open-source Java programs.
arXiv Detail & Related papers (2025-10-17T10:06:52Z)
Enhancing LLM-based Fault Localization with a Functionality-Aware Retrieval-Augmented Generation Framework [14.287359838639608]
FaR-Loc is a framework that enhances method-level fault localization. FaR-Loc consists of three key components: LLM Functionality Extraction, Semantic Retrieval, and LLM Re-ranking. Our experiments on the widely used Defects4J benchmark show that FaR-Loc outperforms state-of-the-art LLM-based baselines.
arXiv Detail & Related papers (2025-09-24T20:37:11Z)
Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z)
PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback [78.89596149768458]
Large Language Models (LLMs) are widely adopted for assisting in software development tasks. We propose PerfCodeGen, a training-free framework that enhances the performance of LLM-generated code.
arXiv Detail & Related papers (2024-11-18T06:22:38Z)
Large Language Models as Code Executors: An Exploratory Study [29.545321608864295]
This paper pioneers the exploration of Large Language Models (LLMs) as code executors. We are the first to examine this feasibility across various LLMs, including OpenAI's o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder. We introduce an Iterative Instruction Prompting (IIP) technique that processes code snippets line by line, enhancing the accuracy of weaker models by an average of 7.22%.
arXiv Detail & Related papers (2024-10-09T08:23:22Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions. We develop a taxonomy of bugs for incorrect code that includes three categories and 12 sub-categories, and analyze the root cause for common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
Applying RLAIF for Code Generation with API-usage in Lightweight LLMs [15.366324461797582]
Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains. This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (1B parameters) LLMs.
arXiv Detail & Related papers (2024-06-28T17:16:03Z)
Natural Is The Best: Model-Agnostic Code Simplification for Pre-trained Large Language Models [6.646510073473929]
We propose SlimCode, a model-agnostic code simplification solution for Large Language Models. SlimCode can improve the state-of-the-art technique by 9.46% and 5.15% in terms of MRR and BLEU score on code search and summarization.
arXiv Detail & Related papers (2024-05-18T06:15:52Z)
SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents. We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z)
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components. CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks. FGO optimizes the model only on executed code by masking the unexecuted code segments, providing Fine-Grained Optimization. Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system. We build a novel data-cleaning pipeline that uses these principles to transform existing programs. We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)