An Empirical Study on the Code Refactoring Capability of Large Language Models
- URL: http://arxiv.org/abs/2411.02320v1
- Date: Mon, 04 Nov 2024 17:46:20 GMT
- Title: An Empirical Study on the Code Refactoring Capability of Large Language Models
- Authors: Jonathan Cordeiro, Shayan Noei, Ying Zou
- Abstract summary: This study empirically evaluates StarCoder2, an LLM optimized for code generation, in refactoring code across 30 open-source Java projects.
We compare StarCoder2's performance against human developers, focusing on (1) code quality improvements, (2) types and effectiveness of refactorings, and (3) enhancements through one-shot and chain-of-thought prompting.
- Score: 0.5852077003870416
- Abstract: Large Language Models (LLMs) have shown potential to enhance software development through automated code generation and refactoring, reducing development time and improving code quality. This study empirically evaluates StarCoder2, an LLM optimized for code generation, in refactoring code across 30 open-source Java projects. We compare StarCoder2's performance against human developers, focusing on (1) code quality improvements, (2) types and effectiveness of refactorings, and (3) enhancements through one-shot and chain-of-thought prompting. Our results indicate that StarCoder2 reduces code smells by 20.1% more than developers, excelling in systematic issues like Long Statement and Magic Number, while developers handle complex, context-dependent issues better. One-shot prompting increases the unit test pass rate by 6.15% and improves code smell reduction by 3.52%. Generating five refactorings per input further increases the pass rate by 28.8%, suggesting that combining one-shot prompting with multiple refactorings optimizes performance. These findings provide insights into StarCoder2's potential and best practices for integrating LLMs into software refactoring, supporting more efficient and effective code improvement in real-world applications.
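To make the smell types concrete, here is a minimal, hypothetical Java sketch (not drawn from the paper's 30 study projects) of a Magic Number and a Long Statement smell, together with the kind of mechanical refactoring the abstract credits StarCoder2 with handling well:

```java
// Illustrative only: hypothetical code showing two smells the study measures.
public class InvoiceCalculator {

    // Before: a "Magic Number" (0.19, 1.5, 4.99, ...) and a "Long Statement"
    // packing several concerns into one expression.
    public double totalBefore(double net, double shipping, boolean expedited) {
        return net + (net * 0.19) + (expedited ? shipping * 1.5 + 4.99 : shipping) - (net > 100 ? net * 0.05 : 0);
    }

    // After: constants are named and the long statement is decomposed into
    // intention-revealing steps.
    private static final double VAT_RATE = 0.19;
    private static final double EXPEDITED_FACTOR = 1.5;
    private static final double EXPEDITED_SURCHARGE = 4.99;
    private static final double BULK_THRESHOLD = 100.0;
    private static final double BULK_DISCOUNT_RATE = 0.05;

    public double totalAfter(double net, double shipping, boolean expedited) {
        double vat = net * VAT_RATE;
        double shippingCost = expedited ? shipping * EXPEDITED_FACTOR + EXPEDITED_SURCHARGE : shipping;
        double discount = net > BULK_THRESHOLD ? net * BULK_DISCOUNT_RATE : 0.0;
        return net + vat + shippingCost - discount;
    }
}
```

Both fixes are local and rule-like, which is consistent with the abstract's finding that the model excels at systematic issues while developers remain better at complex, context-dependent ones.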
Related papers
- Effi-Code: Unleashing Code Efficiency in Language Models [17.355845751737423]
Effi-Code is an approach to enhancing code generation in large language models.
Effi-Code offers a scalable and generalizable approach to improving code generation in AI systems.
arXiv Detail & Related papers (2024-10-14T07:05:51Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
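The summary leaves the loop itself abstract; below is a hedged Java sketch of what such a training-free critique-and-repair cycle could look like, where LlmClient and Compiler are hypothetical interfaces standing in for whichever model API and toolchain are actually used (the paper does not prescribe these names):

```java
import java.util.Optional;

// Hypothetical interfaces; the concrete model API and build toolchain are assumptions.
interface LlmClient {
    String generate(String prompt);
}

interface Compiler {
    // Returns diagnostics if compilation or tests fail, empty otherwise.
    Optional<String> check(String code);
}

// Minimal sketch of a training-free self-critique loop: generate code, then
// feed compiler feedback plus a bug-category critique back to the model.
public class SelfCritiqueLoop {
    public static String repair(LlmClient llm, Compiler compiler, String task, int maxRounds) {
        String code = llm.generate("Solve the task:\n" + task);
        for (int round = 0; round < maxRounds; round++) {
            Optional<String> diagnostics = compiler.check(code);
            if (diagnostics.isEmpty()) {
                return code; // compiles and passes checks
            }
            String critiquePrompt = "Task:\n" + task
                    + "\nYour previous code:\n" + code
                    + "\nCompiler/test feedback:\n" + diagnostics.get()
                    + "\nCritique the code, name the likely bug category, then return a corrected version.";
            code = llm.generate(critiquePrompt);
        }
        return code; // best effort after maxRounds
    }
}
```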
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents.
We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z) - StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components.
CCCS addresses the exploration challenge by breaking long-sequence code generation into a Curriculum of Code Completion Subtasks.
FGO optimizes the model only on the executed code segments, masking the unexecuted ones to provide Fine-Grained Optimization.
Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z) - ReGAL: Refactoring Programs to Discover Generalizable Abstractions [59.05769810380928]
Generalizable Abstraction Learning (ReGAL) is a method for learning a library of reusable functions via code refactorization.
We find that the shared function libraries discovered by ReGAL make programs easier to predict across diverse domains.
For CodeLlama-13B, ReGAL results in absolute accuracy increases of 11.5% on LOGO, 26.1% on date understanding, and 8.1% on TextCraft, outperforming GPT-3.5 in two of three domains.
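ReGAL operates on domains such as LOGO, date understanding, and TextCraft rather than Java, but the core idea, refactoring repeated logic into a shared helper that later programs reuse, can be illustrated with a small hypothetical Java sketch:

```java
// Illustrative sketch of the ReGAL idea in Java terms: two "programs" that
// originally repeated the same drawing logic now call one shared abstraction.
public class SharedAbstractions {

    // Discovered reusable helper: emit commands for a regular polygon.
    static void drawPolygon(StringBuilder canvas, int sides, double length) {
        double turn = 360.0 / sides;
        for (int i = 0; i < sides; i++) {
            canvas.append(String.format("forward %.1f; turn %.1f%n", length, turn));
        }
    }

    // Program 1: a triangle, expressed through the shared helper.
    static String triangle() {
        StringBuilder canvas = new StringBuilder();
        drawPolygon(canvas, 3, 50.0);
        return canvas.toString();
    }

    // Program 2: a hexagon, reusing the same abstraction instead of
    // re-deriving the loop and turn angle inline.
    static String hexagon() {
        StringBuilder canvas = new StringBuilder();
        drawPolygon(canvas, 6, 30.0);
        return canvas.toString();
    }

    public static void main(String[] args) {
        System.out.print(triangle());
        System.out.print(hexagon());
    }
}
```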
arXiv Detail & Related papers (2024-01-29T18:45:30Z) - LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B on the transformed programs improves performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z) - Empirical Evaluation of a Live Environment for Extract Method Refactoring [0.0]
We developed a Live Refactoring Environment that visually identifies, recommends, and applies Extract Methods.
Our results were significantly different from, and better than, those obtained by refactoring the code manually without further help.
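For readers unfamiliar with the operation being recommended, here is a small hypothetical Java example of Extract Method, the refactoring such a live environment identifies and applies:

```java
// Hypothetical before/after illustrating an Extract Method refactoring.
public class ReportPrinter {

    // Before: summing and printing are tangled in one method.
    void printReportBefore(String user, double[] amounts) {
        double total = 0;
        for (double a : amounts) {
            total += a;
        }
        System.out.println("Report for " + user);
        System.out.println("Total: " + String.format("%.2f", total));
    }

    // After: the summing logic is extracted into its own named method,
    // which is what an Extract Method refactoring produces.
    void printReportAfter(String user, double[] amounts) {
        System.out.println("Report for " + user);
        System.out.println("Total: " + String.format("%.2f", sum(amounts)));
    }

    private double sum(double[] amounts) {
        double total = 0;
        for (double a : amounts) {
            total += a;
        }
        return total;
    }
}
```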
arXiv Detail & Related papers (2023-07-20T16:36:02Z) - Do code refactorings influence the merge effort? [80.1936417993664]
Multiple contributors frequently change the source code in parallel to implement new features, fix bugs, refactor existing code, and make other changes.
These simultaneous changes need to be merged into the same version of the source code.
Studies show that 10 to 20 percent of all merge attempts result in conflicts, which require manual intervention by developers to complete the process.
arXiv Detail & Related papers (2023-05-10T13:24:59Z) - How We Refactor and How We Document it? On the Use of Supervised Machine
Learning Algorithms to Classify Refactoring Documentation [25.626914797750487]
Refactoring is the art of improving the design of a system without altering its external behavior.
This study categorizes commits into 3 categories, namely, Internal QA, External QA, and Code Smell Resolution, along with the traditional BugFix and Functional categories.
To better understand our classification results, we analyzed commit messages to extract patterns that developers regularly use to describe their smells.
arXiv Detail & Related papers (2020-10-26T20:33:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.