RefAgent: A Multi-agent LLM-based Framework for Automatic Software Refactoring
- URL: http://arxiv.org/abs/2511.03153v1
- Date: Wed, 05 Nov 2025 03:20:58 GMT
- Title: RefAgent: A Multi-agent LLM-based Framework for Automatic Software Refactoring
- Authors: Khouloud Oueslati, Maxime Lamothe, Foutse Khomh,
- Abstract summary: RefAgent is a multi-agent LLM-based framework for end-to-end software.<n>It consists of specialized agents responsible for planning, executing, and iteratively refining testing.<n>It achieves a median unit test pass rate of 90%, reduces code smells by a median of 52.5%, and improves key quality attributes by a median of 8.6%.
- Score: 8.038518812060897
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have substantially influenced various software engineering tasks. Indeed, in the case of software refactoring, traditional LLMs have shown the ability to reduce development time and enhance code quality. However, these LLMs often rely on static, detailed instructions for specific tasks. In contrast, LLM-based agents can dynamically adapt to evolving contexts and autonomously make decisions by interacting with software tools and executing workflows. In this paper, we explore the potential of LLM-based agents in supporting refactoring activities. Specifically, we introduce RefAgent, a multi-agent LLM-based framework for end-to-end software refactoring. RefAgent consists of specialized agents responsible for planning, executing, testing, and iteratively refining refactorings using self-reflection and tool-calling capabilities. We evaluate RefAgent on eight open-source Java projects, comparing its effectiveness against a single-agent approach, a search-based refactoring tool, and historical developer refactorings. Our assessment focuses on: (1) the impact of generated refactorings on software quality, (2) the ability to identify refactoring opportunities, and (3) the contribution of each LLM agent through an ablation study. Our results show that RefAgent achieves a median unit test pass rate of 90%, reduces code smells by a median of 52.5%, and improves key quality attributes (e.g., reusability) by a median of 8.6%. Additionally, it closely aligns with developer refactorings and the search-based tool in identifying refactoring opportunities, attaining a median F1-score of 79.15% and 72.7%, respectively. Compared to single-agent approaches, RefAgent improves the median unit test pass rate by 64.7% and the median compilation success rate by 40.1%. These findings highlight the promise of multi-agent architectures in advancing automated software refactoring.
Related papers
- SWE-Refactor: A Repository-Level Benchmark for Real-World LLM-Based Code Refactoring [20.694251041823097]
Large Language Models (LLMs) have attracted wide interest for tackling software engineering tasks.<n>Existing benchmarks commonly suffer from three shortcomings.<n>SWE-Refactor comprises 1,099 developer-written, behavior-preserving LLMs mined from 18 Java projects.
arXiv Detail & Related papers (2026-02-03T16:36:29Z) - LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering [90.84806758077536]
We introduce textbfLoCoBench-Agent, a comprehensive evaluation framework specifically designed to assess large language models (LLMs) agents in realistic, long-context software engineering.<n>Our framework extends LoCoBench's 8,000 scenarios into interactive agent environments, enabling systematic evaluation of multi-turn conversations.<n>Our framework provides agents with 8 specialized tools (file operations, search, code analysis) and evaluates them across context lengths ranging from 10K to 1M tokens.
arXiv Detail & Related papers (2025-11-17T23:57:24Z) - Self-Abstraction from Grounded Experience for Plan-Guided Policy Refinement [61.35824395228412]
Large language model (LLM) based agents are increasingly used to tackle software engineering tasks.<n>We propose Self-Abstraction from Grounded Experience (SAGE), a framework that enables agents to learn from their own task executions.
arXiv Detail & Related papers (2025-11-08T08:49:38Z) - Agentic Refactoring: An Empirical Study of AI Coding Agents [9.698067623031909]
Agentic coding tools, such as OpenAI Codex, Claude Code, and Cursor, are transforming the software engineering landscape.<n>These AI-powered systems function as autonomous teammates capable of planning and executing complex development tasks.<n>There is a critical lack of empirical understanding regarding how agentic is utilized in practice, how it compares to human-driven, and what impact it has on code quality.
arXiv Detail & Related papers (2025-11-06T21:24:38Z) - LLM-based Multi-Agent System for Intelligent Refactoring of Haskell Code [3.8442921307218882]
We propose a large language model (LLM)-based multi-agent system to automate the process on Haskell code.<n>Results show that the proposed multi-agent system could average 11.03% decreased complexity in code, an improvement of 22.46% in overall code quality, and increase performance efficiency by an average of 13.27%.
arXiv Detail & Related papers (2025-06-24T10:17:34Z) - R&D-Agent: An LLM-Agent Framework Towards Autonomous Data Science [70.1638335489284]
High-level machine learning engineering tasks remain labor-intensive and iterative.<n>We introduce R&D-Agent, a comprehensive, decoupled, and framework that formalizes the machine learning process.<n>R&D-Agent defines the MLE into two phases and six components, turning agent design for MLE into a principled, testable process.
arXiv Detail & Related papers (2025-05-20T06:07:00Z) - MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration [44.75848695076576]
We introduce MANTRA, a comprehensive Large Language Models agent-based framework.<n>ManTRA integrates Context-Aware Retrieval-Augmented Generation, coordinated Multi-Agent Collaboration, and Verbal Reinforcement Learning.<n> Experimental results demonstrate that MANTRA substantially surpasses a baseline LLM model.
arXiv Detail & Related papers (2025-03-18T15:16:51Z) - ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [53.817538122688944]
We introduce Reinforced Meta-thinking Agents (ReMA) to elicit meta-thinking behaviors from Reasoning of Large Language Models (LLMs)<n>ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions.<n> Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
arXiv Detail & Related papers (2025-03-12T16:05:31Z) - Improving Retrospective Language Agents via Joint Policy Gradient Optimization [57.35348425288859]
RetroAct is a framework that jointly optimize both task-planning and self-reflective evolution capabilities in language agents.<n>We develop a two-stage joint optimization process that integrates imitation learning and reinforcement learning.<n>We conduct extensive experiments across various testing environments, demonstrating RetroAct has substantial improvements in task performance and decision-making processes.
arXiv Detail & Related papers (2025-03-03T12:54:54Z) - An Empirical Study on the Potential of LLMs in Automated Software Refactoring [9.157968996300417]
We investigate the potential of large language models (LLMs) in automated software.
We find that 13 out of the 176 solutions suggested by ChatGPT and 9 out of the 137 solutions suggested by Gemini were unsafe in that they either changed the functionality of the source code or introduced syntax errors.
arXiv Detail & Related papers (2024-11-07T05:35:55Z) - Together We Go Further: LLMs and IDE Static Analysis for Extract Method Refactoring [9.882903340467815]
Long methods that encapsulate multiple responsibilities within a single method are challenging to maintain.
Large Language Models (LLMs) have been trained on large code corpora.
LLMs are very effective for giving expert suggestions, yet they are unreliable: up to 76.3% of the suggestions are hallucinations.
arXiv Detail & Related papers (2024-01-27T05:01:03Z) - AgentBench: Evaluating LLMs as Agents [99.12825098528212]
Large Language Model (LLM) as agents has been widely acknowledged recently.<n>We present AgentBench, a benchmark that consists of 8 distinct environments to assess LLM-as-Agent's reasoning and decision-making abilities.
arXiv Detail & Related papers (2023-08-07T16:08:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.