Agentic Refactoring: An Empirical Study of AI Coding Agents
- URL: http://arxiv.org/abs/2511.04824v1
- Date: Thu, 06 Nov 2025 21:24:38 GMT
- Title: Agentic Refactoring: An Empirical Study of AI Coding Agents
- Authors: Kosei Horikawa, Hao Li, Yutaro Kashiwa, Bram Adams, Hajimu Iida, Ahmed E. Hassan
- Abstract summary: Agentic coding tools, such as OpenAI Codex, Claude Code, and Cursor, are transforming the software engineering landscape. These AI-powered systems function as autonomous teammates capable of planning and executing complex development tasks. There is a critical lack of empirical understanding regarding how agentic refactoring is utilized in practice, how it compares to human-driven refactoring, and what impact it has on code quality.
- Score: 9.698067623031909
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agentic coding tools, such as OpenAI Codex, Claude Code, and Cursor, are transforming the software engineering landscape. These AI-powered systems function as autonomous teammates capable of planning and executing complex development tasks. Agents have become active participants in refactoring, a cornerstone of sustainable software development aimed at improving internal code quality without altering observable behavior. Despite their increasing adoption, there is a critical lack of empirical understanding regarding how agentic refactoring is utilized in practice, how it compares to human-driven refactoring, and what impact it has on code quality. To address this empirical gap, we present a large-scale study of AI agent-generated refactorings in real-world open-source Java projects, analyzing 15,451 refactoring instances across 12,256 pull requests and 14,988 commits derived from the AIDev dataset. Our empirical analysis shows that refactoring is a common and intentional activity in this development paradigm, with agents explicitly targeting refactoring in 26.1% of commits. Analysis of refactoring types reveals that agentic efforts are dominated by low-level, consistency-oriented edits, such as Change Variable Type (11.8%), Rename Parameter (10.4%), and Rename Variable (8.5%), reflecting a preference for localized improvements over the high-level design changes common in human refactoring. Additionally, the motivations behind agentic refactoring focus overwhelmingly on internal quality concerns, with maintainability (52.5%) and readability (28.1%). Furthermore, quantitative evaluation of code quality metrics shows that agentic refactoring yields small but statistically significant improvements in structural metrics, particularly for medium-level changes, reducing class size and complexity (e.g., Class LOC median $\Delta$ = -15.25).
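The three most frequent agentic refactoring types reported in the abstract (Change Variable Type, Rename Parameter, Rename Variable) can be illustrated with a minimal before/after sketch. The class and method names below are hypothetical, chosen for illustration, and are not drawn from the studied projects:

```java
import java.util.ArrayList;
import java.util.List;

public class RefactoringExample {
    // Before: concrete collection type and terse, uninformative names.
    static int sumBefore(ArrayList<Integer> l) {
        int s = 0;
        for (int x : l) s += x;
        return s;
    }

    // After three low-level, behavior-preserving edits:
    //   Change Variable Type: ArrayList<Integer> -> List<Integer> (interface)
    //   Rename Parameter:     l -> values
    //   Rename Variable:      s -> total
    static int sumAfter(List<Integer> values) {
        int total = 0;
        for (int x : values) total += x;
        return total;
    }

    public static void main(String[] args) {
        ArrayList<Integer> data = new ArrayList<>(List.of(1, 2, 3));
        // Refactoring must not alter observable behavior: both versions agree.
        System.out.println(sumBefore(data) == sumAfter(data)); // prints "true"
    }
}
```

Edits of this kind are exactly the localized, consistency-oriented changes the study finds agents favor over high-level design restructurings such as class extraction.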
Related papers
- SWE-Refactor: A Repository-Level Benchmark for Real-World LLM-Based Code Refactoring [20.694251041823097]
Large Language Models (LLMs) have attracted wide interest for tackling software engineering tasks. Existing benchmarks commonly suffer from three shortcomings. SWE-Refactor comprises 1,099 developer-written, behavior-preserving refactorings mined from 18 Java projects.
arXiv Detail & Related papers (2026-02-03T16:36:29Z) - How do Agents Refactor: An Empirical Study [2.7711196026307476]
We present the first analysis of agentic refactoring pull requests in Java. We identify refactoring types and detect code smells before and after commits. We find Cursor to be the only agent to show a statistically significant increase in smells.
arXiv Detail & Related papers (2026-01-28T01:34:15Z) - From Human to Machine Refactoring: Assessing GPT-4's Impact on Python Class Quality and Readability [46.83143241367452]
Refactoring aims to improve code quality without altering program behavior. Recent advances in Large Language Models (LLMs) have introduced new opportunities for automated refactoring. We present an empirical study of LLM-driven refactoring using GPT-4o, applied to 100 Python classes from the ClassEval benchmark. Our findings show that GPT-4o generally produces behavior-preserving refactorings that reduce code smells and improve quality metrics, albeit at the cost of decreased readability.
arXiv Detail & Related papers (2026-01-19T15:22:37Z) - Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning [84.70211451226835]
Large Language Model (LLM) Agents are constrained by a dependency on human-curated data. We introduce Agent0, a fully autonomous framework that evolves high-performing agents without external data. Agent0 substantially boosts reasoning capabilities, improving the Qwen3-8B-Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks.
arXiv Detail & Related papers (2025-11-20T05:01:57Z) - RefAgent: A Multi-agent LLM-based Framework for Automatic Software Refactoring [8.038518812060897]
RefAgent is a multi-agent LLM-based framework for end-to-end software refactoring. It consists of specialized agents responsible for planning, executing, and iteratively refining refactorings through testing. It achieves a median unit test pass rate of 90%, reduces code smells by a median of 52.5%, and improves key quality attributes by a median of 8.6%.
arXiv Detail & Related papers (2025-11-05T03:20:58Z) - Refactoring $\neq$ Bug-Inducing: Improving Defect Prediction with Code Change Tactics Analysis [54.361900378970134]
Just-in-time defect prediction (JIT-DP) aims to predict the likelihood of code changes resulting in software defects at an early stage. Prior research has largely ignored code refactoring during both the evaluation and methodology phases, despite its prevalence. We propose Code chAnge Tactics (CAT) analysis to categorize code refactoring and its propagation, which improves labeling accuracy in the JIT-Defects4J dataset by 13.7%.
arXiv Detail & Related papers (2025-07-25T23:29:25Z) - Relating Complexity, Explicitness, Effectiveness of Refactorings and Non-Functional Requirements: A Replication Study [39.82126443893643]
Self-affirmed refactoring (SAR) is where developers explicitly state their intent to refactor. This study expanded the scope of Soares et al.'s study by doubling the number of projects and using a significantly larger set of validated refactoring instances. We observed that when developers explicitly state their intent, the resulting changes typically involve a combination of different refactoring types, making them more complex.
arXiv Detail & Related papers (2025-05-12T19:26:33Z) - Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z) - MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration [44.75848695076576]
We introduce MANTRA, a comprehensive LLM agent-based framework for automated method-level refactoring. MANTRA integrates Context-Aware Retrieval-Augmented Generation, coordinated Multi-Agent Collaboration, and Verbal Reinforcement Learning. Experimental results demonstrate that MANTRA substantially surpasses a baseline LLM model.
arXiv Detail & Related papers (2025-03-18T15:16:51Z) - Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement [112.04307762405669]
Gödel Agent is a self-evolving framework inspired by the Gödel machine. Gödel Agent can achieve continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.
arXiv Detail & Related papers (2024-10-06T10:49:40Z) - Do code refactorings influence the merge effort? [80.1936417993664]
Multiple contributors frequently change the source code in parallel to implement new features, fix bugs, refactor existing code, and make other changes.
These simultaneous changes need to be merged into the same version of the source code.
Studies show that 10 to 20 percent of all merge attempts result in conflicts, which require manual developer intervention to complete the process.
arXiv Detail & Related papers (2023-05-10T13:24:59Z)