How do Agents Refactor: An Empirical Study
- URL: http://arxiv.org/abs/2601.20160v1
- Date: Wed, 28 Jan 2026 01:34:15 GMT
- Title: How do Agents Refactor: An Empirical Study
- Authors: Lukas Ottenhof, Daniel Penner, Abram Hindle, Thibaud Lutellier
- Abstract summary: We present the first analysis of agentic refactoring pull requests in Java. We identify refactoring types and detect code smells before and after refactoring commits. We find Cursor to be the only model to show a statistically significant increase in refactoring smells.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Software development agents such as Claude Code, GitHub Copilot, Cursor Agent, Devin, and OpenAI Codex are being increasingly integrated into developer workflows. While prior work has evaluated agent capabilities for code completion and task automation, there is little work investigating how these agents perform Java refactoring in practice, the types of changes they make, and their impact on code quality. In this study, we present the first analysis of agentic refactoring pull requests in Java, comparing them to developer refactorings across 86 projects per group. Using RefactoringMiner and DesigniteJava 3.0, we identify refactoring types and detect code smells before and after refactoring commits. Our results show that agent refactorings are dominated by annotation changes (the 5 most common refactoring types done by agents are annotation related), in contrast to the diverse structural improvements typical of developers. Despite these differences in refactoring types, we find Cursor to be the only model to show a statistically significant increase in refactoring smells.
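The before/after smell comparison described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual procedure or data: the per-PR smell counts are invented, and the choice of a paired sign test is an assumption standing in for whatever statistical test the authors used.

```python
import math

def sign_test_p(before, after):
    """One-sided paired sign test: probability of observing at least
    this many increases among the non-tied pairs, under H0 (p = 0.5)."""
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)                    # non-tied pairs
    k = sum(d > 0 for d in diffs)     # pairs where smell count increased
    # one-sided binomial tail: P(X >= k) for X ~ Binomial(n, 0.5)
    return sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n

# hypothetical code-smell counts per refactoring PR, before and after
before = [4, 7, 2, 5, 3, 6, 4, 8, 1, 5]
after  = [6, 9, 2, 7, 4, 8, 5, 9, 1, 6]

p = sign_test_p(before, after)
print(f"p = {p:.4f}")  # p = 1/256 ≈ 0.0039, a significant increase at α = 0.05
```

A real replication would instead feed RefactoringMiner's detected refactoring commits and DesigniteJava's smell reports into the comparison; the sketch only shows the shape of the before/after test.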
Related papers
- CodeTaste: Can LLMs Generate Human-Level Code Refactorings?
Large language model (LLM) coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program changes that improve structure and maintainability. We present CodeTaste, a benchmark of refactoring tasks mined from large-scale multi-file changes in open-source repositories.
arXiv Detail & Related papers (2026-03-04T15:34:18Z) - From Human to Machine Refactoring: Assessing GPT-4's Impact on Python Class Quality and Readability
Refactoring aims to improve code quality without altering program behavior. Recent advances in Large Language Models (LLMs) have introduced new opportunities for automated code refactoring. We present an empirical study on LLM-driven refactoring using GPT-4o, applied to 100 Python classes from the ClassEval benchmark. Our findings show that GPT-4o generally produces behavior-preserving refactorings that reduce code smells and improve quality metrics, albeit at the cost of decreased readability.
arXiv Detail & Related papers (2026-01-19T15:22:37Z) - Multi-Agent Coordinated Rename Refactoring
The primary value of AI agents in software development lies in their ability to extend the developer's capacity for reasoning. Coordinated renaming, where a single rename triggers corresponding changes to multiple, related identifiers, is a frequent yet challenging refactoring task. We designed, implemented, and evaluated the first multi-agent framework that automates coordinated renaming.
arXiv Detail & Related papers (2026-01-01T21:29:43Z) - Agentic Refactoring: An Empirical Study of AI Coding Agents
Agentic coding tools, such as OpenAI Codex, Claude Code, and Cursor, are transforming the software engineering landscape. These AI-powered systems function as autonomous teammates capable of planning and executing complex development tasks. There is a critical lack of empirical understanding regarding how agentic refactoring is utilized in practice, how it compares to human-driven refactoring, and what impact it has on code quality.
arXiv Detail & Related papers (2025-11-06T21:24:38Z) - RefModel: Detecting Refactorings using Foundation Models
We investigate the viability of using foundation models for refactoring detection, implemented in a tool named RefModel. We evaluate Phi4-14B and Claude 3.5 Sonnet on a dataset of 858 single-operation transformations applied to artificially generated Java programs. In real-world settings, Claude 3.5 Sonnet and Gemini 2.5 Pro jointly identified 97% of all transformations, surpassing the best-performing static-analysis-based tools.
arXiv Detail & Related papers (2025-07-15T14:20:56Z) - Assessing the Bug-Proneness of Refactored Code: A Longitudinal Multi-Project Study
Refactoring is a common practice in software development, aimed at improving the internal code structure in order to make it easier to understand and modify. It is often assumed that refactoring makes the code less prone to bugs. However, in practice, refactoring is a complex task and is applied in different ways. Therefore, certain refactorings can inadvertently make the code more prone to bugs.
arXiv Detail & Related papers (2025-05-12T19:12:30Z) - MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration
We introduce MANTRA, a comprehensive Large Language Model (LLM) agent-based refactoring framework. MANTRA integrates Context-Aware Retrieval-Augmented Generation, coordinated Multi-Agent Collaboration, and Verbal Reinforcement Learning. Experimental results demonstrate that MANTRA substantially surpasses a baseline LLM.
arXiv Detail & Related papers (2025-03-18T15:16:51Z) - Refactoring Detection in C++ Programs with RefactoringMiner++
We present RefactoringMiner++, a refactoring detection tool based on the current state of the art: RefactoringMiner 3. While the latter focuses exclusively on Java, our tool is, to the best of our knowledge, the first publicly available refactoring detection tool for C++ projects.
arXiv Detail & Related papers (2025-02-24T23:17:35Z) - RefBERT: A Two-Stage Pre-trained Framework for Automatic Rename Refactoring
We study automatic rename refactoring of variable names, which is considered more challenging than other rename activities.
We propose RefBERT, a two-stage pre-trained framework for rename refactoring of variable names.
We show that the generated variable names of RefBERT are more accurate and meaningful than those produced by the existing method.
arXiv Detail & Related papers (2023-05-28T12:29:39Z) - Do code refactorings influence the merge effort?
Multiple contributors frequently change the source code in parallel to implement new features, fix bugs, refactor existing code, and make other changes.
These simultaneous changes need to be merged into the same version of the source code.
Studies show that 10 to 20 percent of all merge attempts result in conflicts, which require manual developer intervention to complete the process.
arXiv Detail & Related papers (2023-05-10T13:24:59Z) - How We Refactor and How We Document it? On the Use of Supervised Machine Learning Algorithms to Classify Refactoring Documentation
Refactoring is the art of improving the design of a system without altering its external behavior.
This study categorizes refactoring commits into 3 categories, namely, Internal QA, External QA, and Code Smell Resolution, along with the traditional BugFix and Functional categories.
To better understand our classification results, we analyzed commit messages to extract patterns that developers regularly use to describe their refactorings.
arXiv Detail & Related papers (2020-10-26T20:33:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.