Related papers: From Human to Machine Refactoring: Assessing GPT-4's Impact on Python Class Quality and Readability

From Human to Machine Refactoring: Assessing GPT-4's Impact on Python Class Quality and Readability

URL: http://arxiv.org/abs/2601.13139v1
Date: Mon, 19 Jan 2026 15:22:37 GMT
Title: From Human to Machine Refactoring: Assessing GPT-4's Impact on Python Class Quality and Readability
Authors: Alessandro Midolo, Emiliano Tramontana, Massimiliano Di Penta,
Abstract summary: Refactoring aims to improve code quality without altering program behavior.<n>Recent advances in Large Language Models (LLMs) have introduced new opportunities for automated code preservation.<n>We present an empirical study on LLM-driven classes using GPT-4o, applied to 100 Python classes from the ClassEval benchmark.<n>Our findings show that GPT-4o generally produces behavior-preservings that reduce code smells and improve quality metrics, albeit at the cost of decreased readability.
Score: 46.83143241367452
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Refactoring is a software engineering practice that aims to improve code quality without altering program behavior. Although automated refactoring tools have been extensively studied, their practical applicability remains limited. Recent advances in Large Language Models (LLMs) have introduced new opportunities for automated code refactoring. The evaluation of such an LLM-driven approach, however, leaves unanswered questions about its effects on code quality. In this paper, we present a comprehensive empirical study on LLM-driven refactoring using GPT-4o, applied to 100 Python classes from the ClassEval benchmark. Unlike prior work, our study explores a wide range of class-level refactorings inspired by Fowler's catalog and evaluates their effects from three complementary perspectives: (i) behavioral correctness, verified through unit tests; (ii) code quality, assessed via Pylint, Flake8, and SonarCloud; and (iii) readability, measured using a state-of-the-art readability tool. Our findings show that GPT-4o generally produces behavior-preserving refactorings that reduce code smells and improve quality metrics, albeit at the cost of decreased readability. Our results provide new evidence on the capabilities and limitations of LLMs in automated software refactoring, highlighting directions for integrating LLMs into practical refactoring workflows.

Related papers

From Restructuring to Stabilization: A Large-Scale Experiment on Iterative Code Readability Refactoring with Large Language Models [5.31828955342405]
Large language models (LLMs) are increasingly used for automated code tasks.<n>This article systematically study the capabilities of LLMs for code readability.
arXiv Detail & Related papers (2026-02-25T12:05:25Z)
SWE-Refactor: A Repository-Level Benchmark for Real-World LLM-Based Code Refactoring [20.694251041823097]
Large Language Models (LLMs) have attracted wide interest for tackling software engineering tasks.<n>Existing benchmarks commonly suffer from three shortcomings.<n>SWE-Refactor comprises 1,099 developer-written, behavior-preserving LLMs mined from 18 Java projects.
arXiv Detail & Related papers (2026-02-03T16:36:29Z)
Refactoring with LLMs: Bridging Human Expertise and Machine Understanding [5.2993089947181735]
We draw on Martin Fowler's guidelines to design instruction strategies for 61 well-known transformation types.<n>We evaluate these strategies on benchmark examples and real-world code snippets from GitHub projects.<n>While descriptive instructions are more interpretable to humans, our results show that rule-based instructions often lead to better performance in specific scenarios.
arXiv Detail & Related papers (2025-10-04T19:40:42Z)
Refactoring $\neq$ Bug-Inducing: Improving Defect Prediction with Code Change Tactics Analysis [54.361900378970134]
Just-in-time defect prediction (JIT-DP) aims to predict the likelihood of code changes resulting in software defects at an early stage.<n>Prior research has largely ignored code during both the evaluation and methodology phases, despite its prevalence.<n>We propose Code chAnge Tactics (CAT) analysis to categorize code and its propagation, which improves labeling accuracy in the JIT-Defects4J dataset by 13.7%.
arXiv Detail & Related papers (2025-07-25T23:29:25Z)
LLM-based Multi-Agent System for Intelligent Refactoring of Haskell Code [3.8442921307218882]
We propose a large language model (LLM)-based multi-agent system to automate the process on Haskell code.<n>Results show that the proposed multi-agent system could average 11.03% decreased complexity in code, an improvement of 22.46% in overall code quality, and increase performance efficiency by an average of 13.27%.
arXiv Detail & Related papers (2025-06-24T10:17:34Z)
Training Language Models to Generate Quality Code with Program Analysis Feedback [66.0854002147103]
Code generation with large language models (LLMs) is increasingly adopted in production but fails to ensure code quality.<n>We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code.
arXiv Detail & Related papers (2025-05-28T17:57:47Z)
MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration [44.75848695076576]
We introduce MANTRA, a comprehensive Large Language Models agent-based framework.<n>ManTRA integrates Context-Aware Retrieval-Augmented Generation, coordinated Multi-Agent Collaboration, and Verbal Reinforcement Learning.<n> Experimental results demonstrate that MANTRA substantially surpasses a baseline LLM model.
arXiv Detail & Related papers (2025-03-18T15:16:51Z)
Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs [54.309127753635366]
We present the results of a replication study in which we investigate GPT-4 effectiveness in recommending and suggesting idiomatic actions.<n>Our findings underscore the potential of LLMs to achieve tasks where, in the past, implementing recommenders based on complex code analyses was required.
arXiv Detail & Related papers (2025-01-28T15:41:54Z)
What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [92.62952504133926]
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks.<n>We developed a taxonomy of bugs for incorrect codes and analyzed the root cause for common bug types.<n>We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.