Automated Extract Method Refactoring with Open-Source LLMs: A Comparative Study
- URL: http://arxiv.org/abs/2510.26480v1
- Date: Thu, 30 Oct 2025 13:34:41 GMT
- Title: Automated Extract Method Refactoring with Open-Source LLMs: A Comparative Study
- Authors: Sivajeet Chand, Melih Kilic, Roland Würsching, Sushant Kumar Pandey, Alexander Pretschner
- Abstract summary: Automating the Extract Method refactoring (EMR) remains challenging and largely manual despite its importance in improving code readability and maintainability. Recent advances in open-source, resource-efficient Large Language Models (LLMs) offer promising new approaches for automating such high-level tasks.
- Score: 35.50372545468027
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automating the Extract Method refactoring (EMR) remains challenging and largely manual despite its importance in improving code readability and maintainability. Recent advances in open-source, resource-efficient Large Language Models (LLMs) offer promising new approaches for automating such high-level tasks. In this work, we critically evaluate five state-of-the-art open-source LLMs, spanning 3B to 8B parameter sizes, on the EMR task for Python code. We systematically assess functional correctness and code quality using automated metrics and investigate the impact of prompting strategies by comparing one-shot prompting to a Recursive criticism and improvement (RCI) approach. RCI-based prompting consistently outperforms one-shot prompting in test pass rates and refactoring quality. The best-performing models, Deepseek-Coder-RCI and Qwen2.5-Coder-RCI, achieve test pass percentage (TPP) scores of 0.829 and 0.808, while reducing lines of code (LOC) per method from 12.103 to 6.192 and 5.577, and cyclomatic complexity (CC) from 4.602 to 3.453 and 3.294, respectively. A developer survey on RCI-generated refactorings shows over 70% acceptance, with Qwen2.5-Coder rated highest across all evaluation criteria. In contrast, the original code scored below neutral, particularly in readability and maintainability, underscoring the benefits of automated refactoring guided by quality prompts. While traditional metrics like CC and LOC provide useful signals, they often diverge from human judgments, emphasizing the need for human-in-the-loop evaluation. Our open-source benchmark offers a foundation for future research on automated refactoring with LLMs.
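The metrics named in the abstract can be illustrated with a small sketch. The example below applies an Extract Method refactoring to a toy function and measures lines of code per method, an approximate cyclomatic complexity, and the fraction of test cases that still pass. Everything in it is an assumption made for illustration: the `BEFORE` and `AFTER` snippets, the helper names `function_metrics`, `mean_metrics`, and `pass_fraction`, and the simplified complexity count (1 plus the number of branching constructs) are not taken from the paper, whose exact tooling and TPP definition are not given here.

```python
import ast
import textwrap

# Hypothetical toy code before the refactoring: one method with nested branches.
BEFORE = textwrap.dedent('''
    def summarize(orders):
        total = 0
        for order in orders:
            if order["status"] == "paid":
                if order["amount"] > 100:
                    total += order["amount"] * 0.9
                else:
                    total += order["amount"]
        return round(total, 2)
''')

# The same behavior after extracting the discount logic into its own method.
AFTER = textwrap.dedent('''
    def discounted(amount):
        if amount > 100:
            return amount * 0.9
        return amount

    def summarize(orders):
        total = 0
        for order in orders:
            if order["status"] == "paid":
                total += discounted(order["amount"])
        return round(total, 2)
''')

# Constructs counted as one decision point each (a simplification).
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.IfExp, ast.comprehension)


def function_metrics(source):
    """Return {function name: (lines of code, approximate complexity)}."""
    metrics = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            loc = node.end_lineno - node.lineno + 1
            complexity = 1
            # Note: branches of nested functions are counted in the outer one too.
            for child in ast.walk(node):
                if isinstance(child, BRANCH_NODES):
                    complexity += 1
                elif isinstance(child, ast.BoolOp):
                    complexity += len(child.values) - 1
            metrics[node.name] = (loc, complexity)
    return metrics


def mean_metrics(source):
    """Average the per-function metrics over all functions in the source."""
    values = list(function_metrics(source).values())
    count = len(values)
    return (sum(loc for loc, _ in values) / count,
            sum(cc for _, cc in values) / count)


def pass_fraction(source, cases, entry_point="summarize"):
    """Fraction of (argument, expected) cases that the given source passes."""
    namespace = {}
    exec(source, namespace)
    passed = sum(1 for arg, expected in cases
                 if namespace[entry_point](arg) == expected)
    return passed / len(cases)


# Hypothetical test cases shared by both versions.
CASES = [
    ([{"status": "paid", "amount": 200},
      {"status": "paid", "amount": 50},
      {"status": "open", "amount": 500}], 230.0),
    ([], 0),
]

if __name__ == "__main__":
    print("before:", mean_metrics(BEFORE), pass_fraction(BEFORE, CASES))
    print("after: ", mean_metrics(AFTER), pass_fraction(AFTER, CASES))
```

In this toy case the extraction lowers both averages (9 lines and complexity 4 for the single original method, versus a mean of 5 lines and 2.5 across the two refactored methods) while every test case still passes. As the abstract cautions, such numbers are only a signal and can diverge from how developers judge the result.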
Related papers
- CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
- RefAgent: A Multi-agent LLM-based Framework for Automatic Software Refactoring [8.038518812060897]
RefAgent is a multi-agent LLM-based framework for end-to-end software refactoring. It consists of specialized agents responsible for planning, executing, and iteratively refining testing. It achieves a median unit test pass rate of 90%, reduces code smells by a median of 52.5%, and improves key quality attributes by a median of 8.6%.
arXiv Detail & Related papers (2025-11-05T03:20:58Z)
- Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains [97.5573252172065]
We train a family of Foundational Automatic Reasoning Evaluators (FARE) with a simple iterative rejection-sampling supervised finetuning approach. FARE-8B challenges larger specialized RL-trained evaluators, and FARE-20B sets the new standard for open-source evaluators. As an inference-time reranker, FARE-20B achieves near-oracle performance on MATH.
arXiv Detail & Related papers (2025-10-20T17:52:06Z)
- Benchmarking and Studying the LLM-based Code Review [34.93646390349726]
Current benchmarks frequently focus on fine-grained code units, lack complete project context, and use inadequate evaluation metrics. We introduce SWRBench, a new benchmark offering PR-centric review with full project context. Our contributions include the SWRBench benchmark, its objective evaluation method, a comprehensive study of current ACR capabilities, and an effective enhancement approach.
arXiv Detail & Related papers (2025-09-01T14:13:34Z)
- Automated Validation of LLM-based Evaluators for Software Engineering Artifacts [0.7548538278943616]
REFINE (Ranking Evaluators for FIne grained Nuanced Evaluation) is an automated framework for benchmarking large language models (LLMs). REFINE applies novel generation techniques to automatically synthesize artifacts with progressively reduced quality. It quantifies each candidate evaluator configuration by measuring how closely its rankings align with expected ordering.
arXiv Detail & Related papers (2025-08-04T18:52:01Z)
- Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z)
- QiMeng-CodeV-R1: Reasoning-Enhanced Verilog Generation [51.393569044134445]
Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification. Extending RLVR to automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges. We introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs.
arXiv Detail & Related papers (2025-05-30T03:51:06Z)
- MANTRA: Enhancing Automated Method-Level Refactoring with Contextual RAG and Multi-Agent LLM Collaboration [44.75848695076576]
We introduce MANTRA, a comprehensive Large Language Model agent-based framework. MANTRA integrates Context-Aware Retrieval-Augmented Generation, coordinated Multi-Agent Collaboration, and Verbal Reinforcement Learning. Experimental results demonstrate that MANTRA substantially surpasses a baseline LLM.
arXiv Detail & Related papers (2025-03-18T15:16:51Z)
- Reward-Guided Speculative Decoding for Efficient LLM Reasoning [80.55186052123196]
We introduce Reward-Guided Speculative Decoding (RSD), a novel framework aimed at improving the efficiency of inference in large language models (LLMs). RSD incorporates a controlled bias to prioritize high-reward outputs, in contrast to existing speculative decoding methods that enforce strict unbiasedness. RSD delivers significant efficiency gains over decoding with the target model only, while achieving significantly better accuracy than the parallel decoding method on average.
arXiv Detail & Related papers (2025-01-31T17:19:57Z)
- Generating refactored code accurately using reinforcement learning [3.179831861897336]
We propose a novel reinforcement learning-based approach for fine-tuning and aligning code language models to perform automated, intelligent Extract Method refactoring on Java source code. Our approach fine-tunes sequence-to-sequence generative models and aligns them using the Proximal Policy Optimization (PPO) algorithm. Our experiments demonstrate that our approach significantly enhances the performance of large language models in code refactoring.
arXiv Detail & Related papers (2024-12-23T23:09:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.