Related papers: CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

URL: http://arxiv.org/abs/2603.04177v1
Date: Wed, 04 Mar 2026 15:34:18 GMT
Title: CodeTaste: Can LLMs Generate Human-Level Code Refactorings?
Authors: Alex Thillen, Niels Mündler, Veselin Raychev, Martin Vechev
Abstract summary: Large language model (LLM) coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program transformations that improve structure and maintainability. We present CodeTaste, a benchmark of refactoring tasks mined from large-scale multi-file changes in open-source repositories.
Score: 2.447746234944228
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language model (LLM) coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program transformations that improve structure and maintainability. In this paper, we investigate if LLM agents (i) can execute refactorings reliably and (ii) identify the refactorings that human developers actually chose in real codebases. We present CodeTaste, a benchmark of refactoring tasks mined from large-scale multi-file changes in open-source repositories. To score solutions, we combine repository test suites with custom static checks that verify removal of undesired patterns and introduction of desired patterns using dataflow reasoning. Our experimental results indicate a clear gap across frontier models: agents perform well when refactorings are specified in detail, but often fail to discover the human refactoring choices when only presented with a focus area for improvement. A propose-then-implement decomposition improves alignment, and selecting the best-aligned proposal before implementation can yield further gains. CodeTaste provides an evaluation target and a potential preference signal for aligning coding agents with human refactoring decisions in realistic codebases.

Related papers

SWE-Refactor: A Repository-Level Benchmark for Real-World LLM-Based Code Refactoring [20.694251041823097]
Large Language Models (LLMs) have attracted wide interest for tackling software engineering tasks. Existing benchmarks commonly suffer from three shortcomings. SWE-Refactor comprises 1,099 developer-written, behavior-preserving refactorings mined from 18 Java projects.
arXiv Detail & Related papers (2026-02-03T16:36:29Z)
How do Agents Refactor: An Empirical Study [2.7711196026307476]
We present the first analysis of agentic pull requests in Java. We identify types and detect code smells before and after commits. We find Cursor to be the only model to show a statistically significant increase in smells.
arXiv Detail & Related papers (2026-01-28T01:34:15Z)
AlignCoder: Aligning Retrieval with Target Intent for Repository-Level Code Completion [55.21541958868449]
We propose AlignCoder, a repository-level code completion framework. Our framework generates an enhanced query that bridges the semantic gap between the initial query and the target code. We employ reinforcement learning to train an AlignRetriever that learns to leverage inference information in the enhanced query for more accurate retrieval.
arXiv Detail & Related papers (2026-01-27T15:23:14Z)
Refactoring with LLMs: Bridging Human Expertise and Machine Understanding [5.2993089947181735]
We draw on Martin Fowler's guidelines to design instruction strategies for 61 well-known transformation types. We evaluate these strategies on benchmark examples and real-world code snippets from GitHub projects. While descriptive instructions are more interpretable to humans, our results show that rule-based instructions often lead to better performance in specific scenarios.
arXiv Detail & Related papers (2025-10-04T19:40:42Z)
Turning the Tide: Repository-based Code Reflection [52.13709676656648]
We introduce LiveRepoReflection, a benchmark for evaluating code understanding and generation in multi-file repository contexts. It comprises 1,888 rigorously filtered test cases across 6 programming languages to ensure diversity, correctness, and high difficulty. We also create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources.
arXiv Detail & Related papers (2025-07-14T02:36:27Z)
Refactoring Codebases through Library Design [21.039476331720312]
We investigate code agents' capacity to code in ways that support growth and reusability. We present both a benchmark and a method for generating reusable libraries. We compare Librarian to state-of-the-art library generation methods, and study it on real-world code bases.
arXiv Detail & Related papers (2025-05-26T07:26:33Z)
Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs [54.309127753635366]
We present the results of a replication study in which we investigate GPT-4 effectiveness in recommending and suggesting idiomatic actions. Our findings underscore the potential of LLMs to achieve tasks where, in the past, implementing recommenders based on complex code analyses was required.
arXiv Detail & Related papers (2025-01-28T15:41:54Z)
StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback [58.20547418182074]
We introduce StepCoder, a novel framework for code generation, consisting of two main components. CCCS addresses the exploration challenge by breaking the long-sequence code generation task into a Curriculum of Code Completion Subtasks. FGO only optimizes the model by masking the unexecuted code segments to provide Fine-Grained Optimization. Our method improves the ability to explore the output space and outperforms state-of-the-art approaches in corresponding benchmarks.
arXiv Detail & Related papers (2024-02-02T13:14:31Z)
Do code refactorings influence the merge effort? [80.1936417993664]
Multiple contributors frequently change the source code in parallel to implement new features, fix bugs, refactor existing code, and make other changes. These simultaneous changes need to be merged into the same version of the source code. Studies show that 10 to 20 percent of all merge attempts result in conflicts, which require manual intervention by the developer to complete the process.
arXiv Detail & Related papers (2023-05-10T13:24:59Z)
RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation [96.75695811963242]
RepoCoder is a framework to streamline the repository-level code completion process. It incorporates a similarity-based retriever and a pre-trained code language model. It consistently outperforms the vanilla retrieval-augmented code completion approach.
arXiv Detail & Related papers (2023-03-22T13:54:46Z)
How We Refactor and How We Document it? On the Use of Supervised Machine Learning Algorithms to Classify Refactoring Documentation [25.626914797750487]
Refactoring is the art of improving the design of a system without altering its external behavior. This study categorizes commits into 3 categories, namely, Internal QA, External QA, and Code Smell Resolution, along with the traditional BugFix and Functional categories. To better understand our classification results, we analyzed commit messages to extract patterns that developers regularly use to describe their smells.
arXiv Detail & Related papers (2020-10-26T20:33:17Z)

This list is automatically generated from the titles and abstracts of the papers on this site.