Together We Go Further: LLMs and IDE Static Analysis for Extract Method Refactoring
- URL: http://arxiv.org/abs/2401.15298v2
- Date: Wed, 24 Apr 2024 19:09:52 GMT
- Title: Together We Go Further: LLMs and IDE Static Analysis for Extract Method Refactoring
- Authors: Dorin Pomian, Abhiram Bellur, Malinda Dilhara, Zarina Kurbatova, Egor Bogomolov, Timofey Bryksin, Danny Dig
- Abstract summary: Long methods that encapsulate multiple responsibilities within a single method are challenging to maintain.
Large Language Models (LLMs) have been trained on large code corpora.
LLMs are very effective for giving expert suggestions, yet they are unreliable: up to 76.3% of the suggestions are hallucinations.
- Score: 9.882903340467815
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long methods that encapsulate multiple responsibilities within a single method are challenging to maintain. Choosing which statements to extract into new methods has been the target of many research tools. Despite steady improvements, these tools often fail to generate refactorings that align with developers' preferences and acceptance criteria. Given that Large Language Models (LLMs) have been trained on large code corpora, if we harness their familiarity with the way developers form functions, we could suggest refactorings that developers are likely to accept. In this paper, we advance the science and practice of refactoring by synergistically combining the insights of LLMs with the power of IDEs to perform Extract Method (EM). Our formative study on 1,752 EM scenarios revealed that LLMs are very effective at giving expert suggestions, yet they are unreliable: up to 76.3% of the suggestions are hallucinations. We designed a novel approach that removes hallucinations from the candidates suggested by LLMs, then further enhances and ranks suggestions based on static analysis techniques from program slicing, and finally leverages the IDE to execute refactorings correctly. We implemented this approach in an IntelliJ IDEA plugin called EM-Assist. We empirically evaluated EM-Assist on a diverse corpus that replicates 1,752 actual refactorings from open-source projects. We found that EM-Assist outperforms previous state-of-the-art tools: EM-Assist suggests the developer-performed refactoring in 53.4% of cases, improving over the recall rate of 39.4% for previous best-in-class tools. Furthermore, we conducted firehouse surveys with 16 industrial developers, suggesting refactorings on their recent commits; 81.3% of them agreed with the recommendations provided by EM-Assist.
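The pipeline the abstract describes (LLM suggestions, hallucination filtering, static-analysis-based ranking, IDE-driven execution) can be pictured with a minimal Python sketch. This is an illustration under assumptions, not EM-Assist's implementation: the `Suggestion` dataclass, `is_valid_candidate`, and the length-based ranking heuristic are hypothetical stand-ins for the plugin's program-slicing checks.

```python
# Hypothetical sketch: keep only LLM-suggested line ranges that actually map
# onto the method body, then rank the survivors. A stand-in for EM-Assist's
# hallucination filtering and slicing-based ranking, not its real code.
from dataclasses import dataclass


@dataclass
class Suggestion:
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive


def is_valid_candidate(method_lines: list[str], s: Suggestion) -> bool:
    """Reject hallucinated ranges: out of bounds, inverted, or brace-only."""
    if s.start_line < 1 or s.end_line > len(method_lines):
        return False  # refers to lines that do not exist in the method
    if s.start_line > s.end_line:
        return False  # inverted range
    region = [line.strip() for line in method_lines[s.start_line - 1:s.end_line]]
    return any(line and line not in ("{", "}") for line in region)


def rank_candidates(method_lines: list[str],
                    suggestions: list[Suggestion]) -> list[Suggestion]:
    """Toy ranking: prefer longer valid extractions, standing in for the
    static-analysis heuristics described in the paper."""
    valid = [s for s in suggestions if is_valid_candidate(method_lines, s)]
    return sorted(valid, key=lambda s: s.end_line - s.start_line, reverse=True)


if __name__ == "__main__":
    method = [
        "void process(List<Order> orders) {",
        "    double total = 0;",
        "    for (Order o : orders) {",
        "        total += o.amount();",
        "    }",
        "    System.out.println(total);",
        "}",
    ]
    raw = [Suggestion(2, 5), Suggestion(10, 20), Suggestion(6, 6)]
    print(rank_candidates(method, raw))  # Suggestion(10, 20) is filtered out
```

In EM-Assist itself, the surviving candidates are handed to the IDE, which, as the abstract notes, is what executes the Extract Method refactoring correctly rather than relying on the LLM to rewrite code.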
Related papers
- An Empirical Study on the Potential of LLMs in Automated Software Refactoring [9.157968996300417]
We investigate the potential of large language models (LLMs) in automated software refactoring.
We find that 13 out of the 176 solutions suggested by ChatGPT and 9 out of the 137 solutions suggested by Gemini were unsafe in that they either changed the functionality of the source code or introduced syntax errors.
arXiv Detail & Related papers (2024-11-07T05:35:55Z) - AIME: AI System Optimization via Multiple LLM Evaluators [79.03422337674664]
AIME is an evaluation protocol that utilizes multiple LLMs that each independently generate an evaluation on separate criteria and then combine them via concatenation.
We show AIME outperforming baseline methods in code generation tasks, with up to 62% higher error detection rate and up to 16% higher success rate than a single-LLM evaluation protocol on the LeetCodeHard and HumanEval datasets.
arXiv Detail & Related papers (2024-10-04T04:03:24Z) - Context-Enhanced LLM-Based Framework for Automatic Test Refactoring [10.847400457238423]
Test smells arise from poor design practices and insufficient domain knowledge.
We propose UTRefactor, a context-enhanced, LLM-based framework for automatic test refactoring in Java projects.
We evaluate UTRefactor on 879 tests from six open-source Java projects, reducing the number of test smells from 2,375 to 265, achieving an 89% reduction.
arXiv Detail & Related papers (2024-09-25T08:42:29Z) - FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, securing up to 85% of model performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated than canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked and only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - EM-Assist: Safe Automated ExtractMethod Refactoring with LLMs [9.474820853051702]
We introduce EM-Assist, an IntelliJ IDEA plugin that generates suggestions and subsequently validates, enhances, and ranks them.
In our evaluation of 1,752 real-world refactorings that took place in open-source projects, EM-Assist's recall rate was 53.4% among its top-5 recommendations, compared to 39.4% for the previous best-in-class tool.
arXiv Detail & Related papers (2024-05-31T00:32:04Z) - Behind the Intent of Extract Method Refactoring: A Systematic Literature Review [15.194527511076725]
Code refactoring is widely recognized as an essential software engineering practice to improve the understandability and maintainability of the source code.
The Extract Method refactoring is considered the "Swiss army knife" of refactorings, as developers often apply it to improve their code quality.
In recent years, several studies have attempted to recommend Extract Method refactorings, allowing the collection, analysis, and revelation of actionable, data-driven insights.
arXiv Detail & Related papers (2023-12-19T21:09:54Z) - LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLMs) leverage human feedback to improve their generation quality.
We propose LLMRefine, an inference-time optimization method to refine the LLM's output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements of up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, and 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z) - Empirical Evaluation of a Live Environment for Extract Method Refactoring [0.0]
We developed a Live Refactoring Environment that visually identifies, recommends, and applies Extract Methods.
Our results were significantly different from, and better than, those obtained by refactoring the code manually without further help.
arXiv Detail & Related papers (2023-07-20T16:36:02Z) - Self-Refine: Iterative Refinement with Self-Feedback [62.78755306241981]
Self-Refine is an approach for improving initial outputs from large language models (LLMs) through iterative feedback and refinement.
We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs.
Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach.
arXiv Detail & Related papers (2023-03-30T18:30:01Z)
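A minimal sketch of the Self-Refine loop summarized in the entry above, under simplifying assumptions: `llm` is a hypothetical callable standing in for a real model API, the stopping signal is a literal DONE reply, and the prompts are illustrative rather than the paper's actual templates.

```python
# Illustrative Self-Refine-style loop: the same model drafts, critiques, and
# revises its own output. The `llm` callable is a hypothetical stand-in.
from typing import Callable


def self_refine(task: str, llm: Callable[[str], str], max_iters: int = 3) -> str:
    output = llm(f"Solve the following task:\n{task}")
    for _ in range(max_iters):
        feedback = llm(
            f"Task:\n{task}\n\nDraft answer:\n{output}\n\n"
            "Give concrete, actionable feedback, or reply DONE if no changes are needed."
        )
        if "DONE" in feedback:
            break  # the model judged its own draft acceptable
        output = llm(
            f"Task:\n{task}\n\nDraft answer:\n{output}\n\nFeedback:\n{feedback}\n\n"
            "Rewrite the answer, applying the feedback."
        )
    return output


# Tiny demo with a stub "model" that answers once and then approves the draft:
print(self_refine("Add 2 and 3.",
                  lambda p: "DONE" if "feedback" in p.lower() else "5"))
```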