Related papers: Evaluating the Effectiveness of Small Language Models in Detecting Refactoring Bugs

URL: http://arxiv.org/abs/2502.18454v2
Date: Fri, 28 Mar 2025 17:43:30 GMT
Title: Evaluating the Effectiveness of Small Language Models in Detecting Refactoring Bugs
Authors: Rohit Gheyi, Marcio Ribeiro, Jonhnanthan Oliveira
Abstract summary: This study evaluates the effectiveness of Small Language Models (SLMs) in detecting two types of refactoring bugs in Java and Python. It covers 16 refactoring types and employs zero-shot prompting on consumer-grade hardware to evaluate the models' ability to reason about refactoring correctness without explicit prior training. The proprietary o3-mini-high model achieved the highest detection rate, identifying 84.3% of Type I bugs.
Score: 0.6133301815445301
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Popular IDEs frequently contain bugs in their refactoring implementations. Ensuring that a transformation preserves a program's behavior is a complex task. Traditional detection methods rely on predefined preconditions for each refactoring type, limiting their scalability and adaptability to new transformations. These methods often require extensive static and dynamic analyses, which are computationally expensive, time-consuming, and may still fail to detect certain refactoring bugs. This study evaluates the effectiveness of Small Language Models (SLMs) in detecting two types of refactoring bugs in Java and Python: (i) transformations that introduce errors or behavioral changes (Type I) and (ii) transformations unnecessarily blocked by IDEs despite being valid (Type II). We assess whether Llama 3.2 3B, Mistral 7B, Gemma 2 9B, Gemma 3 12B, DeepSeek-R1 14B, Phi-4 14B, o1-mini, and o3-mini-high can accurately detect 100 refactoring bugs reported in widely used Java and Python IDEs, such as Eclipse and NetBeans. The study covers 16 refactoring types and employs zero-shot prompting on consumer-grade hardware to evaluate the models' ability to reason about refactoring correctness without explicit prior training. The proprietary o3-mini-high model achieved the highest detection rate, identifying 84.3% of Type I bugs. The open-source Phi-4 14B performed comparably well, demonstrating strong effectiveness across both bug types. However, o3-mini-high struggled with Type II bugs, correctly identifying and applying valid but blocked transformations in only 40% of cases. The findings highlight the potential of SLMs for efficiently detecting refactoring bugs, particularly in verifying behavioral changes. Additionally, SLMs offer a more adaptable solution capable of generalizing across different refactoring types and programming languages, addressing key limitations of traditional approaches.
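
The zero-shot setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' prompt or tooling: the `RefactoringCase` fields, the prompt wording, and the YES/NO answer convention are all assumptions made for illustration, and the model call itself is left out. The sketch covers only the Type I check (did the transformation change behavior?) and shows how a detection rate such as the reported 84.3% would be computed from per-case verdicts.

```python
from dataclasses import dataclass


@dataclass
class RefactoringCase:
    """One reported refactoring bug (hypothetical record layout)."""
    refactoring: str  # e.g. "Rename Method"
    before: str       # program before the transformation
    after: str        # program after the transformation


def build_zero_shot_prompt(case: RefactoringCase) -> str:
    """Build a prompt with no worked examples (zero-shot).

    The wording is an assumption, not the prompt used in the paper.
    """
    return (
        f"A '{case.refactoring}' refactoring was applied to a program.\n\n"
        f"Before:\n{case.before}\n\n"
        f"After:\n{case.after}\n\n"
        "Does the resulting program compile and behave exactly like the "
        "original? Start your answer with YES or NO."
    )


def parse_verdict(reply: str) -> bool:
    """Return True if the model flags a Type I bug (answer starts with NO)."""
    tokens = reply.upper().split()
    return bool(tokens) and tokens[0].strip(".,:;!") == "NO"


def detection_rate(verdicts: list[bool]) -> float:
    """Percentage of known-buggy cases that the model flagged."""
    if not verdicts:
        return 0.0
    return 100.0 * sum(verdicts) / len(verdicts)


if __name__ == "__main__":
    case = RefactoringCase(
        refactoring="Rename Method",
        before="class A { int m() { return 1; } }",
        after="class A { int n() { return 1; } }",
    )
    print(build_zero_shot_prompt(case))
    # Replies would come from a locally hosted model; these are made up.
    replies = ["NO, the call now binds elsewhere.", "YES", "No.", "NO"]
    print(detection_rate([parse_verdict(r) for r in replies]))
```

Because every case in such a benchmark is a known bug, the detection rate is simply the share of cases the model flags. Checking Type II bugs (valid transformations blocked by the IDE) would need a different question and is not covered here.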

Related papers

Semantic-Preserving Transformations as Mutation Operators: A Study on Their Effectiveness in Defect Detection [3.3590922002216197]
We collect existing publications which implemented semantic-preserving transformations and share their implementation. We empirically study the effectiveness of three different ensemble strategies for enhancing defect detection tools. Our results show that reusing shared semantic-preserving transformations is difficult, sometimes even causing wrongful changes to the semantics.
arXiv Detail & Related papers (2025-03-30T14:00:22Z)
Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
Comprehensive Multi-Modal Prototypes are Simple and Effective Classifiers for Vast-Vocabulary Object Detection [68.26282316080558]
Current open-world detectors can recognize a broader range of vocabularies, despite being trained on limited categories. We introduce Prova, a prototype classifier for vast-vocabulary object detection.
arXiv Detail & Related papers (2024-12-23T18:57:43Z)
An Empirical Study of Refactoring Engine Bugs [7.412890903261693]
We present the first systematic study of refactoring engine bugs by analyzing bugs in Eclipse, IntelliJ IDEA, and NetBeans. We analyzed these bugs according to their types, symptoms, root causes, and triggering conditions. Our transferability study revealed 130 new bugs in the latest version of those engines.
arXiv Detail & Related papers (2024-09-22T22:09:39Z)
Detecting Refactoring Commits in Machine Learning Python Projects: A Machine Learning-Based Approach [3.000496428347787]
MLRefScanner identifies commits with both ML-specific and general refactoring operations. Our study highlights the potential of ML-driven approaches in detecting refactoring across diverse programming languages and technical domains.
arXiv Detail & Related papers (2024-04-09T18:46:56Z)
ReGAL: Refactoring Programs to Discover Generalizable Abstractions [59.05769810380928]
Refactoring for Generalizable Abstraction Learning (ReGAL) is a method for learning a library of reusable functions via code refactorization. We find that the shared function libraries discovered by ReGAL make programs easier to predict across diverse domains. For CodeLlama-13B, ReGAL results in absolute accuracy increases of 11.5% on LOGO, 26.1% on date understanding, and 8.1% on TextCraft, outperforming GPT-3.5 in two of three domains.
arXiv Detail & Related papers (2024-01-29T18:45:30Z)
DebugBench: Evaluating Debugging Capability of Large Language Models [80.73121177868357]
DebugBench is a debugging benchmark for Large Language Models (LLMs). It covers four major bug categories and 18 minor types in C++, Java, and Python. We evaluate two commercial and four open-source models in a zero-shot scenario.
arXiv Detail & Related papers (2024-01-09T15:46:38Z)
Automated Bug Generation in the era of Large Language Models [6.0770779409377775]
BugFarm transforms arbitrary code into multiple complex bugs. The evaluation covers 435k+ bugs from over 1.9M mutants generated by BugFarm.
arXiv Detail & Related papers (2023-10-03T20:01:51Z)
BigIssue: A Realistic Bug Localization Benchmark [89.8240118116093]
BigIssue is a benchmark for realistic bug localization. We provide a general benchmark with a diversity of real and synthetic Java bugs. We hope to advance the state of the art in bug localization, in turn improving APR performance and increasing its applicability to the modern development cycle.
arXiv Detail & Related papers (2022-07-21T20:17:53Z)
Generating Bug-Fixes Using Pretrained Transformers [11.012132897417592]
We introduce a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub repositories. We show that pretraining on source code programs improves the number of patches found by 33% as compared to supervised training from scratch. We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art.
arXiv Detail & Related papers (2021-04-16T05:27:04Z)
Prior Guided Feature Enrichment Network for Few-Shot Segmentation [64.91560451900125]
State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results. Few-shot segmentation is proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples. These frameworks still face the challenge of reduced generalization ability on unseen classes due to inappropriate use of high-level semantic information.
arXiv Detail & Related papers (2020-08-04T10:41:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.