A Differential Fuzzing-Based Evaluation of Functional Equivalence in LLM-Generated Code Refactorings
- URL: http://arxiv.org/abs/2602.15761v1
- Date: Tue, 17 Feb 2026 17:47:13 GMT
- Title: A Differential Fuzzing-Based Evaluation of Functional Equivalence in LLM-Generated Code Refactorings
- Authors: Simantika Bhattacharjee Dristi, Matthew B. Dwyer
- Abstract summary: We leverage differential fuzzing to check functional equivalence in code refactorings generated by large language models (LLMs). LLMs show a non-trivial tendency to alter program semantics, producing 19-35% functionally non-equivalent refactorings. Our experiments further demonstrate that about 21% of these non-equivalent refactorings remain undetected by the existing test suites of the three evaluated datasets.
- Score: 15.211628096103473
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid adoption of large language models (LLMs) in automated code refactoring, assessing and ensuring functional equivalence between an LLM-generated refactoring and the original implementation becomes critical. While prior work typically relies on predefined test cases to evaluate correctness, in this work, we leverage differential fuzzing to check functional equivalence in LLM-generated code refactorings. Unlike test-based evaluation, a differential fuzzing-based equivalence checker needs no predefined test cases and can explore a much larger input space by executing and comparing thousands of automatically generated test inputs. In a large-scale evaluation of six LLMs (CodeLlama, Codestral, StarChat2, Qwen-2.5, Olmo-3, and GPT-4o) across three datasets and two refactoring types, we find that LLMs show a non-trivial tendency to alter program semantics, producing 19-35% functionally non-equivalent refactorings. Our experiments further demonstrate that about 21% of these non-equivalent refactorings remain undetected by the existing test suites of the three evaluated datasets. Collectively, the findings of this study imply that reliance on existing tests might overestimate functional equivalence in LLM-generated code refactorings, which remain prone to semantic divergence.
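The core mechanism the abstract describes can be illustrated with a short sketch (illustrative only, not the authors' tooling): repeatedly draw random inputs, execute both the original and the refactored implementation, and report any input on which their observable behavior (return value or raised exception) differs. The function names, input generator, and trial count below are assumptions for the example.

```python
import random

def differential_fuzz(original, refactored, gen_args, trials=10_000, seed=0):
    """Return inputs on which `refactored` diverges from `original`.

    An empty result is evidence of, not proof of, functional equivalence:
    the fuzzer only samples the input space.
    """
    rng = random.Random(seed)

    def observe(fn, args):
        # Normal results and exceptions are compared uniformly, so a
        # refactoring that changes error behavior also counts as divergent.
        try:
            return ("ok", fn(*args))
        except Exception as exc:
            return ("raises", type(exc).__name__)

    divergences = []
    for _ in range(trials):
        args = gen_args(rng)
        if observe(original, args) != observe(refactored, args):
            divergences.append(args)
    return divergences

# Example: a "refactoring" that silently changes semantics for negative inputs.
original = lambda x: x * abs(x)
refactored = lambda x: x * x  # non-equivalent whenever x < 0

bad = differential_fuzz(original, refactored,
                        lambda rng: (rng.randint(-100, 100),))
print(f"{len(bad)} diverging inputs found, e.g. {bad[:3]}")
```

Because the checker compares the two behaviors against each other rather than asserting expected values, it needs no predefined test cases, which is exactly the advantage over test-suite-based evaluation that the abstract highlights.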
Related papers
- From Restructuring to Stabilization: A Large-Scale Experiment on Iterative Code Readability Refactoring with Large Language Models [5.31828955342405]
Large language models (LLMs) are increasingly used for automated code tasks. This article systematically studies the capabilities of LLMs for code readability refactoring.
arXiv Detail & Related papers (2026-02-25T12:05:25Z) - SWE-Refactor: A Repository-Level Benchmark for Real-World LLM-Based Code Refactoring [20.694251041823097]
Large Language Models (LLMs) have attracted wide interest for tackling software engineering tasks. Existing benchmarks commonly suffer from three shortcomings. SWE-Refactor comprises 1,099 developer-written, behavior-preserving refactorings mined from 18 Java projects.
arXiv Detail & Related papers (2026-02-03T16:36:29Z) - From Human to Machine Refactoring: Assessing GPT-4's Impact on Python Class Quality and Readability [46.83143241367452]
Refactoring aims to improve code quality without altering program behavior. Recent advances in Large Language Models (LLMs) have introduced new opportunities for automated code refactoring. We present an empirical study on LLM-driven refactoring using GPT-4o, applied to 100 Python classes from the ClassEval benchmark. Our findings show that GPT-4o generally produces behavior-preserving refactorings that reduce code smells and improve quality metrics, albeit at the cost of decreased readability.
arXiv Detail & Related papers (2026-01-19T15:22:37Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - Refactoring $\neq$ Bug-Inducing: Improving Defect Prediction with Code Change Tactics Analysis [54.361900378970134]
Just-in-time defect prediction (JIT-DP) aims to predict the likelihood of code changes resulting in software defects at an early stage. Prior research has largely ignored code refactoring during both the evaluation and methodology phases, despite its prevalence. We propose Code chAnge Tactics (CAT) analysis to categorize code refactoring and its propagation, which improves labeling accuracy in the JIT-Defects4J dataset by 13.7%.
arXiv Detail & Related papers (2025-07-25T23:29:25Z) - Use Property-Based Testing to Bridge LLM Code Generation and Validation [38.25155484701058]
Large Language Models (LLMs) excel at code generation, but ensuring their outputs are functionally correct is a persistent challenge. This paper introduces Property-Generated Solver, a novel framework that leverages Property-Based Testing (PBT) to validate high-level program properties. Property-Generated Solver employs two collaborative LLM-based agents: a Generator dedicated to code generation and iterative refinement, and a Tester that manages the PBT life-cycle (a minimal property-based testing sketch appears after this list).
arXiv Detail & Related papers (2025-06-23T06:01:12Z) - Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs [54.309127753635366]
We present the results of a replication study in which we investigate GPT-4's effectiveness in recommending and suggesting idiomatic refactoring actions. Our findings underscore the potential of LLMs to achieve tasks where, in the past, implementing recommenders based on complex code analyses was required.
arXiv Detail & Related papers (2025-01-28T15:41:54Z) - Improving the Readability of Automatically Generated Tests using Large Language Models [7.7149881834358345]
We propose to combine the effectiveness of search-based generators with the readability of LLM-generated tests. Our approach focuses on improving test and variable names produced by search-based tools, while keeping their semantics unchanged.
arXiv Detail & Related papers (2024-12-25T09:08:53Z) - Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy [0.5057850174013127]
Modern Large Language Model (LLM)-based programming agents often rely on test execution feedback to refine their generated code. This paper introduces VALTEST, a novel framework that leverages semantic entropy to automatically validate test cases generated by LLMs. Experiments show that VALTEST boosts test validity by up to 29% and improves code generation performance, as evidenced by significant increases in pass@1 scores.
arXiv Detail & Related papers (2024-11-13T00:07:32Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [92.62952504133926]
This study evaluated the performance of three leading closed-source LLMs and six popular open-source LLMs on three commonly used benchmarks. We developed a taxonomy of bugs for incorrect code and analyzed the root causes of common bug types. We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
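As a companion to the Property-Generated Solver entry above, here is a minimal property-based testing sketch using the hypothesis library; the function under test and the properties are illustrative assumptions, not taken from that paper.

```python
from hypothesis import given, strategies as st

# Candidate implementation, standing in for LLM-generated code.
def sort_unique(xs):
    """Return the distinct elements of xs in ascending order."""
    return sorted(set(xs))

# High-level properties the output must satisfy for *any* input list;
# hypothesis generates many random lists and shrinks any failing case.
@given(st.lists(st.integers()))
def test_sort_unique_properties(xs):
    out = sort_unique(xs)
    assert out == sorted(out)            # ordered
    assert len(out) == len(set(out))     # duplicate-free
    assert set(out) == set(xs)           # same underlying elements
```

Run with `pytest`; a buggy implementation (e.g. one that drops elements) fails with a minimal counterexample. Like differential fuzzing, this validates behavior over a sampled input space instead of a fixed test suite.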
This list is automatically generated from the titles and abstracts of the papers on this site.