Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness
- URL: http://arxiv.org/abs/2506.06767v2
- Date: Sat, 18 Oct 2025 17:14:18 GMT
- Title: Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness
- Authors: Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé
- Abstract summary: Large Language Models (LLMs) are increasingly used to refactor unit tests, improving readability and structure while preserving behavior. We propose CTSES, a first step toward human-aligned evaluation of refactored tests. CTSES combines CodeBLEU, METEOR, and ROUGE-L into a composite score that balances semantics, lexical clarity, and structural alignment.
- Score: 15.677544288705883
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) are increasingly used to refactor unit tests, improving readability and structure while preserving behavior. Evaluating such refactorings, however, remains difficult: metrics like CodeBLEU penalize beneficial renamings and edits, while semantic similarities overlook readability and modularity. We propose CTSES, a first step toward human-aligned evaluation of refactored tests. CTSES combines CodeBLEU, METEOR, and ROUGE-L into a composite score that balances semantics, lexical clarity, and structural alignment. Evaluated on 5,000+ refactorings from Defects4J and SF110 (GPT-4o and Mistral-Large), CTSES reduces false negatives and provides more interpretable signals than individual metrics. Our emerging results illustrate that CTSES offers a proof-of-concept for composite approaches, showing their promise in bridging automated metrics and developer judgments.
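The abstract names the three component metrics but not the exact aggregation rule. As a minimal sketch only, assuming each per-metric score is already normalized to [0, 1] (e.g., via the codebleu, NLTK METEOR, and rouge-score packages) and assuming a simple equal-weight average, a hypothetical CTSES-style composite could look like this in Python; the actual weighting and normalization used by CTSES are not specified here.

```python
# Hypothetical sketch of a CTSES-style composite score.
# Assumptions (not from the paper): per-metric scores are precomputed
# in [0, 1], and the aggregation is an equal-weight average.

def ctses_composite(codebleu: float, meteor: float, rouge_l: float,
                    weights=(1 / 3, 1 / 3, 1 / 3)) -> float:
    """Combine semantic (CodeBLEU), lexical (METEOR), and structural
    (ROUGE-L) signals into a single score in [0, 1]."""
    w_cb, w_m, w_r = weights
    return w_cb * codebleu + w_m * meteor + w_r * rouge_l

# Example: a refactoring that renames identifiers may score low on
# CodeBLEU but high on METEOR and ROUGE-L; averaging softens the
# penalty a single metric would impose.
print(ctses_composite(codebleu=0.42, meteor=0.81, rouge_l=0.77))  # ~0.67
```

In this sketch, a weighted average is one plausible way to "balance" the three signals; the paper's evaluation on Defects4J and SF110 is what motivates combining them rather than relying on any single metric.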
Related papers
- From Restructuring to Stabilization: A Large-Scale Experiment on Iterative Code Readability Refactoring with Large Language Models [5.31828955342405]
Large language models (LLMs) are increasingly used for automated code tasks. This article systematically studies the capabilities of LLMs for code readability refactoring.
arXiv Detail & Related papers (2026-02-25T12:05:25Z) - SWE-Refactor: A Repository-Level Benchmark for Real-World LLM-Based Code Refactoring [20.694251041823097]
Large Language Models (LLMs) have attracted wide interest for tackling software engineering tasks. Existing benchmarks commonly suffer from three shortcomings. SWE-Refactor comprises 1,099 developer-written, behavior-preserving refactorings mined from 18 Java projects.
arXiv Detail & Related papers (2026-02-03T16:36:29Z) - From Human to Machine Refactoring: Assessing GPT-4's Impact on Python Class Quality and Readability [46.83143241367452]
Refactoring aims to improve code quality without altering program behavior. Recent advances in Large Language Models (LLMs) have introduced new opportunities for automated code refactoring. We present an empirical study on LLM-driven refactoring using GPT-4o, applied to 100 Python classes from the ClassEval benchmark. Our findings show that GPT-4o generally produces behavior-preserving refactorings that reduce code smells and improve quality metrics, albeit at the cost of decreased readability.
arXiv Detail & Related papers (2026-01-19T15:22:37Z) - Test Case Generation from Bug Reports via Large Language Models: A Cognitive Layered Evaluation Framework [10.919459368597295]
We present a systematic evaluation of the reasoning of Large Language Models (LLMs) in test case generation. We evaluate StarCoder and GPT-4o on Defects4J, GHRB, and mutated variants that introduce linguistic and semantic challenges.
arXiv Detail & Related papers (2025-10-06T20:47:12Z) - CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z) - Measuring how changes in code readability attributes affect code quality evaluation by Large Language Models [2.3204178451683264]
Code readability is one of the main aspects of code quality, influenced by various properties like identifier names, comments, code structure, and adherence to standards. This paper explores using Large Language Models (LLMs) to evaluate code quality attributes related to readability in a standardized, reproducible, and consistent manner.
arXiv Detail & Related papers (2025-07-05T11:08:03Z) - ConsistencyChecker: Tree-based Evaluation of LLM Generalization Capabilities [14.13459302125202]
Evaluating consistency in large language models (LLMs) is crucial for ensuring reliability. Traditional self-consistency methods often miss subtle semantic changes in natural language and functional shifts in code or equations. We propose ConsistencyChecker, a tree-based evaluation framework designed to measure consistency through sequences of reversible transformations.
arXiv Detail & Related papers (2025-06-14T07:18:33Z) - Mind the Gap: A Readability-Aware Metric for Test Code Complexity [13.258954013620885]
We introduce CCTR, a Test-Aware Cognitive Complexity metric tailored for unit tests. We evaluate 15,750 test suites generated by EvoSuite, GPT-4o, and Mistral Large-1024 across 350 classes from Defects4J and SF110.
arXiv Detail & Related papers (2025-06-07T11:16:13Z) - Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are increasingly used for automated evaluation in various scenarios. Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models. We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z) - Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning [59.25951947621526]
We propose an approach that transforms existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances verification accuracy.
arXiv Detail & Related papers (2025-02-19T15:32:11Z) - EquiBench: Benchmarking Large Language Models' Understanding of Program Semantics via Equivalence Checking [55.81461218284736]
EquiBench is a new benchmark for evaluating large language models (LLMs) on equivalence checking: determining whether two programs produce identical outputs for all possible inputs. We evaluate 19 state-of-the-art LLMs and find that the best accuracies are 63.8% and 76.2%, only modestly above the 50% random baseline.
arXiv Detail & Related papers (2025-02-18T02:54:25Z) - Automated Refactoring of Non-Idiomatic Python Code: A Differentiated Replication with LLMs [54.309127753635366]
We present the results of a replication study in which we investigate GPT-4's effectiveness in recommending and suggesting idiomatic refactoring actions. Our findings underscore the potential of LLMs to achieve tasks where, in the past, implementing recommenders based on complex code analyses was required.
arXiv Detail & Related papers (2025-01-28T15:41:54Z) - StructTest: Benchmarking LLMs' Reasoning through Compositional Structured Outputs [78.84060166851805]
StructTest is a novel benchmark that evaluates large language models (LLMs) on their ability to follow compositional instructions and generate structured outputs. Assessments are conducted deterministically using a rule-based evaluator, which can be easily extended to new tasks and datasets. We demonstrate that StructTest remains challenging even for top-performing models like Deepseek-V3/R1 and GPT-4o.
arXiv Detail & Related papers (2024-12-23T22:08:40Z) - Semantic Consistency for Assuring Reliability of Large Language Models [9.040736633675136]
Large Language Models (LLMs) exhibit remarkable fluency and competence across various natural language tasks. We introduce a general measure of semantic consistency and formulate multiple versions of this metric to evaluate the performance of various LLMs. We propose a novel prompting strategy, called Ask-to-Choose (A2C), to enhance semantic consistency.
arXiv Detail & Related papers (2023-08-17T18:11:33Z)