Related papers: A Framework for Creating Non-Regressive Test Cases via Branch Consistency Analysis Driven by Descriptions

A Framework for Creating Non-Regressive Test Cases via Branch Consistency Analysis Driven by Descriptions

URL: http://arxiv.org/abs/2506.07486v1
Date: Mon, 09 Jun 2025 07:05:48 GMT
Title: A Framework for Creating Non-Regressive Test Cases via Branch Consistency Analysis Driven by Descriptions
Authors: Yuxiang Zhang, Pengyu Xue, Zhen Yang, Xiaoxue Ren, Xiang Li, Linhao Wu, Jiancheng Zhao, Xingda Yu,
Abstract summary: DISTINCT is a Description-guided, branch-consistency analysis framework.<n>It transforms Large Language Model (LLM)-based generators into fault-aware test generators.<n>It achieves an average improvement of 14.64% in Compilation Success Rate (CSR) and 6.66% in Passing Rate (PR)
Score: 9.141981611891715
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automated test-generation research overwhelmingly assumes the correctness of focal methods, yet practitioners routinely face non-regression scenarios where the focal method may be defective. A baseline evaluation of EvoSuite and two leading Large Language Model (LLM)-based generators, namely ChatTester and ChatUniTest, on defective focal methods reveals that despite achieving up to 83% of branch coverage, none of the generated tests expose defects. To resolve this problem, we first construct two new benchmarks, namely Defects4J-Desc and QuixBugs-Desc, for experiments. In particular, each focal method is equipped with an extra Natural Language Description (NLD) for code functionality understanding. Subsequently, we propose DISTINCT, a Description-guided, branch-consistency analysis framework that transforms LLMs into fault-aware test generators. DISTINCT carries three iterative components: (1) a Generator that derives initial tests based on the NLDs and the focal method, (2) a Validator that iteratively fixes uncompilable tests using compiler diagnostics, and (3) an Analyzer that iteratively aligns test behavior with NLD semantics via branch-level analysis. Extensive experiments confirm the effectiveness of our approach. Compared to state-of-the-art methods, DISTINCT achieves an average improvement of 14.64% in Compilation Success Rate (CSR) and 6.66% in Passing Rate (PR) across both benchmarks. It notably enhances Defect Detection Rate (DDR) on both benchmarks, with a particularly significant gain of 149.26% observed on Defects4J-Desc. In terms of code coverage, DISTINCT improves Statement Coverage (SC) by an average of 3.77% and Branch Coverage (BC) by 5.36%. These results set a new baseline for non-regressive test generation and highlight how description-driven reasoning enables LLMs to move beyond coverage chasing toward effective defect detection.

Related papers

ReLoop: Structured Modeling and Behavioral Verification for Reliable LLM-Based Optimization [6.572539312871392]
Large language models (LLMs) can translate natural language into optimization code, but silent failures pose a critical risk.<n>We introduce ReLoop, addressing silent failures from two complementary directions.
arXiv Detail & Related papers (2026-02-17T20:20:33Z)
PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering [71.15346406323827]
We introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification.<n>We find that current verifiers frequently fail to detect derivation flaws.<n>We propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME.
arXiv Detail & Related papers (2026-02-12T04:45:01Z)
Consistency Meets Verification: Enhancing Test Generation Quality in Large Language Models Without Ground-Truth Solutions [1.9196411948992402]
We present ConVerTest, a novel two-stage pipeline for synthesizing reliable tests without requiring prior code implementations.<n>Experiments on BIGCODEBENCH and LESS BASIC PYTHON PROBLEMS benchmarks demonstrate that ConVerTest improves test validity, line coverage, and mutation scores by up to 39%, 28%, and 18% respectively.
arXiv Detail & Related papers (2026-02-11T04:40:38Z)
Test vs Mutant: Adversarial LLM Agents for Robust Unit Test Generation [9.439427795905637]
Large language model (LLM)-based methods generate more human-readable tests but often suffer from low coverage and compilability.<n>We propose AdverTest, a novel adversarial framework for LLM-powered test case generation.<n>We show that our approach improves fault detection rates by 8.56% over the best existing LLM-based methods and by 63.30% over EvoSuite.
arXiv Detail & Related papers (2026-02-08T22:34:30Z)
Synthesizing File-Level Data for Unit Test Generation with Chain-of-Thoughts via Self-Debugging [40.29934051200609]
We propose a novel data-distillation approach to produce high-quality UT training.<n>We apply this pipeline to a large corpus of open-source projects.<n>An empirical evaluation shows that the fine-tuned model achieves high UT generation effectiveness.
arXiv Detail & Related papers (2026-02-03T06:52:54Z)
CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation.<n>Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency.<n>We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
CLUE: Non-parametric Verification from Experience via Hidden-State Clustering [64.50919789875233]
We show that correctness of a solution is encoded as a geometrically separable signature within the trajectory of hidden activations.<n>ClUE consistently outperforms LLM-as-a-judge baselines and matches or exceeds modern confidence-based methods in reranking candidates.
arXiv Detail & Related papers (2025-10-02T02:14:33Z)
Eigen-1: Adaptive Multi-Agent Refinement with Monitor-Based RAG for Scientific Reasoning [53.45095336430027]
We develop a unified framework that combines implicit retrieval and structured collaboration.<n>On Humanity's Last Exam (HLE) Bio/Chem Gold, our framework achieves 48.3% accuracy.<n>Results on SuperGPQA and TRQA confirm robustness across domains.
arXiv Detail & Related papers (2025-09-25T14:05:55Z)
ORFuzz: Fuzzing the "Other Side" of LLM Safety -- Testing Over-Refusal [27.26251627767238]
Large Language Models (LLMs) increasingly exhibit over-refusal - erroneously rejecting benign queries due to overly conservative safety measures.<n>This paper introduces the first evolutionary testing framework, ORFuzz, for the systematic detection and analysis of LLM over-refusals.
arXiv Detail & Related papers (2025-08-15T05:03:26Z)
YATE: The Role of Test Repair in LLM-Based Unit Test Generation [22.67442101368384]
We propose a technique for repairing some of these incorrect tests through a combination of rule-based static analysis and re-prompting.<n>We evaluate this simple approach, named YATE, on a set of 6 open-source projects.<n>YATE achieves 22% higher line coverage, 20% higher branch coverage and kill 20% more mutants at a comparable cost.
arXiv Detail & Related papers (2025-07-24T11:32:31Z)
Deep Learning Framework Testing via Model Mutation: How Far Are We? [30.292791319442404]
We revisit the defect detection ability of existing mutation-based testing methods.<n>We identify 39 unique defects across just 23 models, of which 31 were confirmed by developers, and eight have been fixed.
arXiv Detail & Related papers (2025-06-21T08:44:33Z)
Boosting Rust Unit Test Coverage through Hybrid Program Analysis and Large Language Models [14.536415473544146]
This paper presents PALM, an approach that leverages large language models (LLMs) to enhance the generation of high-coverage unit tests.<n> PALM performs program analysis to identify branching conditions within functions, which are then combined into path constraints.<n>We implement the approach and evaluate it in 10 open-source Rust crates.
arXiv Detail & Related papers (2025-06-10T17:21:21Z)
T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores.<n>Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z)
Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning [48.33401015101481]
We propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA)<n>BCPG-NSA is a fine-grained offline framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples.<n> Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset.
arXiv Detail & Related papers (2025-05-20T14:16:49Z)
Exploring and Lifting the Robustness of LLM-powered Automated Program Repair with Metamorphic Testing [31.327835928133535]
Large language model-powered Automated Program Repair (LAPR) techniques have achieved state-of-the-art bug-fixing performance.<n>It is crucial to conduct robustness testing on LAPR techniques before their practical deployment.<n>We propose MT-LAPR, a Metamorphic Testing framework exclusively for LAPR techniques.
arXiv Detail & Related papers (2024-10-10T01:14:58Z)
TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.509927117191286]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases.<n>We propose TestART, a novel unit test generation method.<n>TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
arXiv Detail & Related papers (2024-08-06T10:52:41Z)
Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation [11.517293765116307]
Unit testing is essential for software reliability, yet manual test creation is time-consuming and often neglected.<n>This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the class level.
arXiv Detail & Related papers (2024-06-28T20:38:41Z)
Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection [59.41026558455904]
We focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets. We propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to finetune the adaptors and learn task-oriented representation toward anomaly detection.
arXiv Detail & Related papers (2024-01-06T07:30:41Z)
Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks. We instruct an LLM to self-evaluate its answers. We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits. Most LLMs struggle on SummEdits, with performance close to random chance. The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Model (LLM) We propose a decoding algorithm integrating the self-evaluation guidance via beam search. Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by $6.34%$, $9.56%$, and $5.46%$ on the GSM8K, AQuA, and StrategyQA.
arXiv Detail & Related papers (2023-05-01T02:37:59Z)
ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention [48.697458429460184]
Two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer. This paper proposes a well-designed model named ERNIE-Sparse. It consists of two distinctive parts: (i) Hierarchical Sparse Transformer (HST) to sequentially unify local and global information, and (ii) Self-Attention Regularization (SAR) to minimize the distance for transformers with different attention topologies.
arXiv Detail & Related papers (2022-03-23T08:47:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.