A Framework for Creating Non-Regressive Test Cases via Branch Consistency Analysis Driven by Descriptions
- URL: http://arxiv.org/abs/2506.07486v1
- Date: Mon, 09 Jun 2025 07:05:48 GMT
- Title: A Framework for Creating Non-Regressive Test Cases via Branch Consistency Analysis Driven by Descriptions
- Authors: Yuxiang Zhang, Pengyu Xue, Zhen Yang, Xiaoxue Ren, Xiang Li, Linhao Wu, Jiancheng Zhao, Xingda Yu
- Abstract summary: DISTINCT is a Description-guided, branch-consistency analysis framework. It transforms Large Language Model (LLM)-based generators into fault-aware test generators. It achieves an average improvement of 14.64% in Compilation Success Rate (CSR) and 6.66% in Passing Rate (PR).
- Score: 9.141981611891715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated test-generation research overwhelmingly assumes the correctness of focal methods, yet practitioners routinely face non-regression scenarios where the focal method may be defective. A baseline evaluation of EvoSuite and two leading Large Language Model (LLM)-based generators, namely ChatTester and ChatUniTest, on defective focal methods reveals that despite achieving up to 83% branch coverage, none of the generated tests expose the defects. To address this problem, we first construct two new benchmarks, Defects4J-Desc and QuixBugs-Desc, for our experiments. In each benchmark, every focal method is paired with an additional Natural Language Description (NLD) that captures its intended functionality. We then propose DISTINCT, a Description-guided, branch-consistency analysis framework that transforms LLMs into fault-aware test generators. DISTINCT comprises three iterative components: (1) a Generator that derives initial tests from the NLDs and the focal method, (2) a Validator that iteratively fixes uncompilable tests using compiler diagnostics, and (3) an Analyzer that iteratively aligns test behavior with NLD semantics via branch-level analysis. Extensive experiments confirm the effectiveness of our approach. Compared to state-of-the-art methods, DISTINCT achieves an average improvement of 14.64% in Compilation Success Rate (CSR) and 6.66% in Passing Rate (PR) across both benchmarks. It also notably enhances Defect Detection Rate (DDR) on both benchmarks, with a particularly large gain of 149.26% on Defects4J-Desc. In terms of code coverage, DISTINCT improves Statement Coverage (SC) by an average of 3.77% and Branch Coverage (BC) by 5.36%. These results set a new baseline for non-regressive test generation and highlight how description-driven reasoning enables LLMs to move beyond coverage chasing toward effective defect detection.
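The abstract sketches a three-stage Generator/Validator/Analyzer loop. The following minimal Python sketch illustrates how such a loop could be orchestrated; the `FocalMethod` class, the `llm_complete`, `compile_test`, and `branch_report` helpers, and the prompt wording are hypothetical placeholders, not the authors' implementation.

```python
# Hypothetical sketch of a Generator -> Validator -> Analyzer loop in the
# spirit of DISTINCT; helper names and prompts are illustrative only.
from dataclasses import dataclass

@dataclass
class FocalMethod:
    source: str   # source of the method under test
    nld: str      # Natural Language Description of intended behavior

def llm_complete(prompt: str) -> str:
    """Placeholder for a call to an LLM completion API."""
    raise NotImplementedError

def compile_test(test_src: str) -> tuple[bool, str]:
    """Placeholder: compile the test class, return (ok, compiler diagnostics)."""
    raise NotImplementedError

def branch_report(test_src: str, focal: FocalMethod) -> str:
    """Placeholder: run the test and summarize which branches it exercises."""
    raise NotImplementedError

def generate_test(focal: FocalMethod, max_rounds: int = 3) -> str:
    # (1) Generator: derive an initial test from the NLD and the focal method.
    test = llm_complete(
        f"Write a unit test for the following method.\n"
        f"Intended behavior: {focal.nld}\nMethod:\n{focal.source}"
    )
    for _ in range(max_rounds):
        # (2) Validator: repair uncompilable tests using compiler diagnostics.
        ok, diagnostics = compile_test(test)
        if not ok:
            test = llm_complete(f"Fix this test so it compiles.\n"
                                f"Diagnostics:\n{diagnostics}\nTest:\n{test}")
            continue
        # (3) Analyzer: align observed branch behavior with the NLD semantics.
        report = branch_report(test, focal)
        test = llm_complete(
            f"Given the intended behavior:\n{focal.nld}\n"
            f"and the branches actually exercised:\n{report}\n"
            f"revise the test so that mismatches with the description fail:\n{test}"
        )
    return test
```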
Related papers
- YATE: The Role of Test Repair in LLM-Based Unit Test Generation [22.67442101368384]
We propose a technique for repairing incorrect LLM-generated tests through a combination of rule-based static analysis and re-prompting. We evaluate this simple approach, named YATE, on a set of 6 open-source projects. YATE achieves 22% higher line coverage, 20% higher branch coverage, and kills 20% more mutants at a comparable cost.
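As a rough illustration of the described combination of rule-based static analysis and re-prompting, the sketch below applies a couple of cheap syntactic fixes before falling back to an LLM repair call; the specific rules and the `compile_fn`/`reprompt_fn` callbacks are assumptions, not YATE's actual rule set.

```python
# Hypothetical repair loop in the spirit of YATE: apply cheap rule-based
# fixes first, then fall back to re-prompting the LLM with diagnostics.
import re

def rule_based_fixes(test_src: str) -> str:
    """Illustrative static fixes; real rules would target common LLM mistakes."""
    fixed = test_src
    if "@Test" in fixed and "import org.junit.Test;" not in fixed:
        fixed = "import org.junit.Test;\n" + fixed   # add a missing import
    fixed = re.sub(r";\s*;", ";", fixed)             # drop duplicated semicolons
    return fixed

def repair(test_src: str, compile_fn, reprompt_fn, max_attempts: int = 2) -> str:
    test = rule_based_fixes(test_src)
    for _ in range(max_attempts):
        ok, diagnostics = compile_fn(test)
        if ok:
            return test
        test = reprompt_fn(test, diagnostics)        # ask the LLM to repair it
    return test
```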
arXiv Detail & Related papers (2025-07-24T11:32:31Z) - Deep Learning Framework Testing via Model Mutation: How Far Are We? [30.292791319442404]
We revisit the defect detection ability of existing mutation-based testing methods. We identify 39 unique defects across just 23 models, of which 31 were confirmed by developers and eight have been fixed.
arXiv Detail & Related papers (2025-06-21T08:44:33Z) - Boosting Rust Unit Test Coverage through Hybrid Program Analysis and Large Language Models [14.536415473544146]
This paper presents PALM, an approach that leverages large language models (LLMs) to enhance the generation of high-coverage unit tests. PALM performs program analysis to identify branching conditions within functions, which are then combined into path constraints. We implement the approach and evaluate it on 10 open-source Rust crates.
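PALM targets Rust, but the idea of collecting branch conditions and combining them into path constraints can be illustrated with a small Python analogue; the use of Python's `ast` module and the naive enumeration of every true/false combination (which ignores path feasibility and nesting) are simplifications for illustration only.

```python
# Illustrative analogue: collect branch conditions from a function and
# enumerate candidate path constraints (one per combination of outcomes).
import ast
from itertools import product

def path_constraints(func_src: str):
    """Yield one constraint list per combination of branch outcomes (2^n)."""
    tree = ast.parse(func_src)
    conditions = [ast.unparse(node.test)
                  for node in ast.walk(tree) if isinstance(node, ast.If)]
    for outcomes in product([True, False], repeat=len(conditions)):
        yield [cond if taken else f"not ({cond})"
               for cond, taken in zip(conditions, outcomes)]

example = ("def f(x):\n"
           "    if x > 0:\n"
           "        if x % 2 == 0:\n"
           "            return 1\n"
           "    return 0")
for constraint in path_constraints(example):
    print(constraint)   # e.g. ['x > 0', 'x % 2 == 0'] ... ['not (x > 0)', 'not (x % 2 == 0)']
```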
arXiv Detail & Related papers (2025-06-10T17:21:21Z) - T2I-Eval-R1: Reinforcement Learning-Driven Reasoning for Interpretable Text-to-Image Evaluation [60.620408007636016]
We propose T2I-Eval-R1, a novel reinforcement learning framework that trains open-source MLLMs using only coarse-grained quality scores. Our approach integrates Group Relative Policy Optimization into the instruction-tuning process, enabling models to generate both scalar scores and interpretable reasoning chains.
arXiv Detail & Related papers (2025-05-23T13:44:59Z) - Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning [48.33401015101481]
We propose Behavior Constrained Policy Gradient with Negative Sample Augmentation (BCPG-NSA), a fine-grained offline framework that encompasses three stages: 1) sample segmentation, 2) consensus-based step correctness assessment combining LLM and PRM judgers, and 3) policy optimization with NSA designed to effectively mine positive steps within negative samples. Experimental results show that BCPG-NSA outperforms baselines on several challenging math/coding reasoning benchmarks using the same training dataset.
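One plausible, simplified reading of the consensus-based step correctness assessment is sketched below; the unanimity rule, the `prm_threshold` value, and the judge interfaces are assumptions rather than the paper's exact procedure.

```python
# Sketch (not the paper's implementation): a step inside a negative sample is
# mined as "positive" only when an LLM judge and a process reward model agree.
def mine_positive_steps(steps, llm_judge, prm_judge, prm_threshold=0.5):
    """llm_judge(step) -> bool; prm_judge(step) -> score in [0, 1]."""
    positives = []
    for step in steps:
        if llm_judge(step) and prm_judge(step) >= prm_threshold:
            positives.append(step)   # both judgers deem the step correct
    return positives
```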
arXiv Detail & Related papers (2025-05-20T14:16:49Z) - Exploring and Lifting the Robustness of LLM-powered Automated Program Repair with Metamorphic Testing [31.327835928133535]
Large language model-powered Automated Program Repair (LAPR) techniques have achieved state-of-the-art bug-fixing performance. It is crucial to conduct robustness testing on LAPR techniques before their practical deployment. We propose MT-LAPR, a Metamorphic Testing framework exclusively for LAPR techniques.
arXiv Detail & Related papers (2024-10-10T01:14:58Z) - TestART: Improving LLM-based Unit Testing via Co-evolution of Automated Generation and Repair Iteration [7.509927117191286]
Large language models (LLMs) have demonstrated remarkable capabilities in generating unit test cases. We propose TestART, a novel unit test generation method. TestART improves LLM-based unit testing via co-evolution of automated generation and repair iteration.
arXiv Detail & Related papers (2024-08-06T10:52:41Z) - Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation [11.517293765116307]
Unit testing is essential for software reliability, yet manual test creation is time-consuming and often neglected. This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the class level.
arXiv Detail & Related papers (2024-06-28T20:38:41Z) - Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection [59.41026558455904]
We focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets.
We propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to finetune the adaptors and learn task-oriented representation toward anomaly detection.
arXiv Detail & Related papers (2024-01-06T07:30:41Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z) - Self-Evaluation Guided Beam Search for Reasoning [61.523627290397556]
We introduce a stepwise self-evaluation mechanism to guide and calibrate the reasoning process of Large Language Models (LLMs).
We propose a decoding algorithm integrating the self-evaluation guidance via beam search.
Our approach surpasses the corresponding Codex-backboned baselines in few-shot accuracy by 6.34%, 9.56%, and 5.46% on the GSM8K, AQuA, and StrategyQA benchmarks, respectively.
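A hedged sketch of self-evaluation guided beam search is shown below; the linear combination of generation log-probability and self-evaluation confidence, and the `propose_steps`/`self_evaluate` interfaces, are illustrative simplifications rather than the paper's scoring rule.

```python
# Sketch: each candidate next step is scored by the generator's likelihood
# combined with a separate self-evaluation confidence; only top beams survive.
import heapq

def beam_search(propose_steps, self_evaluate, beam_width=3, depth=4, alpha=0.5):
    """propose_steps(chain) -> list of (step_text, logprob);
    self_evaluate(chain, step) -> confidence in [0, 1]."""
    beams = [(0.0, [])]                              # (cumulative score, reasoning chain)
    for _ in range(depth):
        candidates = []
        for score, chain in beams:
            for step, logprob in propose_steps(chain):
                eval_conf = self_evaluate(chain, step)
                combined = alpha * logprob + (1 - alpha) * eval_conf
                candidates.append((score + combined, chain + [step]))
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams[0][1]                               # best-scoring chain
```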
arXiv Detail & Related papers (2023-05-01T02:37:59Z) - ERNIE-SPARSE: Learning Hierarchical Efficient Transformer Through Regularized Self-Attention [48.697458429460184]
Two factors, information bottleneck sensitivity and inconsistency between different attention topologies, could affect the performance of the Sparse Transformer.
This paper proposes a well-designed model named ERNIE-Sparse.
It consists of two distinctive parts: (i) Hierarchical Sparse Transformer (HST) to sequentially unify local and global information, and (ii) Self-Attention Regularization (SAR) to minimize the distance between representations produced by transformers with different attention topologies.
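As a rough sketch of what such a regularizer could look like (the notation, the weighting factor λ, and the choice of distance are assumptions, not the paper's exact formulation):

```latex
% Hedged sketch of a self-attention regularization objective in the spirit of
% SAR: penalize the distance between representations produced under two
% attention topologies (sparse vs. dense), in addition to the task loss.
\mathcal{L} \;=\; \mathcal{L}_{\text{task}}
  \;+\; \lambda \, d\!\left(H^{\text{sparse}},\, H^{\text{dense}}\right),
\qquad d(\cdot,\cdot)\ \text{a distance such as mean-squared error.}
```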
arXiv Detail & Related papers (2022-03-23T08:47:01Z)