LTM: Scalable and Black-box Similarity-based Test Suite Minimization based on Language Models
- URL: http://arxiv.org/abs/2304.01397v5
- Date: Mon, 30 Sep 2024 20:29:59 GMT
- Title: LTM: Scalable and Black-box Similarity-based Test Suite Minimization based on Language Models
- Authors: Rongqi Pan, Taher A. Ghaleb, Lionel Briand
- Abstract summary: Test suites tend to grow as software evolves, often making it infeasible to execute all test cases within the allocated testing budget.
Test suite minimization (TSM) is employed to improve the efficiency of software testing by removing redundant test cases.
We propose LTM (Language model-based Test suite Minimization), a novel, scalable, and black-box similarity-based TSM approach.
- Abstract: Test suites tend to grow as software evolves, often making it infeasible to execute all test cases within the allocated testing budget, especially for large software systems. Test suite minimization (TSM) is employed to improve the efficiency of software testing by removing redundant test cases, thus reducing testing time and resources while maintaining the fault detection capability of the test suite. Most existing TSM approaches rely on code coverage (white-box) or model-based features, which are not always available to test engineers. Recent TSM approaches that rely only on test code (black-box), such as ATM and FAST-R, have been proposed. To address their scalability limitations, we propose LTM (Language model-based Test suite Minimization), a novel, scalable, and black-box similarity-based TSM approach based on large language models (LLMs), which is the first application of LLMs in the context of TSM. To support similarity measurement of test code embeddings, we investigate five pre-trained language models: CodeBERT, GraphCodeBERT, UniXcoder, StarEncoder, and CodeLlama, on which we compute two similarity measures: Cosine Similarity and Euclidean Distance. Our goal is to find similarity measures that are not only computationally more efficient but can also better guide a Genetic Algorithm (GA) to search for optimal minimized test suites, thus reducing the overall search time. Experimental results show that the best configuration of LTM (UniXcoder/Cosine) outperforms ATM in three aspects: (a) achieving a slightly higher saving rate of testing time (41.72% versus 41.02%, on average); (b) attaining a significantly higher fault detection rate (0.84 versus 0.81, on average); and, most importantly, (c) minimizing test suites nearly five times faster on average, with higher gains for larger test suites and systems, thus achieving much higher scalability.
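As a rough sketch of the pipeline the abstract describes (not the authors' implementation; the model checkpoint, mean-pooling strategy, and example tests are assumptions for illustration), test code embeddings and Cosine Similarity could be computed as follows, with the resulting pairwise similarities feeding the fitness function of a GA that searches for a reduced test suite:

```python
# Hedged sketch: embed test methods with UniXcoder (one of the five models
# studied) and compute Cosine Similarity between the embeddings.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL = "microsoft/unixcoder-base"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def embed(test_code: str) -> torch.Tensor:
    """Mean-pool the last hidden states into one vector per test case."""
    inputs = tokenizer(test_code, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)             # (dim,)

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

# Two illustrative JUnit-style test bodies; high similarity suggests redundancy.
t1 = "public void testAdd() { assertEquals(4, add(2, 2)); }"
t2 = "public void testAddNegative() { assertEquals(0, add(2, -2)); }"
print(cosine(embed(t1), embed(t2)))
```

A search-based minimizer would then, for example, penalize high pairwise similarity among retained tests subject to the testing budget; the exact encoding and fitness used by LTM are described in the paper.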
Related papers
- S*: Test Time Scaling for Code Generation [55.11863577956177]
We propose S*, the first hybrid test-time scaling framework for code generation.
S* substantially improves the coverage and selection accuracy of generated code.
arXiv Detail & Related papers (2025-02-20T09:18:53Z)
- Scaling Test-Time Compute Without Verification or RL is Suboptimal [70.28430200655919]
We show that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget.
We corroborate our theory empirically on both didactic and math reasoning problems with 3/8B-sized pre-trained LLMs, where we find verification is crucial for scaling test-time compute.
arXiv Detail & Related papers (2025-02-17T18:43:24Z)
- Automated Robustness Testing for LLM-based NLP Software [6.986328098563149]
There are no known automated robustness testing methods specifically designed for LLM-based NLP software.
Existing testing methods can be applied to LLM-based software through AORTA, but their effectiveness is limited.
We propose a novel testing method for LLM-based software within AORTA called Adaptive Beam Search.
arXiv Detail & Related papers (2024-12-30T15:33:34Z)
- Simple and Provable Scaling Laws for the Test-Time Compute of Large Language Models [70.07661254213181]
We propose two principled algorithms for the test-time compute of large language models.
We prove theoretically that the failure probability of one algorithm decays to zero exponentially as its test-time compute grows.
arXiv Detail & Related papers (2024-11-29T05:29:47Z)
- Scalable Similarity-Aware Test Suite Minimization with Reinforcement Learning [6.9290255098776425]
TripRL is a novel technique to produce a diverse reduced test suite with high test effectiveness.
We show that TripRL's runtime scales linearly with the magnitude of the Multi-Criteria Test Suite Minimization problem.
arXiv Detail & Related papers (2024-08-24T08:43:03Z)
- On Test Sequence Generation using Multi-Objective Particle Swarm Optimization [0.2999888908665658]
Software testing is an important and essential part of the software development life cycle.
In the software industry, testing costs can account for about 35% to 40% of the total cost of a software project.
arXiv Detail & Related papers (2024-04-09T18:35:21Z)
- Active Test-Time Adaptation: Theoretical Analyses and An Algorithm [51.84691955495693]
Test-time adaptation (TTA) addresses distribution shifts for streaming test data in unsupervised settings.
We propose the novel problem setting of active test-time adaptation (ATTA) that integrates active learning within the fully TTA setting.
arXiv Detail & Related papers (2024-04-07T22:31:34Z)
- Code-Aware Prompting: A study of Coverage Guided Test Generation in Regression Setting using LLM [32.44432906540792]
We present SymPrompt, a code-aware prompting strategy for large language models in test generation.
SymPrompt enhances correct test generations by a factor of 5 and bolsters relative coverage by 26% for CodeGen2.
Notably, when applied to GPT-4, SymPrompt improves coverage by over 2x compared to baseline prompting strategies.
arXiv Detail & Related papers (2024-01-31T18:21:49Z)
- Precise Error Rates for Computationally Efficient Testing [75.63895690909241]
We revisit the question of simple-versus-simple hypothesis testing with an eye towards computational complexity.
An existing test based on linear spectral statistics achieves the best possible tradeoff curve between type I and type II error rates.
arXiv Detail & Related papers (2023-11-01T04:41:16Z)
- Noisy Adaptive Group Testing using Bayesian Sequential Experimental Design [63.48989885374238]
When the infection prevalence of a disease is low, Dorfman showed 80 years ago that testing groups of people can prove more efficient than testing people individually.
Our goal in this paper is to propose new group testing algorithms that can operate in a noisy setting.
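For intuition, the expected number of tests per person under Dorfman's noise-free two-stage pooling is easy to compute (the sketch below uses illustrative numbers, not results from this paper):

```python
# Expected tests per person under Dorfman's two-stage pooling (noise-free):
# one pooled test per group, plus individual retests if the pool is positive.
def dorfman_tests_per_person(prevalence: float, group_size: int) -> float:
    p_group_positive = 1 - (1 - prevalence) ** group_size
    return 1 / group_size + p_group_positive

for k in (2, 4, 8, 16):
    print(k, round(dorfman_tests_per_person(0.01, k), 3))
# At 1% prevalence, groups of 8-16 need about 0.2 tests per person,
# versus 1.0 when everyone is tested individually.
```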
arXiv Detail & Related papers (2020-04-26T23:41:33Z)
- Genetic Algorithms for Redundancy in Interaction Testing [0.6396288020763143]
Interaction testing involves designing a suite of tests, which guarantees to detect a fault if one exists among a small number of components interacting together.
Existing algorithms for constructing these test suites usually involve one "fast" algorithm for generating most of the tests, and another "slower" algorithm to "complete" the test suite.
We employ a genetic algorithm that generalizes these approaches and incorporates redundancy by increasing the number of algorithms chosen, which we call "stages".
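As a hedged illustration of the staged pattern this entry describes (the types and names below are hypothetical, not the paper's algorithm), each stage is a construction algorithm that extends the suite left by the previous one:

```python
# Illustrative skeleton of staged test-suite construction: faster stages
# run first, slower stages complete whatever coverage remains.
from typing import Callable, List

Test = tuple
Stage = Callable[[List[Test]], List[Test]]  # extends a partial suite

def staged_construction(stages: List[Stage],
                        is_complete: Callable[[List[Test]], bool]) -> List[Test]:
    suite: List[Test] = []
    for stage in stages:
        if is_complete(suite):
            break  # earlier stages already reached full coverage
        suite = stage(suite)
    return suite

# Usage (hypothetical stage functions):
#   staged_construction([fast_greedy, slow_completion], covers_all_interactions)
```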
arXiv Detail & Related papers (2020-02-13T10:16:46Z)