Adaptive Testing for LLM-Based Applications: A Diversity-based Approach
- URL: http://arxiv.org/abs/2501.13480v1
- Date: Thu, 23 Jan 2025 08:53:12 GMT
- Title: Adaptive Testing for LLM-Based Applications: A Diversity-based Approach
- Authors: Juyeon Yoon, Robert Feldt, Shin Yoo
- Abstract summary: We show that diversity-based testing techniques, such as Adaptive Random Testing (ART), can be effectively applied to the testing of prompt templates.
Our results, obtained using various implementations that explore several string-based distances, confirm that our approach enables the discovery of failures with reduced testing budgets.
- Score: 15.33985438101206
- License:
- Abstract: The recent surge in building software systems powered by Large Language Models (LLMs) has led to the development of various testing frameworks, primarily focused on treating prompt templates as the unit of testing. Despite the significant costs associated with test input execution and output assessment, the curation of optimized test suites is still overlooked in these tools, which calls for tailored test selection or prioritization strategies. In this paper, we show that diversity-based testing techniques, such as Adaptive Random Testing (ART) with appropriate string distance metrics, can be effectively applied to the testing of prompt templates. Our proposed adaptive testing approach adjusts the conventional ART process to this context by selecting new test inputs based on scores derived from the existing test suite and its labelling results. Our results, obtained using various implementations that explore several string-based distances, confirm that our approach enables the discovery of failures with reduced testing budgets and promotes the generation of more varied outputs.
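To make the selection idea concrete, below is a minimal Python sketch of ART-style test selection over prompt-template inputs using a normalized string distance. The helper names (`string_distance`, `select_next_test`) and the SequenceMatcher-based distance are illustrative assumptions rather than the paper's exact implementation, and the paper's scoring additionally folds in the labelling results of already-executed tests, which this sketch omits.

```python
# Minimal sketch of diversity-based (ART-style) test selection for prompt inputs.
# Assumptions: normalized SequenceMatcher distance stands in for the paper's
# string distance metrics; function names are hypothetical.
import random
from difflib import SequenceMatcher


def string_distance(a: str, b: str) -> float:
    """Normalized string distance in [0, 1]; 0.0 means identical strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def select_next_test(candidates, executed, sample_size=10):
    """ART-style selection: pick the candidate whose minimum distance to the
    already-executed inputs is largest, i.e. the most diverse next test."""
    if not executed:
        return random.choice(candidates)
    pool = random.sample(candidates, min(sample_size, len(candidates)))
    return max(pool, key=lambda c: min(string_distance(c, e) for e in executed))


# Usage: pick a small, diverse set of inputs for a prompt template under a budget.
candidates = [f"Summarize this customer review: sample text {i}" for i in range(100)]
executed = []
for _ in range(5):  # testing budget of 5 executions
    remaining = [c for c in candidates if c not in executed]
    nxt = select_next_test(remaining, executed)
    executed.append(nxt)
    # The LLM application would be run on `nxt` here and its output labelled;
    # the paper's approach feeds such labels back into the selection score.
```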
Related papers
- LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom LLMs to generate realistic test inputs.
LlamaRestTest surpasses state-of-the-art tools in code coverage and error detection, even with RESTGPT-enhanced specifications.
arXiv Detail & Related papers (2025-01-15T05:51:20Z) - Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT) which uses context as an inductive bias to guide the search for meaningful model failures.
We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z) - Prompt Optimization with EASE? Efficient Ordering-aware Automated Selection of Exemplars [66.823588073584]
Large language models (LLMs) have shown impressive capabilities in real-world applications.
The quality of these exemplars in the prompt greatly impacts performance.
Existing methods fail to adequately account for the impact of exemplar ordering on performance.
arXiv Detail & Related papers (2024-05-25T08:23:05Z) - Measuring Software Testability via Automatically Generated Test Cases [8.17364116624769]
We propose a new approach to pursuing testability measurements based on software metrics.
Our approach exploits automatic test generation and mutation analysis to quantify the evidence about the relative hardness of developing effective test cases.
arXiv Detail & Related papers (2023-07-30T09:48:51Z) - Validation of massively-parallel adaptive testing using dynamic control matching [0.0]
Modern businesses often run many A/B/n tests at the same time and in parallel, and package many content variations into the same messages.
This paper presents a method for disentangling the causal effects of the various tests under conditions of continuous test adaptation.
arXiv Detail & Related papers (2023-05-02T11:28:12Z) - A Comprehensive Survey on Test-Time Adaptation under Distribution Shifts [117.72709110877939]
Test-time adaptation (TTA) has the potential to adapt a pre-trained model to unlabeled data during testing, before making predictions.
We categorize TTA into several distinct groups based on the form of test data, namely, test-time domain adaptation, test-time batch adaptation, and online test-time adaptation.
arXiv Detail & Related papers (2023-03-27T16:32:21Z) - Hybrid Intelligent Testing in Simulation-Based Verification [0.0]
Several million tests may be required to achieve coverage goals.
Coverage-Directed Test Selection learns from coverage feedback to bias testing towards the most effective tests.
Novelty-Driven Verification learns to identify and simulate stimuli that differ from previous stimuli.
arXiv Detail & Related papers (2022-05-19T13:22:08Z) - TTAPS: Test-Time Adaption by Aligning Prototypes using Self-Supervision [70.05605071885914]
We propose a novel modification of the self-supervised training algorithm SwAV that adds the ability to adapt to single test samples.
We show the success of our method on the common benchmark dataset CIFAR10-C.
arXiv Detail & Related papers (2022-05-18T05:43:06Z) - Listen, Adapt, Better WER: Source-free Single-utterance Test-time Adaptation for Automatic Speech Recognition [65.84978547406753]
Test-time Adaptation aims to adapt the model trained on source domains to yield better predictions for test samples.
Single-Utterance Test-time Adaptation (SUTA) is, to the best of our knowledge, the first TTA study in the speech area.
arXiv Detail & Related papers (2022-03-27T06:38:39Z) - Machine Learning Testing in an ADAS Case Study Using Simulation-Integrated Bio-Inspired Search-Based Testing [7.5828169434922]
Deeper generates failure-revealing test scenarios for testing a deep neural network-based lane-keeping system.
In the newly proposed version, we utilize a new set of bio-inspired search algorithms: genetic algorithm (GA), $(\mu+\lambda)$ and $(\mu,\lambda)$ evolution strategies (ES), and particle swarm optimization (PSO).
Our evaluation shows the newly proposed test generators in Deeper represent a considerable improvement on the previous version.
arXiv Detail & Related papers (2022-03-22T20:27:40Z) - Online GANs for Automatic Performance Testing [0.10312968200748115]
We present a novel algorithm for automatic performance testing that uses an online variant of the Generative Adversarial Network (GAN).
The proposed approach does not require a prior training set or model of the system under test.
We consider that the presented algorithm serves as a proof of concept and we hope that it can spark a research discussion on the application of GANs to test generation.
arXiv Detail & Related papers (2021-04-21T06:03:27Z)