Generalized Coverage Criteria for Combinatorial Sequence Testing
- URL: http://arxiv.org/abs/2201.00522v4
- Date: Tue, 31 Oct 2023 07:34:39 GMT
- Title: Generalized Coverage Criteria for Combinatorial Sequence Testing
- Authors: Achiya Elyasaf, Eitan Farchi, Oded Margalit, Gera Weiss, Yeshayahu Weiss
- Abstract summary: We present a new model-based approach for testing systems that use sequences of actions and assertions as test vectors.
Our solution includes a method for quantifying testing quality, a tool for generating high-quality test suites based on the coverage criteria we propose, and a framework for assessing risks.
- Score: 4.807321976136717
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a new model-based approach for testing systems that use sequences
of actions and assertions as test vectors. Our solution includes a method for
quantifying testing quality, a tool for generating high-quality test suites
based on the coverage criteria we propose, and a framework for assessing risks.
For testing quality, we propose a method that specifies generalized coverage
criteria over sequences of actions, which extends previous approaches. Our
publicly available tool demonstrates how to extract effective test suites from
test plans based on these criteria. We also present a Bayesian approach for
measuring the probabilities of bugs or risks, and show how this quantification
can help achieve an informed balance between exploitation and exploration in
testing. Finally, we provide an empirical evaluation demonstrating the
effectiveness of our tool in finding bugs, assessing risks, and achieving
coverage.
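To make the idea of coverage criteria over action sequences concrete, here is a minimal sketch of plain t-way sequence coverage, a standard combinatorial sequence-testing criterion of the kind the paper's generalized criteria extend, plus a simple Beta-Bernoulli posterior as one basic way to quantify bug risk. The function names and example suite are illustrative and are not taken from the authors' tool.

```python
# Illustrative sketch only: plain t-way sequence coverage and a simple
# Bayesian risk estimate; not the authors' generalized criteria or tool.
from itertools import combinations, product


def length_t_subsequences(test, t):
    """All length-t subsequences of a test (order-preserving, not necessarily contiguous)."""
    return {tuple(test[i] for i in idx) for idx in combinations(range(len(test)), t)}


def t_way_sequence_coverage(test_suite, actions, t):
    """Fraction of all length-t action sequences that occur as a subsequence
    of at least one test in the suite."""
    target = set(product(actions, repeat=t))
    covered = set()
    for test in test_suite:
        covered |= length_t_subsequences(test, t)
    return len(covered & target) / len(target)


def bug_probability_posterior(failures, runs, prior_a=1.0, prior_b=1.0):
    """Posterior mean failure probability under a Beta(prior_a, prior_b) prior;
    one simple Bayesian quantification of risk (the paper's model may be richer)."""
    return (prior_a + failures) / (prior_a + prior_b + runs)


# Example: two tests over four actions; 2-way sequence coverage.
suite = [["login", "browse", "buy", "logout"],
         ["login", "browse", "logout"]]
print(t_way_sequence_coverage(suite, ["login", "browse", "buy", "logout"], t=2))  # 0.375
print(bug_probability_posterior(failures=2, runs=50))                             # ~0.058
```

A coverage-directed generator would then search for tests that raise this fraction, while posterior risk estimates of this kind can steer the exploitation/exploration trade-off the abstract mentions.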
Related papers
- Human-Calibrated Automated Testing and Validation of Generative Language Models [3.2855317710497633]
This paper introduces a comprehensive framework for the evaluation and validation of generative language models (GLMs), focusing on Retrieval-Augmented Generation (RAG) systems deployed in high-stakes domains such as banking.
arXiv Detail & Related papers (2024-11-25T13:53:36Z)
- Beyond ELBOs: A Large-Scale Evaluation of Variational Methods for Sampling [14.668634411361307]
We introduce a benchmark that evaluates sampling methods using a standardized task suite and a broad range of performance criteria.
We study existing metrics for quantifying mode collapse and introduce novel metrics for this purpose.
arXiv Detail & Related papers (2024-06-11T16:23:33Z)
- Backdoor-based Explainable AI Benchmark for High Fidelity Evaluation of Attribution Methods [49.62131719441252]
Attribution methods compute importance scores for input features to explain the output predictions of deep models.
In this work, we first identify a set of fidelity criteria that reliable benchmarks for attribution methods are expected to fulfill.
We then introduce a Backdoor-based eXplainable AI benchmark (BackX) that adheres to the desired fidelity criteria.
arXiv Detail & Related papers (2024-05-02T13:48:37Z)
- Few-Shot Scenario Testing for Autonomous Vehicles Based on Neighborhood Coverage and Similarity [8.97909097472183]
Testing and evaluating the safety performance of autonomous vehicles (AVs) is essential before their large-scale deployment.
The number of testing scenarios permissible for a specific AV is severely limited by tight constraints on testing budgets and time.
We formulate this problem for the first time as the "few-shot testing" (FST) problem and propose a systematic framework to address this challenge.
arXiv Detail & Related papers (2024-02-02T04:47:14Z)
- Measuring Software Testability via Automatically Generated Test Cases [8.17364116624769]
We propose a new approach to measuring testability based on software metrics.
Our approach exploits automatic test generation and mutation analysis to quantify the evidence about the relative hardness of developing effective test cases.
arXiv Detail & Related papers (2023-07-30T09:48:51Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
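As a rough illustration of what "dynamically adjusting items" means in psychometrics, the sketch below performs one step of textbook computerized adaptive testing under a 2PL item-response model, choosing the unasked item with maximal Fisher information at the current ability estimate; it is a generic illustration, not the procedure proposed in that paper.

```python
# Generic computerized-adaptive-testing step (2PL IRT); illustrative only.
import numpy as np


def p_correct(theta, a, b):
    """2PL model: probability that a test-taker with ability theta answers an
    item with discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)


def next_item(theta_hat, items, asked):
    """Pick the unasked item that is most informative at the current ability estimate."""
    candidates = [(i, fisher_information(theta_hat, a, b))
                  for i, (a, b) in enumerate(items) if i not in asked]
    return max(candidates, key=lambda c: c[1])[0]


items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]   # (discrimination, difficulty) per item
print(next_item(theta_hat=0.3, items=items, asked={0}))
```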
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Uncertainty-Driven Action Quality Assessment [67.20617610820857]
We propose a novel probabilistic model, named Uncertainty-Driven AQA (UD-AQA), to capture the diversity among multiple judge scores.
We estimate the uncertainty of each prediction and use it to re-weight the AQA regression loss.
Our proposed method achieves competitive results on three benchmarks: the Olympic-event datasets MTL-AQA and FineDiving, and the surgical-skill dataset JIGSAWS.
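Uncertainty-driven re-weighting of a regression loss is often realized as a heteroscedastic Gaussian negative log-likelihood; the snippet below sketches that generic form, which may differ from UD-AQA's actual formulation.

```python
# Generic heteroscedastic (uncertainty-weighted) regression loss; not UD-AQA's exact loss.
import torch


def uncertainty_weighted_mse(pred_score, true_score, log_var):
    """Predictions with larger estimated uncertainty (log_var) are down-weighted,
    while log_var itself is penalized so the model cannot inflate uncertainty freely."""
    precision = torch.exp(-log_var)
    return (precision * (pred_score - true_score) ** 2 + log_var).mean()
```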
arXiv Detail & Related papers (2022-07-29T07:21:15Z)
- Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control [67.52000805944924]
Learn then Test (LTT) is a framework for calibrating machine learning models.
Our main insight is to reframe the risk-control problem as multiple hypothesis testing.
We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision.
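A minimal reading of "risk control as multiple hypothesis testing": for each candidate parameter lambda, test the null hypothesis "risk(lambda) > alpha" with a valid p-value on calibration data, then keep only the lambdas that survive a family-wise correction. The sketch below uses a Hoeffding p-value and a Bonferroni correction; it is a simplified illustration of the idea, not the authors' implementation, which provides tighter bounds and corrections.

```python
# Simplified LTT-style calibration sketch: Hoeffding p-values + Bonferroni.
import numpy as np


def hoeffding_p_value(empirical_risk, n, alpha):
    """Valid p-value for H0: true risk > alpha, assuming losses bounded in [0, 1]."""
    gap = max(alpha - empirical_risk, 0.0)
    return float(np.exp(-2.0 * n * gap ** 2))


def ltt_calibrate(losses_per_lambda, alpha=0.1, delta=0.05):
    """losses_per_lambda maps each candidate lambda to per-example losses in [0, 1].
    Returns the lambdas certified to have risk <= alpha with probability >= 1 - delta."""
    m = len(losses_per_lambda)
    certified = []
    for lam, losses in losses_per_lambda.items():
        p = hoeffding_p_value(losses.mean(), len(losses), alpha)
        if p <= delta / m:  # Bonferroni correction over the candidate grid
            certified.append(lam)
    return certified


rng = np.random.default_rng(0)
losses = {lam: lam * rng.random(500) for lam in (0.05, 0.1, 0.2)}
print(ltt_calibrate(losses, alpha=0.1, delta=0.05))  # only the safest lambdas survive
```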
arXiv Detail & Related papers (2021-10-03T17:42:03Z)
- Group Testing with Non-identical Infection Probabilities [59.96266198512243]
We develop an adaptive group testing algorithm using the set formation method.
We show that our algorithm outperforms the state of the art, and performs close to the entropy lower bound.
arXiv Detail & Related papers (2021-08-27T17:53:25Z)
- Feedback Effects in Repeat-Use Criminal Risk Assessments [0.0]
We show that risk can propagate over sequential decisions in ways that are not captured by one-shot tests.
Risk assessment tools operate in a highly complex and path-dependent process, fraught with historical inequity.
arXiv Detail & Related papers (2020-11-28T06:40:05Z)
- SAMBA: Safe Model-Based & Active Reinforcement Learning [59.01424351231993]
SAMBA is a framework for safe reinforcement learning that combines aspects from probabilistic modelling, information theory, and statistics.
We evaluate our algorithm on a variety of safe dynamical system benchmarks involving both low and high-dimensional state representations.
We provide intuition for the effectiveness of the framework through a detailed analysis of our active metrics and safety constraints.
arXiv Detail & Related papers (2020-06-12T10:40:46Z)