When and Why Test Generators for Deep Learning Produce Invalid Inputs: an Empirical Study
- URL: http://arxiv.org/abs/2212.11368v1
- Date: Wed, 21 Dec 2022 21:10:49 GMT
- Title: When and Why Test Generators for Deep Learning Produce Invalid Inputs: an Empirical Study
- Authors: Vincenzo Riccio and Paolo Tonella
- Abstract summary: Testing Deep Learning (DL) based systems inherently requires large and representative test sets to evaluate whether DL systems generalise beyond their training datasets.
Diverse Test Input Generators (TIGs) have been proposed to produce artificial inputs that expose issues of the DL systems by triggering misbehaviours.
This paper investigates to what extent TIGs can generate valid inputs, according to both automated and human validators.
- Score: 4.632232395989182
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Testing Deep Learning (DL) based systems inherently requires large and
representative test sets to evaluate whether DL systems generalise beyond their
training datasets. Diverse Test Input Generators (TIGs) have been proposed to
produce artificial inputs that expose issues of the DL systems by triggering
misbehaviours. Unfortunately, such generated inputs may be invalid, i.e., not
recognisable as part of the input domain, thus providing an unreliable quality
assessment. Automated validators can ease the burden of manually checking the
validity of inputs for human testers, although input validity is a concept
difficult to formalise and, thus, automate.
In this paper, we investigate to what extent TIGs can generate valid inputs,
according to both automated and human validators. We conduct a large empirical
study, involving 2 different automated validators, 220 human assessors, 5
different TIGs and 3 classification tasks. Our results show that 84% of the
artificially generated inputs are valid, according to automated validators, but
their expected label is not always preserved. Automated validators reach a good
consensus with humans (78% accuracy), but still have limitations when dealing
with feature-rich datasets.
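The summary does not detail how the automated validators work; as a minimal illustrative sketch, one common family of validators flags generated inputs whose reconstruction error under a low-dimensional model of the training data is unusually high (the data, names, and threshold below are hypothetical stand-ins):

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical sketch of a distribution-based automated validator:
# inputs whose reconstruction error under a low-dimensional model of
# the training data is unusually high are flagged as likely invalid.
rng = np.random.default_rng(0)
train = rng.normal(size=(1000, 64))          # stand-in for training-set features
generated = rng.normal(size=(50, 64)) * 3.0  # stand-in for TIG outputs

pca = PCA(n_components=16).fit(train)

def reconstruction_error(x):
    """Distance between x and its projection onto the training manifold."""
    return np.linalg.norm(x - pca.inverse_transform(pca.transform(x)), axis=1)

# Accept inputs whose error is within the 99th percentile seen in training.
threshold = np.percentile(reconstruction_error(train), 99)
valid = reconstruction_error(generated) <= threshold
print(f"{valid.mean():.0%} of generated inputs judged valid")
```

Thresholding at a high percentile of the training-set error trades off false alarms against missed invalid inputs.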
Related papers
- XMutant: XAI-based Fuzzing for Deep Learning Systems [6.878645239814823]
XMutant is a technique that leverages explainable artificial intelligence (XAI) techniques to generate challenging test inputs.
Our studies showed that XMutant enables more effective and efficient test generation by focusing on the most impactful parts of the input.
arXiv Detail & Related papers (2025-03-10T12:05:49Z)
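A hedged sketch of the XAI-guided mutation idea, assuming gradient saliency as the XAI signal (the model, shapes, and mutation rule are illustrative, not XMutant's actual implementation):

```python
import torch

# Hedged sketch of XAI-guided mutation (illustrative only, not XMutant's
# actual implementation): use input-gradient saliency as the XAI signal
# and perturb only the most influential pixels of the input.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
image = torch.rand(1, 1, 28, 28, requires_grad=True)

score = model(image).max()            # confidence of the top class
score.backward()                      # gradients w.r.t. the input pixels
saliency = image.grad.abs().flatten()

k = 20                                # mutate only the top-k salient pixels
top = saliency.topk(k).indices
mutant = image.detach().clone().flatten()
mutant[top] += torch.randn(k) * 0.1   # small perturbation on impactful pixels
mutant = mutant.clamp(0, 1).reshape(1, 1, 28, 28)
```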
- Test Input Validation for Vision-based DL Systems: An Active Learning Approach [3.760715803298828]
Testing deep learning (DL) systems requires extensive and diverse, yet valid, test inputs.
We propose a test input validation approach for vision-based DL systems.
arXiv Detail & Related papers (2025-01-03T02:50:43Z)
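As a purely generic illustration of the active-learning idea (uncertainty sampling, not necessarily the paper's specific approach), a validity classifier can be trained on a small human-labelled seed set and then iteratively query the inputs it is least certain about:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Generic uncertainty-sampling sketch (not the paper's exact method):
# train a validity classifier on a small human-labelled seed set, then
# repeatedly ask the human to label the input it is least sure about.
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 32))      # stand-in for image features
oracle = (features[:, 0] > 0).astype(int)  # stand-in for human judgments

order = np.argsort(features[:, 0])
labelled = list(order[:5]) + list(order[-5:])  # seed set with both classes

for _ in range(20):
    clf = LogisticRegression().fit(features[labelled], oracle[labelled])
    uncertainty = -np.abs(clf.predict_proba(features)[:, 1] - 0.5)
    uncertainty[labelled] = -np.inf        # never re-query labelled inputs
    labelled.append(int(np.argmax(uncertainty)))  # "ask the human"

print(f"validator accuracy: {clf.score(features, oracle):.2f}")
```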
- Enriching Automatic Test Case Generation by Extracting Relevant Test Inputs from Bug Reports [8.85274953789614]
name is a technique for exploring bug reports to identify input values that can be fed to automatic test generation tools.
For Defects4J projects, our study has shown that name successfully extracted 68.68% of the relevant inputs when using regular expressions in its approach.
arXiv Detail & Related papers (2023-12-22T18:19:33Z)
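The tool's name is anonymised above, but the core idea of mining literal values from bug-report text with regular expressions can be sketched as follows (the report text and patterns are made up for illustration):

```python
import re

# Illustrative regular expressions for mining candidate test inputs from
# bug-report text (the report and patterns below are made up).
report = '''Crash when parsing "2023-12-22" with timeout=5000 ms.
Steps: call parse("1e-7") then flush(). Error code 0x1F.'''

patterns = {
    "quoted_string": r'"([^"]*)"',
    "number": r"\b\d+(?:\.\d+)?(?:[eE][+-]?\d+)?\b",
    "hex_literal": r"\b0x[0-9A-Fa-f]+\b",
}

for kind, pattern in patterns.items():
    print(kind, "->", re.findall(pattern, report))
```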
- Test Generation Strategies for Building Failure Models and Explaining Spurious Failures [4.995172162560306]
Test inputs fail not only when the system under test is faulty but also when the inputs are invalid or unrealistic.
We propose to build failure models for inferring interpretable rules on test inputs that cause spurious failures.
We show that our proposed surrogate-assisted approach generates failure models with an average accuracy of 83%.
arXiv Detail & Related papers (2023-12-09T18:36:15Z)
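As a hedged illustration of the failure-model idea (a generic decision-tree surrogate, not the paper's exact algorithm), interpretable rules can be inferred from test inputs labelled as causing spurious or genuine failures:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hedged sketch of a failure model as a decision-tree surrogate (not the
# paper's exact algorithm): fit an interpretable classifier on test-input
# parameters labelled spurious vs. genuine, then read off the rules.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(400, 2))     # e.g. [speed, sensor_noise]
spurious = (X[:, 1] > 80).astype(int)      # pretend: high noise => spurious

tree = DecisionTreeClassifier(max_depth=2).fit(X, spurious)
print(export_text(tree, feature_names=["speed", "sensor_noise"]))
```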
- Better Practices for Domain Adaptation [62.70267990659201]
Domain adaptation (DA) aims to provide frameworks for adapting models to deployment data without using labels.
The lack of a clear validation protocol for DA has led to bad practices in the literature.
We show challenges across all three branches of domain adaptation methodology.
arXiv Detail & Related papers (2023-09-07T17:44:18Z)
- From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z)
- Provable Robustness for Streaming Models with a Sliding Window [51.85182389861261]
In deep learning applications such as online content recommendation and stock market analysis, models use historical data to make predictions.
We derive robustness certificates for models that use a fixed-size sliding window over the input stream.
Our guarantees hold for the average model performance across the entire stream and are independent of stream size, making them suitable for large data streams.
arXiv Detail & Related papers (2023-03-28T21:02:35Z)
- Comparing Shape-Constrained Regression Algorithms for Data Validation [0.0]
Industrial and scientific applications handle large volumes of data that render manual validation by humans infeasible.
In this work, we compare different shape-constrained regression algorithms for the purpose of data validation based on their classification accuracy and runtime performance.
arXiv Detail & Related papers (2022-09-20T10:31:20Z)
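One hedged example of shape-constrained data validation (a plausible setup, not necessarily the paper's benchmark): a signal known to increase monotonically can be checked by fitting an isotonic regression and flagging samples with large residuals.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# One possible shape-constrained validation setup (illustrative, not the
# paper's benchmark): if a signal is known to increase monotonically, fit
# an isotonic regression and flag samples with large residuals as suspect.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 200)
y = x + rng.normal(scale=0.1, size=x.size)
y[50] = -3.0                               # inject a faulty measurement

fit = IsotonicRegression(increasing=True).fit(x, y)
residuals = np.abs(y - fit.predict(x))
print("suspect indices:", np.where(residuals > 5 * residuals.std())[0])
```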
- Generating and Detecting True Ambiguity: A Forgotten Danger in DNN Supervision Testing [8.210473195536077]
We propose a novel way to generate ambiguous inputs to test Deep Neural Networks (DNNs).
In particular, we propose AmbiGuess to generate ambiguous samples for image classification problems.
We find that the supervisors best suited to detecting true ambiguity perform worse on invalid, out-of-distribution, and adversarial inputs, and vice versa.
arXiv Detail & Related papers (2022-07-21T14:21:34Z)
- CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed as Class-Aware Feature Alignment (CAFA), which simultaneously encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
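The summary does not give CAFA's loss; the sketch below shows one generic form of class-conditional feature alignment (pull pseudo-labelled target features toward per-class source centroids) and should be read as an assumption-laden illustration rather than CAFA's actual objective:

```python
import torch

# Assumption-laden sketch of class-conditional feature alignment (one
# generic form; CAFA's published objective may differ): pull pseudo-
# labelled target features toward per-class source centroids, so the
# adapted representations stay class-discriminative.
num_classes, dim = 10, 64
source_centroids = torch.randn(num_classes, dim)  # assumed precomputed

def alignment_loss(features, pseudo_labels):
    # mean squared distance between each feature and its class centroid
    return ((features - source_centroids[pseudo_labels]) ** 2).mean()

feats = torch.randn(32, dim, requires_grad=True)  # stand-in model features
labels = torch.randint(0, num_classes, (32,))     # pseudo-labels at test time
alignment_loss(feats, labels).backward()          # gradient step adapts model
```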
- Distribution-Aware Testing of Neural Networks Using Generative Models [5.618419134365903]
The reliability of software that has a Deep Neural Network (DNN) as a component is critically important.
We show that three recent testing techniques generate a significant number of invalid test inputs.
We propose a technique to incorporate the valid input space of the DNN model under test in the test generation process.
arXiv Detail & Related papers (2021-02-26T17:18:21Z)
- Unsupervised Domain Adaptation for Speech Recognition via Uncertainty Driven Self-Training [55.824641135682725]
Domain adaptation experiments using WSJ as a source domain and TED-LIUM 3 as well as SWITCHBOARD show that up to 80% of the performance of a system trained on ground-truth data can be recovered.
arXiv Detail & Related papers (2020-11-26T18:51:26Z)
- Improving Input-Output Linearizing Controllers for Bipedal Robots via Reinforcement Learning [85.13138591433635]
The main drawbacks of input-output linearizing controllers are the need for precise dynamics models and the inability to account for input constraints.
In this paper, we address both challenges for the specific case of bipedal robot control by the use of reinforcement learning techniques.
arXiv Detail & Related papers (2020-04-15T18:15:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.