TEASMA: A Practical Methodology for Test Adequacy Assessment of Deep Neural Networks
- URL: http://arxiv.org/abs/2308.01311v4
- Date: Wed, 09 Oct 2024 20:37:13 GMT
- Title: TEASMA: A Practical Methodology for Test Adequacy Assessment of Deep Neural Networks
- Authors: Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand, Dayi Lin
- Abstract summary: TEASMA is a comprehensive and practical methodology designed to accurately assess the adequacy of test sets for Deep Neural Networks.
We evaluate TEASMA with four state-of-the-art test adequacy metrics: Distance-based Surprise Coverage (DSC), Likelihood-based Surprise Coverage (LSC), Input Distribution Coverage (IDC), and Mutation Score (MS).
- Score: 4.528286105252983
- License:
- Abstract: Successful deployment of Deep Neural Networks (DNNs) requires their validation with an adequate test set to ensure a sufficient degree of confidence in test outcomes. Although well-established test adequacy assessment techniques have been proposed for DNNs, we still need to investigate their application within a comprehensive methodology for accurately predicting the fault detection ability of test sets and thus assessing their adequacy. In this paper, we propose and evaluate TEASMA, a comprehensive and practical methodology designed to accurately assess the adequacy of test sets for DNNs. In practice, TEASMA allows engineers to decide whether they can trust high-accuracy test results and thus validate the DNN before its deployment. Based on a DNN model's training set, TEASMA provides a procedure to build accurate DNN-specific prediction models of the Fault Detection Rate (FDR) of a test set using an existing adequacy metric, thus enabling its assessment. We evaluated TEASMA with four state-of-the-art test adequacy metrics: Distance-based Surprise Coverage (DSC), Likelihood-based Surprise Coverage (LSC), Input Distribution Coverage (IDC), and Mutation Score (MS). Our extensive empirical evaluation across multiple DNN models and input sets such as ImageNet reveals a strong linear correlation between the predicted and actual FDR values derived from MS, DSC, and IDC, with minimum R^2 values of 0.94 for MS and 0.90 for DSC and IDC. Furthermore, a low average Root Mean Square Error (RMSE) of 9% between actual and predicted FDR values across all subjects, when relying on regression analysis and MS, demonstrates the latter's superior accuracy when compared to DSC and IDC, with RMSE values of 0.17 and 0.18, respectively. Overall, these results suggest that TEASMA provides a reliable basis for confidently deciding whether to trust test results for DNN models.
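To make the procedure described in the abstract concrete, the following is a minimal Python sketch of the regression step TEASMA relies on: fitting a DNN-specific model that predicts FDR from an adequacy metric (Mutation Score here) measured on subsets sampled from the training set, then using it to assess a candidate test set. The helpers `measure_ms` and `measure_fdr` and the adequacy threshold are hypothetical placeholders; the authors' implementation may use other regression models and additional steps.

```python
# Hedged sketch of TEASMA's core idea: fit a DNN-specific regression model that
# predicts the Fault Detection Rate (FDR) of a test set from an adequacy metric
# (here Mutation Score, MS), using subsets sampled from the training set.
# `measure_ms` and `measure_fdr` are hypothetical helpers assumed to exist.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

def build_fdr_prediction_model(training_subsets, measure_ms, measure_fdr):
    """Fit FDR = f(MS) on subsets drawn from the training set."""
    ms = np.array([measure_ms(s) for s in training_subsets]).reshape(-1, 1)
    fdr = np.array([measure_fdr(s) for s in training_subsets])
    model = LinearRegression().fit(ms, fdr)
    predicted = model.predict(ms)
    r2 = r2_score(fdr, predicted)                      # goodness of fit
    rmse = float(np.sqrt(mean_squared_error(fdr, predicted)))
    return model, r2, rmse

def assess_test_set(model, test_set, measure_ms, threshold=0.8):
    """Predict the FDR of a candidate test set and flag it as adequate or not."""
    predicted_fdr = model.predict([[measure_ms(test_set)]])[0]
    return predicted_fdr, predicted_fdr >= threshold   # threshold is illustrative
```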
Related papers
- DeepSample: DNN sampling-based testing for operational accuracy assessment [12.029919627622954]
Deep Neural Networks (DNNs) are core components for classification and regression tasks in many software systems.
The challenge is to select a representative set of test inputs that is as small as possible, in order to reduce the labelling cost.
This study presents DeepSample, a family of DNN testing techniques for cost-effective accuracy assessment.
arXiv Detail & Related papers (2024-03-28T09:56:26Z)
- DeepKnowledge: Generalisation-Driven Deep Learning Testing [2.526146573337397]
DeepKnowledge is a systematic testing methodology for DNN-based systems.
It aims to enhance robustness and reduce the residual risk of 'black box' models.
We report improvements of up to 10 percentage points over state-of-the-art coverage criteria for detecting adversarial attacks.
arXiv Detail & Related papers (2024-03-25T13:46:09Z)
- Evaluation of Out-of-Distribution Detection Performance on Autonomous Driving Datasets [5.000404730573809]
Safety measures need to be systematically investigated to determine to what extent they evaluate the intended performance of Deep Neural Networks (DNNs).
This work evaluates rejecting outputs from semantic segmentation DNNs by applying a Mahalanobis distance (MD), computed from the most probable class-conditional Gaussian distribution for the predicted class, as an OOD score (see the sketch after this entry).
The applicability of our findings will support legitimizing safety measures and motivate their usage when arguing for safe usage of DNNs in automotive perception.
arXiv Detail & Related papers (2024-01-30T13:49:03Z)
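A minimal sketch of the Mahalanobis-distance OOD score summarized in the entry above, assuming per-class Gaussians with class means and a shared (tied) covariance estimated from training features; the paper's exact feature extraction and thresholding pipeline is not reproduced here.

```python
# Hedged sketch: fit one Gaussian per class over training feature vectors, then
# score a new feature by its Mahalanobis distance to the Gaussian of the
# predicted (most probable) class; large distances suggest OOD inputs.

import numpy as np

def fit_class_gaussians(features, labels, num_classes, eps=1e-6):
    """Estimate per-class means and a shared (tied) covariance (a common choice)."""
    dim = features.shape[1]
    means = np.zeros((num_classes, dim))
    cov = np.zeros((dim, dim))
    for c in range(num_classes):
        fc = features[labels == c]
        means[c] = fc.mean(axis=0)
        cov += (fc - means[c]).T @ (fc - means[c])
    cov /= len(features)
    precision = np.linalg.inv(cov + eps * np.eye(dim))
    return means, precision

def mahalanobis_ood_score(feature, predicted_class, means, precision):
    """Distance to the Gaussian of the predicted class; larger means more OOD."""
    diff = feature - means[predicted_class]
    return float(diff @ precision @ diff)

def reject(feature, predicted_class, means, precision, threshold):
    """Reject the DNN's output when the OOD score exceeds a chosen threshold."""
    return mahalanobis_ood_score(feature, predicted_class, means, precision) > threshold
```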
- Conservative Prediction via Data-Driven Confidence Minimization [70.93946578046003]
In safety-critical applications of machine learning, it is often desirable for a model to be conservative.
We propose the Data-Driven Confidence Minimization framework, which minimizes confidence on an uncertainty dataset (see the sketch after this entry).
arXiv Detail & Related papers (2023-06-08T07:05:36Z)
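A minimal sketch of the confidence-minimization idea summarized above, under the assumption that "minimizing confidence on an uncertainty dataset" is realized by penalizing the average maximum softmax probability on that dataset alongside the usual supervised loss; the weight `lambda_unc` is illustrative and the authors' exact objective may differ.

```python
# Hedged sketch: train the classifier as usual on labeled data while penalizing
# high confidence on a separate "uncertainty" dataset.

import torch
import torch.nn.functional as F

def ddcm_style_loss(model, labeled_x, labeled_y, uncertainty_x, lambda_unc=1.0):
    # Standard supervised loss on labeled, in-distribution data.
    ce = F.cross_entropy(model(labeled_x), labeled_y)

    # Average maximum softmax probability on the uncertainty dataset; adding it
    # to the loss pushes the model toward low-confidence predictions there.
    probs = F.softmax(model(uncertainty_x), dim=1)
    confidence = probs.max(dim=1).values.mean()

    return ce + lambda_unc * confidence
```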
- D-Score: A White-Box Diagnosis Score for CNNs Based on Mutation Operators [8.977819892091]
Convolutional neural networks (CNNs) have been widely applied in many safety-critical domains, such as autonomous driving and medical diagnosis.
We propose a white-box diagnostic approach that uses mutation operators and image transformations to compute the model's feature and attention distributions.
We also propose a D-Score-based data augmentation method to improve the CNN's robustness to translations and rescalings.
arXiv Detail & Related papers (2023-04-03T03:13:59Z)
- Towards Reliable Medical Image Segmentation by utilizing Evidential Calibrated Uncertainty [52.03490691733464]
We introduce DEviS, an easily implementable foundational model that seamlessly integrates into various medical image segmentation networks.
By leveraging subjective logic theory, we explicitly model probability and uncertainty for the problem of medical image segmentation.
DEviS incorporates an uncertainty-aware filtering module, which utilizes the metric of uncertainty-calibrated error to filter reliable data (see the sketch after this entry).
arXiv Detail & Related papers (2023-01-01T05:02:46Z)
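A minimal sketch of the subjective-logic (evidential) uncertainty and filtering idea summarized above, using the common Dirichlet-evidence formulation; this is not necessarily the exact DEviS design, and the filtering threshold is illustrative.

```python
# Hedged sketch: treat non-negative network outputs as class evidence, derive
# Dirichlet parameters, and use the resulting uncertainty to keep only reliable
# predictions.

import torch
import torch.nn.functional as F

def evidential_probs_and_uncertainty(logits):
    """Map logits to evidence, Dirichlet parameters, probabilities, uncertainty."""
    evidence = F.softplus(logits)             # non-negative evidence per class
    alpha = evidence + 1.0                    # Dirichlet concentration parameters
    strength = alpha.sum(dim=1, keepdim=True)
    probs = alpha / strength                  # expected class probabilities
    num_classes = logits.shape[1]
    uncertainty = num_classes / strength      # vacuity: high when evidence is low
    return probs, uncertainty.squeeze(1)

def filter_reliable(logits, threshold=0.5):
    """Keep only predictions whose uncertainty falls below a threshold."""
    probs, uncertainty = evidential_probs_and_uncertainty(logits)
    reliable = uncertainty < threshold
    return probs[reliable], reliable
```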
- Provably Robust Detection of Out-of-distribution Data (almost) for free [124.14121487542613]
Deep neural networks are known to produce highly overconfident predictions on out-of-distribution (OOD) data.
In this paper we propose a novel method that, from first principles, combines a certifiable OOD detector with a standard classifier into an OOD-aware classifier.
In this way we achieve the best of both worlds: certifiably adversarially robust OOD detection, even for OOD samples close to the in-distribution, with no loss in prediction accuracy and close to state-of-the-art OOD detection performance for non-manipulated OOD data.
arXiv Detail & Related papers (2021-06-08T11:40:49Z)
- Uncertainty-Aware Deep Calibrated Salient Object Detection [74.58153220370527]
Existing deep neural network based salient object detection (SOD) methods mainly focus on pursuing high network accuracy.
These methods overlook the gap between network accuracy and prediction confidence, known as the confidence uncalibration problem.
We introduce an uncertainty-aware deep SOD network and propose two strategies to prevent deep SOD networks from becoming overconfident.
arXiv Detail & Related papers (2020-12-10T23:28:36Z)
- UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model.
UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data.
We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic steatohepatitis (NASH) and Alzheimer's disease (AD).
UNITE achieves up to 0.841 in F1 score for AD detection, up to 0.609 in PR-AUC for NASH detection, and outperforms various state-of-the-art baselines by up to 19% over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
- CovidDeep: SARS-CoV-2/COVID-19 Test Based on Wearable Medical Sensors and Efficient Neural Networks [51.589769497681175]
The novel coronavirus (SARS-CoV-2) has led to a pandemic.
The current testing regime based on Reverse Transcription-Polymerase Chain Reaction for SARS-CoV-2 has been unable to keep up with testing demands.
We propose a framework called CovidDeep that combines efficient DNNs with commercially available wearable medical sensors (WMSs) for pervasive testing of the virus.
arXiv Detail & Related papers (2020-07-20T21:47:28Z)
- Increasing Trustworthiness of Deep Neural Networks via Accuracy Monitoring [20.456742449675904]
Inference accuracy of deep neural networks (DNNs) is a crucial performance metric, but in practice it can vary greatly depending on the actual test data.
This has raised significant concerns about the trustworthiness of DNNs, especially in safety-critical applications.
We propose a neural network-based accuracy monitor model that takes only the deployed DNN's softmax probability output as its input (see the sketch after this entry).
arXiv Detail & Related papers (2020-07-03T03:09:36Z)
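A minimal sketch of an accuracy monitor of the kind summarized above: a small network that receives only the deployed DNN's softmax output and estimates whether the prediction is likely correct. The architecture and training setup shown are assumptions for illustration, not the paper's exact design.

```python
# Hedged sketch: a small monitor network over the deployed DNN's softmax output.

import torch
import torch.nn as nn

class AccuracyMonitor(nn.Module):
    def __init__(self, num_classes, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_classes, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, softmax_probs):
        # Estimated probability that the deployed DNN's prediction is correct.
        return torch.sigmoid(self.net(softmax_probs)).squeeze(-1)

# Training idea: on held-out labeled data, label each sample 1 if the deployed
# DNN's prediction was correct and 0 otherwise, then fit the monitor on the
# softmax outputs with a binary cross-entropy loss (e.g., nn.BCELoss).
```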
This list is automatically generated from the titles and abstracts of the papers on this site.