Using Quality Attribute Scenarios for ML Model Test Case Generation
- URL: http://arxiv.org/abs/2406.08575v1
- Date: Wed, 12 Jun 2024 18:26:42 GMT
- Title: Using Quality Attribute Scenarios for ML Model Test Case Generation
- Authors: Rachel Brower-Sinning, Grace A. Lewis, Sebastián Echeverría, Ipek Ozkaya
- Abstract summary: Current practice for machine learning (ML) model testing prioritizes testing for model performance.
This paper presents an approach based on quality attribute (QA) scenarios to elicit and define system- and model-relevant test cases.
The QA-based approach has been integrated into MLTE, a process and tool to support ML model test and evaluation.
- Score: 3.9111051646728527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Testing of machine learning (ML) models is a known challenge identified by researchers and practitioners alike. Unfortunately, current practice for ML model testing prioritizes testing for model performance, while often neglecting the requirements and constraints of the ML-enabled system that integrates the model. This limited view of testing leads to failures during integration, deployment, and operations, contributing to the difficulties of moving models from development to production. This paper presents an approach based on quality attribute (QA) scenarios to elicit and define system- and model-relevant test cases for ML models. The QA-based approach described in this paper has been integrated into MLTE, a process and tool to support ML model test and evaluation. Feedback from users of MLTE highlights its effectiveness in testing beyond model performance and identifying failures early in the development process.
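To make the QA-scenario idea concrete, here is a minimal sketch in Python of how such a scenario could be captured and turned into an executable, system-relevant test case. The `QAScenario` structure, the example latency budget, and the `test_inference_latency` helper are illustrative assumptions for this listing, not MLTE's actual API or the authors' implementation.
```python
# A minimal sketch of turning a quality attribute (QA) scenario into an
# executable test case. The QAScenario structure, the latency budget, and
# predict_fn are illustrative assumptions -- this is not the MLTE API.
import time
from dataclasses import dataclass
from statistics import quantiles
from typing import Callable, Sequence


@dataclass
class QAScenario:
    """Six-part QA scenario (source, stimulus, environment, artifact,
    response, response measure), as used in software architecture practice."""
    source: str
    stimulus: str
    environment: str
    artifact: str
    response: str
    response_measure: str


# Hypothetical performance scenario for an image classification model.
latency_scenario = QAScenario(
    source="operational client",
    stimulus="single inference request",
    environment="production-like CPU host",
    artifact="trained image classification model",
    response="prediction returned",
    response_measure="95th percentile latency <= 200 ms",
)


def test_inference_latency(
    predict_fn: Callable[[object], object],
    samples: Sequence[object],
    p95_budget_ms: float = 200.0,
) -> bool:
    """Test case derived from the scenario's response measure: run the model
    on representative inputs and check the 95th percentile latency."""
    latencies_ms = []
    for sample in samples:
        start = time.perf_counter()
        predict_fn(sample)
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    # quantiles(..., n=20) returns 19 cut points; index 18 is ~p95.
    return quantiles(latencies_ms, n=20)[18] <= p95_budget_ms
```
The same scenario structure could drive other system-level checks (memory footprint, robustness to perturbed inputs, throughput under load) that go beyond model accuracy, which is the gap the paper targets.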
Related papers
- Context-Aware Testing: A New Paradigm for Model Testing with Large Language Models [49.06068319380296]
We introduce context-aware testing (CAT), which uses context as an inductive bias to guide the search for meaningful model failures.
We instantiate the first CAT system, SMART Testing, which employs large language models to hypothesize relevant and likely failures.
arXiv Detail & Related papers (2024-10-31T15:06:16Z)
- SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists [59.08999823652293]
We propose SYNTHEVAL to generate a wide range of test types for a comprehensive evaluation of NLP models.
In the last stage, human experts investigate the challenging examples, manually design templates, and identify the types of failures the task-specific models consistently exhibit.
We apply SYNTHEVAL to two classification tasks, sentiment analysis and toxic language detection, and show that our framework is effective in identifying weaknesses of strong models on these tasks.
arXiv Detail & Related papers (2024-08-30T17:41:30Z)
- Outline of an Independent Systematic Blackbox Test for ML-based Systems [0.0]
This article proposes a test procedure that can be used to test ML models and ML-based systems independently of the actual training process.
In this way, the typical quality statements such as accuracy and precision of these models and systems can be verified independently.
arXiv Detail & Related papers (2024-01-30T14:41:28Z)
- Test Generation Strategies for Building Failure Models and Explaining Spurious Failures [4.995172162560306]
Test inputs fail not only when the system under test is faulty but also when the inputs are invalid or unrealistic.
We propose building failure models that infer interpretable rules characterizing the test inputs that cause spurious failures (see the illustrative sketch after this list).
We show that our proposed surrogate-assisted approach generates failure models with an average accuracy of 83%.
arXiv Detail & Related papers (2023-12-09T18:36:15Z)
- Continuous Management of Machine Learning-Based Application Behavior [3.316045828362788]
Non-functional properties of Machine Learning models must be monitored, verified, and maintained.
We propose a multi-model approach that aims to guarantee a stable non-functional behavior of ML-based applications.
We experimentally evaluate our solution in a real-world scenario, focusing on the non-functional property of fairness.
arXiv Detail & Related papers (2023-11-21T15:47:06Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
- Learning continuous models for continuous physics [94.42705784823997]
We develop a test based on numerical analysis theory to validate machine learning models for science and engineering applications.
Our results illustrate how principled numerical analysis methods can be coupled with existing ML training/testing methodologies to validate models for science and engineering applications.
arXiv Detail & Related papers (2022-02-17T07:56:46Z)
- Active Surrogate Estimators: An Active Learning Approach to Label-Efficient Model Evaluation [59.7305309038676]
We propose Active Surrogate Estimators (ASEs) for model evaluation.
We find that ASEs offer greater label-efficiency than the current state-of-the-art.
arXiv Detail & Related papers (2022-02-14T17:15:18Z)
- Mutation Testing framework for Machine Learning [0.0]
Failures of machine learning models can lead to severe consequences, including loss of life or property.
Developers, scientists, and the ML community around the world must build highly reliable test architectures for critical ML applications.
This article provides an overview of Machine Learning Systems (MLS) testing, covering its evolution, current paradigm, and future work.
arXiv Detail & Related papers (2021-02-19T18:02:31Z)
- DirectDebug: Automated Testing and Debugging of Feature Models [55.41644538483948]
Variability models (e.g., feature models) are a common way to represent the variability and commonality of software artifacts.
Complex and often large-scale feature models can become faulty, i.e., they no longer represent the expected variability properties of the underlying software artifact.
arXiv Detail & Related papers (2021-02-11T11:22:20Z)
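As a concrete illustration of the rule-inference idea in the "Test Generation Strategies for Building Failure Models and Explaining Spurious Failures" entry above, the following sketch trains a shallow decision tree as a surrogate failure model over test-input features labeled as spurious or genuine failures and prints the learned rules. The feature names, the synthetic data, and the labeling rule are assumptions made for this example, not the paper's actual setup.
```python
# Illustrative sketch (not the paper's implementation) of inferring
# interpretable rules that characterize spurious test failures.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical features describing generated test inputs for a simulated
# driving scenario: [fog_density, pedestrian_speed_mps, curve_radius_m].
rng = np.random.default_rng(0)
X = rng.uniform(low=[0.0, 0.0, 5.0], high=[1.0, 4.0, 200.0], size=(500, 3))

# Label 1 = spurious failure (unrealistic input), 0 = genuine failure.
# The labels are faked with a simple rule so the example runs end to end:
# very dense fog combined with implausibly fast pedestrians is "unrealistic".
y = ((X[:, 0] > 0.9) & (X[:, 1] > 3.0)).astype(int)

# A shallow decision tree serves as the interpretable failure model.
failure_model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Engineers can read the rules to filter or repair invalid test inputs
# before spending debugging effort on the system under test.
print(export_text(
    failure_model,
    feature_names=["fog_density", "pedestrian_speed_mps", "curve_radius_m"],
))
```
Keeping the tree shallow keeps the rule set small enough for engineers to review and to apply as a filter on invalid or unrealistic test inputs.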