Smoke Testing for Machine Learning: Simple Tests to Discover Severe Defects
- URL: http://arxiv.org/abs/2009.01521v2
- Date: Fri, 29 Oct 2021 07:15:39 GMT
- Title: Smoke Testing for Machine Learning: Simple Tests to Discover Severe Defects
- Authors: Steffen Herbold, Tobias Haar
- Abstract summary: We try to determine generic and simple smoke tests that can be used to assert that basic functions can be executed without crashing.
We were able to find bugs in all three machine learning libraries that we tested and severe bugs in two of the three libraries.
- Score: 7.081604594416339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning is nowadays a standard technique for data analysis within
software applications. Software engineers need quality assurance techniques
that are suitable for these new kinds of systems. Within this article, we
discuss the question of whether standard software testing techniques that have
been part of textbooks for decades are also useful for testing machine
learning software. Concretely, we try to determine generic and simple smoke
tests that can be used to assert that basic functions can be executed without
crashing. We found that we can derive such tests using techniques similar to
equivalence classes and boundary value analysis. Moreover, we found that these
concepts can also be applied to hyperparameters, to further improve the quality
of the smoke tests. Even though our approach is almost trivial, we were able to
find bugs in all three machine learning libraries that we tested and severe
bugs in two of the three libraries. This demonstrates that common software
testing techniques are still valid in the age of machine learning and that
considering how they can be adapted to this new context can help find and
prevent severe bugs, even in mature machine learning libraries.
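As an illustration of the kind of test the abstract describes, the following is a minimal, hypothetical sketch in the style of a pytest smoke test; it uses scikit-learn as a stand-in library (the abstract does not name the three libraries tested), and the boundary-value datasets and hyperparameter settings are illustrative assumptions, not the paper's actual test suite.

```python
# Minimal, hypothetical smoke test in the spirit of the abstract.
# Assumptions: scikit-learn stands in for the (unnamed) libraries under test;
# the boundary-value datasets and hyperparameter settings are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def boundary_datasets():
    """Yield (X, y) pairs built from rough equivalence classes and boundary values."""
    rng = np.random.default_rng(0)
    y = np.array([0, 1] * 25)
    yield rng.normal(size=(50, 3)), y                 # "ordinary" numeric data
    yield np.zeros((50, 3)), y                        # constant (zero-variance) features
    yield np.full((50, 3), 1e6), y                    # very large feature values
    yield rng.normal(size=(2, 3)), np.array([0, 1])   # minimal number of samples


def test_fit_predict_does_not_crash():
    # Hyperparameters at or near the edges of their valid ranges.
    estimators = [
        LogisticRegression(C=1e-10, max_iter=10),
        LogisticRegression(C=1e10),
        DecisionTreeClassifier(max_depth=1),
    ]
    for est in estimators:
        for X, y in boundary_datasets():
            est.fit(X, y)    # the smoke test only asserts "no exception raised"
            est.predict(X)
```

Every fit/predict call only has to complete without raising an exception; any crash surfaces as a test failure, which is exactly the class of severe defect the abstract is concerned with.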
Related papers
- A Comprehensive Study on Automated Testing with the Software Lifecycle [0.6144680854063939]
The research examines how automated testing makes it easier to evaluate software quality, how it saves time compared to manual testing, and how the two approaches differ in terms of benefits and drawbacks.
Automated testing tools simplify the process of testing software applications, allow it to be tailored to specific testing situations, and enable it to be carried out successfully.
arXiv Detail & Related papers (2024-05-02T06:30:37Z) - Automatic Static Bug Detection for Machine Learning Libraries: Are We There Yet? [14.917820383894124]
We analyze five popular and widely used static bug detectors, i.e., Flawfinder, RATS, Cppcheck, Facebook Infer, and Clang, on a curated dataset of software bugs.
Overall, our study shows that static bug detectors find only a negligible share of all bugs (6/410 bugs, roughly 1.5%); Flawfinder and RATS are the most effective static checkers for finding software bugs in machine learning libraries.
arXiv Detail & Related papers (2023-07-09T01:38:52Z) - A Survey of Machine Unlearning [56.017968863854186]
Recent regulations now require that, on request, private information about a user must be removed from computer systems.
ML models often 'remember' the old data.
Recent works on machine unlearning have not been able to completely solve the problem.
arXiv Detail & Related papers (2022-09-06T08:51:53Z) - Software Testing for Machine Learning [13.021014899410684]
Machine learning has been shown to be susceptible to deception, leading to errors and even fatal failures.
This circumstance calls into question the widespread use of machine learning, especially in safety-critical applications.
This summary talk discusses the current state-of-the-art of software testing for machine learning.
arXiv Detail & Related papers (2022-04-30T08:47:10Z) - Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z) - Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control [67.52000805944924]
Learn then Test (LTT) is a framework for calibrating machine learning models.
Our main insight is to reframe the risk-control problem as multiple hypothesis testing.
We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision.
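The summary's reframing can be sketched roughly as follows; the miss-rate loss, Hoeffding-based p-value, and Bonferroni correction are illustrative assumptions rather than the paper's exact construction.

```python
# Rough sketch of "risk control as multiple hypothesis testing", as understood
# from the summary. Assumptions (not the paper's exact recipe): a miss-rate
# loss, a Hoeffding-based p-value, and a Bonferroni correction over candidates.
import numpy as np


def hoeffding_p_value(losses, alpha):
    """P-value for H0: expected loss > alpha, given losses bounded in [0, 1]."""
    n = len(losses)
    gap = alpha - np.mean(losses)
    return 1.0 if gap <= 0 else float(np.exp(-2 * n * gap ** 2))


def calibrate_thresholds(scores, labels, lambdas, alpha=0.1, delta=0.05):
    """Keep the thresholds whose miss rate is certified to stay below alpha."""
    valid = []
    for lam in lambdas:
        # Loss = 1 when a positive example's score falls below the threshold.
        losses = ((scores < lam) & (labels == 1)).astype(float)
        if hoeffding_p_value(losses, alpha) <= delta / len(lambdas):  # Bonferroni
            valid.append(lam)
    return valid


# Synthetic calibration data: positives score in [0.7, 1.0], negatives in [0.0, 0.3].
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
scores = 0.7 * labels + 0.3 * rng.random(2000)
print(calibrate_thresholds(scores, labels, np.linspace(0, 1, 21).tolist()))
```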
arXiv Detail & Related papers (2021-10-03T17:42:03Z) - Discovering Boundary Values of Feature-based Machine Learning Classifiers through Exploratory Datamorphic Testing [7.8729820663730035]
This paper proposes a set of testing strategies for testing machine learning applications in the framework of the datamorphism testing methodology.
Three variants of exploratory strategies are presented with the algorithms implemented in the automated datamorphic testing tool Morphy.
Their capability and cost of discovering boundaries between classes are evaluated via a set of controlled experiments with manually designed subjects and a set of case studies with real machine learning models.
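The strategies themselves are not described in the summary, but the flavour of boundary discovery can be illustrated with a toy bisection between two differently classified points; this is a hypothetical sketch, not the algorithm implemented in Morphy.

```python
# Toy illustration of exploring a classifier's decision boundary by bisection;
# this is a hypothetical sketch, not the algorithm implemented in Morphy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def bisect_boundary(clf, a, b, tol=1e-6):
    """a and b must receive different labels; returns a point near the boundary."""
    label_a = clf.predict(a.reshape(1, -1))[0]
    while np.linalg.norm(b - a) > tol:
        mid = (a + b) / 2
        if clf.predict(mid.reshape(1, -1))[0] == label_a:
            a = mid   # boundary lies between mid and b
        else:
            b = mid   # boundary lies between a and mid
    return (a + b) / 2


X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)
p0, p1 = X[y == 0][0], X[y == 1][0]
# Only meaningful if the classifier assigns the two points different labels.
if clf.predict(p0.reshape(1, -1))[0] != clf.predict(p1.reshape(1, -1))[0]:
    print(bisect_boundary(clf, p0, p1))
```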
arXiv Detail & Related papers (2021-10-01T11:47:56Z) - Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning [66.59455427102152]
We introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks.
Each baseline is a self-contained experiment pipeline with easily reusable and extendable components.
We provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results.
arXiv Detail & Related papers (2021-06-07T23:57:32Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
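How such a pass rate might be computed can be sketched as follows; this is a simplified stand-in, not the actual APPS evaluation harness, which sandboxes execution and handles far more edge cases.

```python
# Simplified stand-in for computing a test-case pass rate; the real APPS harness
# sandboxes execution and handles many more edge cases than this sketch does.
import subprocess
import sys
import tempfile


def pass_rate(solution_path, io_pairs, timeout=4):
    """Fraction of (stdin, expected_stdout) pairs a candidate program passes."""
    passed = 0
    for stdin_text, expected in io_pairs:
        try:
            result = subprocess.run(
                [sys.executable, solution_path], input=stdin_text,
                capture_output=True, text=True, timeout=timeout)
            ok = result.returncode == 0 and result.stdout.strip() == expected.strip()
        except subprocess.TimeoutExpired:
            ok = False  # count timeouts as failures
        passed += ok
    return passed / len(io_pairs) if io_pairs else 0.0


# Tiny self-contained check with a hand-written "generated" solution.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print(sum(int(x) for x in input().split()))\n")
print(pass_rate(f.name, [("1 2 3\n", "6"), ("10 20\n", "30")]))
```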
arXiv Detail & Related papers (2021-05-20T17:58:42Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
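A loose interpretation of that idea, substituting a random-forest surrogate for the Bayesian neural network and omitting the active selection of which points to label, might look as follows; it is a reading of the summary, not the ALT-MAS method.

```python
# Loose interpretation of the stated idea: fit a surrogate on the small labeled
# test set and score the model-under-test against surrogate labels on the rest.
# A random forest stands in for the Bayesian neural network, and the active
# selection of which points to label is omitted entirely.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
model_under_test = LogisticRegression().fit(X[:1000], y[:1000])

# Of the remaining data, only a small slice counts as labeled test data.
X_pool, y_pool = X[1000:], y[1000:]
labeled = np.arange(100)

surrogate = RandomForestClassifier(n_estimators=200, random_state=0)
surrogate.fit(X_pool[labeled], y_pool[labeled])

# Estimate the accuracy of the model-under-test from surrogate-predicted labels.
estimated = np.mean(model_under_test.predict(X_pool) == surrogate.predict(X_pool))
actual = np.mean(model_under_test.predict(X_pool) == y_pool)
print(f"estimated accuracy {estimated:.3f} vs. actual accuracy {actual:.3f}")
```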
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - Automated Content Grading Using Machine Learning [0.0]
This research project is a primitive experiment in automating the grading of theoretical exam answers written by students in technical courses.
We show how machine learning algorithms can be used to automatically examine and grade theoretical content in exam answer papers.
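The summary does not describe the pipeline, so the following is only a naive, hypothetical baseline: grade an answer by its TF-IDF cosine similarity to a reference answer.

```python
# Naive, hypothetical baseline only; the paper's actual grading pipeline is not
# described in the summary. Grades by TF-IDF cosine similarity to a reference.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def grade(student_answer, reference_answer, max_marks=10):
    """Award marks proportional to similarity with the reference answer."""
    vec = TfidfVectorizer().fit([reference_answer, student_answer])
    sim = cosine_similarity(vec.transform([student_answer]),
                            vec.transform([reference_answer]))[0, 0]
    return round(sim * max_marks, 1)


print(grade("Gradient descent minimises a loss by following its negative gradient.",
            "Gradient descent is an iterative method that minimises a loss function "
            "by taking steps in the direction of the negative gradient."))
```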
arXiv Detail & Related papers (2020-04-08T23:46:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.