Smoke Testing for Machine Learning: Simple Tests to Discover Severe Defects
- URL: http://arxiv.org/abs/2009.01521v2
- Date: Fri, 29 Oct 2021 07:15:39 GMT
- Title: Smoke Testing for Machine Learning: Simple Tests to Discover Severe Defects
- Authors: Steffen Herbold, Tobias Haar
- Abstract summary: We try to determine generic and simple smoke tests that can be used to assert that basic functions can be executed without crashing.
We were able to find bugs in all three machine learning libraries that we tested and severe bugs in two of the three libraries.
- Score: 7.081604594416339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine learning is nowadays a standard technique for data analysis within
software applications. Software engineers need quality assurance techniques
that are suitable for these new kinds of systems. Within this article, we
discuss the question of whether standard software testing techniques that have
been part of textbooks for decades are also useful for testing machine
learning software. Concretely, we try to determine generic and simple smoke
tests that can be used to assert that basic functions can be executed without
crashing. We found that we can derive such tests using techniques similar to
equivalence classes and boundary value analysis. Moreover, we found that these
concepts can also be applied to hyperparameters, to further improve the quality
of the smoke tests. Even though our approach is almost trivial, we were able to
find bugs in all three machine learning libraries that we tested and severe
bugs in two of the three libraries. This demonstrates that common software
testing techniques are still valid in the age of machine learning and that
considering how they can be adapted to this new context can help find and
prevent severe bugs, even in mature machine learning libraries.
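As an illustration of the kind of test the abstract describes, the following is a minimal, hypothetical sketch in the style of a pytest smoke test; it uses scikit-learn as a stand-in library (the abstract does not name the three libraries tested), and the boundary-value datasets and hyperparameter settings are illustrative assumptions, not the paper's actual test suite.

```python
# Minimal, hypothetical smoke test in the spirit of the abstract.
# Assumptions: scikit-learn stands in for the (unnamed) libraries under test;
# the boundary-value datasets and hyperparameter settings are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier


def boundary_datasets():
    """Yield (X, y) pairs built from rough equivalence classes and boundary values."""
    rng = np.random.default_rng(0)
    y = np.array([0, 1] * 25)
    yield rng.normal(size=(50, 3)), y                 # "ordinary" numeric data
    yield np.zeros((50, 3)), y                        # constant (zero-variance) features
    yield np.full((50, 3), 1e6), y                    # very large feature values
    yield rng.normal(size=(2, 3)), np.array([0, 1])   # minimal number of samples


def test_fit_predict_does_not_crash():
    # Hyperparameters at or near the edges of their valid ranges.
    estimators = [
        LogisticRegression(C=1e-10, max_iter=10),
        LogisticRegression(C=1e10),
        DecisionTreeClassifier(max_depth=1),
    ]
    for est in estimators:
        for X, y in boundary_datasets():
            est.fit(X, y)    # the smoke test only asserts "no exception raised"
            est.predict(X)
```

Every fit/predict call only has to complete without raising an exception; any crash surfaces as a test failure, which is exactly the class of severe defect the abstract is concerned with.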
Related papers
- A Comprehensive Study on Automated Testing with the Software Lifecycle [0.6144680854063939]
The research examines how automated testing makes it easier to evaluate software quality, how it saves time compared to manual testing, and how the two approaches differ in terms of benefits and drawbacks.
Automated testing tools simplify the process of testing software applications, allow it to be tailored to specific testing situations, and enable it to be carried out successfully.
arXiv Detail & Related papers (2024-05-02T06:30:37Z) - Automatic Static Bug Detection for Machine Learning Libraries: Are We There Yet? [14.917820383894124]
We analyze five popular and widely used static bug detectors, i.e., Flawfinder, RATS, Cppcheck, Facebook Infer, and Clang, on a curated dataset of software bugs.
Overall, our study shows that static bug detectors find only a negligible share of all bugs (6/410 bugs, roughly 1.5%); Flawfinder and RATS are the most effective static checkers for finding software bugs in machine learning libraries.
arXiv Detail & Related papers (2023-07-09T01:38:52Z) - A Survey of Machine Unlearning [56.017968863854186]
Recent regulations now require that, on request, private information about a user must be removed from computer systems.
ML models often 'remember' the old data.
Recent works on machine unlearning have not been able to completely solve the problem.
arXiv Detail & Related papers (2022-09-06T08:51:53Z) - Software Testing for Machine Learning [13.021014899410684]
Machine learning has been shown to be susceptible to deception, leading to errors and even fatal failures.
This circumstance calls into question the widespread use of machine learning, especially in safety-critical applications.
This summary talk discusses the current state-of-the-art of software testing for machine learning.
arXiv Detail & Related papers (2022-04-30T08:47:10Z) - Human-in-the-Loop Disinformation Detection: Stance, Sentiment, or Something Else? [93.91375268580806]
Both politics and pandemics have recently provided ample motivation for the development of machine learning-enabled disinformation (a.k.a. fake news) detection algorithms.
Existing literature has focused primarily on the fully-automated case, but the resulting techniques cannot reliably detect disinformation on the varied topics, sources, and time scales required for military applications.
By leveraging an already-available analyst as a human-in-the-loop, canonical machine learning techniques of sentiment analysis, aspect-based sentiment analysis, and stance detection become plausible methods to use for a partially-automated disinformation detection system.
arXiv Detail & Related papers (2021-11-09T13:30:34Z) - Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control [67.52000805944924]
Learn then Test (LTT) is a framework for calibrating machine learning models.
Our main insight is to reframe the risk-control problem as multiple hypothesis testing.
We use our framework to provide new calibration methods for several core machine learning tasks with detailed worked examples in computer vision.
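The summary's reframing can be sketched roughly as follows; the miss-rate loss, Hoeffding-based p-value, and Bonferroni correction are illustrative assumptions rather than the paper's exact construction.

```python
# Rough sketch of "risk control as multiple hypothesis testing", as understood
# from the summary. Assumptions (not the paper's exact recipe): a miss-rate
# loss, a Hoeffding-based p-value, and a Bonferroni correction over candidates.
import numpy as np


def hoeffding_p_value(losses, alpha):
    """P-value for H0: expected loss > alpha, given losses bounded in [0, 1]."""
    n = len(losses)
    gap = alpha - np.mean(losses)
    return 1.0 if gap <= 0 else float(np.exp(-2 * n * gap ** 2))


def calibrate_thresholds(scores, labels, lambdas, alpha=0.1, delta=0.05):
    """Keep the thresholds whose miss rate is certified to stay below alpha."""
    valid = []
    for lam in lambdas:
        # Loss = 1 when a positive example's score falls below the threshold.
        losses = ((scores < lam) & (labels == 1)).astype(float)
        if hoeffding_p_value(losses, alpha) <= delta / len(lambdas):  # Bonferroni
            valid.append(lam)
    return valid


# Synthetic calibration data: positives score in [0.7, 1.0], negatives in [0.0, 0.3].
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=2000)
scores = 0.7 * labels + 0.3 * rng.random(2000)
print(calibrate_thresholds(scores, labels, np.linspace(0, 1, 21).tolist()))
```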
arXiv Detail & Related papers (2021-10-03T17:42:03Z) - Discovering Boundary Values of Feature-based Machine Learning Classifiers through Exploratory Datamorphic Testing [7.8729820663730035]
This paper proposes a set of testing strategies for testing machine learning applications in the framework of the datamorphism testing methodology.
Three variants of exploratory strategies are presented with the algorithms implemented in the automated datamorphic testing tool Morphy.
Their capability and cost of discovering boundaries between classes are evaluated via a set of controlled experiments with manually designed subjects and a set of case studies with real machine learning models.
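The strategies themselves are not described in the summary, but the flavour of boundary discovery can be illustrated with a toy bisection between two differently classified points; this is a hypothetical sketch, not the algorithm implemented in Morphy.

```python
# Toy illustration of exploring a classifier's decision boundary by bisection;
# this is a hypothetical sketch, not the algorithm implemented in Morphy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def bisect_boundary(clf, a, b, tol=1e-6):
    """a and b must receive different labels; returns a point near the boundary."""
    label_a = clf.predict(a.reshape(1, -1))[0]
    while np.linalg.norm(b - a) > tol:
        mid = (a + b) / 2
        if clf.predict(mid.reshape(1, -1))[0] == label_a:
            a = mid   # boundary lies between mid and b
        else:
            b = mid   # boundary lies between a and mid
    return (a + b) / 2


X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
clf = LogisticRegression().fit(X, y)
p0, p1 = X[y == 0][0], X[y == 1][0]
# Only meaningful if the classifier assigns the two points different labels.
if clf.predict(p0.reshape(1, -1))[0] != clf.predict(p1.reshape(1, -1))[0]:
    print(bisect_boundary(clf, p0, p1))
```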
arXiv Detail & Related papers (2021-10-01T11:47:56Z) - Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep Learning [66.59455427102152]
We introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks.
Each baseline is a self-contained experiment pipeline with easily reusable and extendable components.
We provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results.
arXiv Detail & Related papers (2021-06-07T23:57:32Z) - Measuring Coding Challenge Competence With APPS [54.22600767666257]
We introduce APPS, a benchmark for code generation.
Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges.
Recent models such as GPT-Neo can pass approximately 15% of the test cases of introductory problems.
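How such a pass rate might be computed can be sketched as follows; this is a simplified stand-in, not the actual APPS evaluation harness, which sandboxes execution and handles far more edge cases.

```python
# Simplified stand-in for computing a test-case pass rate; the real APPS harness
# sandboxes execution and handles many more edge cases than this sketch does.
import subprocess
import sys
import tempfile


def pass_rate(solution_path, io_pairs, timeout=4):
    """Fraction of (stdin, expected_stdout) pairs a candidate program passes."""
    passed = 0
    for stdin_text, expected in io_pairs:
        try:
            result = subprocess.run(
                [sys.executable, solution_path], input=stdin_text,
                capture_output=True, text=True, timeout=timeout)
            ok = result.returncode == 0 and result.stdout.strip() == expected.strip()
        except subprocess.TimeoutExpired:
            ok = False  # count timeouts as failures
        passed += ok
    return passed / len(io_pairs) if io_pairs else 0.0


# Tiny self-contained check with a hand-written "generated" solution.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("print(sum(int(x) for x in input().split()))\n")
print(pass_rate(f.name, [("1 2 3\n", "6"), ("10 20\n", "30")]))
```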
arXiv Detail & Related papers (2021-05-20T17:58:42Z) - ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
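A loose interpretation of that idea, substituting a random-forest surrogate for the Bayesian neural network and omitting the active selection of which points to label, might look as follows; it is a reading of the summary, not the ALT-MAS method.

```python
# Loose interpretation of the stated idea: fit a surrogate on the small labeled
# test set and score the model-under-test against surrogate labels on the rest.
# A random forest stands in for the Bayesian neural network, and the active
# selection of which points to label is omitted entirely.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)
model_under_test = LogisticRegression().fit(X[:1000], y[:1000])

# Of the remaining data, only a small slice counts as labeled test data.
X_pool, y_pool = X[1000:], y[1000:]
labeled = np.arange(100)

surrogate = RandomForestClassifier(n_estimators=200, random_state=0)
surrogate.fit(X_pool[labeled], y_pool[labeled])

# Estimate the accuracy of the model-under-test from surrogate-predicted labels.
estimated = np.mean(model_under_test.predict(X_pool) == surrogate.predict(X_pool))
actual = np.mean(model_under_test.predict(X_pool) == y_pool)
print(f"estimated accuracy {estimated:.3f} vs. actual accuracy {actual:.3f}")
```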
arXiv Detail & Related papers (2021-04-11T12:14:04Z) - Automated Content Grading Using Machine Learning [0.0]
This research project is a primitive experiment in automating the grading of theoretical exam answers written by students in technical courses.
We show how machine learning algorithms can be used to automatically examine and grade theoretical content in exam answer papers.
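The summary does not describe the pipeline, so the following is only a naive, hypothetical baseline: grade an answer by its TF-IDF cosine similarity to a reference answer.

```python
# Naive, hypothetical baseline only; the paper's actual grading pipeline is not
# described in the summary. Grades by TF-IDF cosine similarity to a reference.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def grade(student_answer, reference_answer, max_marks=10):
    """Award marks proportional to similarity with the reference answer."""
    vec = TfidfVectorizer().fit([reference_answer, student_answer])
    sim = cosine_similarity(vec.transform([student_answer]),
                            vec.transform([reference_answer]))[0, 0]
    return round(sim * max_marks, 1)


print(grade("Gradient descent minimises a loss by following its negative gradient.",
            "Gradient descent is an iterative method that minimises a loss function "
            "by taking steps in the direction of the negative gradient."))
```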
arXiv Detail & Related papers (2020-04-08T23:46:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.