Differential testing for machine learning: an analysis for
classification algorithms beyond deep learning
- URL: http://arxiv.org/abs/2207.11976v1
- Date: Mon, 25 Jul 2022 08:27:01 GMT
- Title: Differential testing for machine learning: an analysis for
classification algorithms beyond deep learning
- Authors: Steffen Herbold, Steffen Tunkel
- Abstract summary: We conduct a case study using Scikit-learn, Weka, Spark MLlib, and Caret.
We identify the potential of differential testing by considering which algorithms are available in multiple frameworks.
Feasibility seems limited because it is often not possible to determine configurations that are equivalent across frameworks.
- Score: 7.081604594416339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Context: Differential testing is a software testing approach that
uses different implementations of the same algorithm and compares their
results. In recent years, this approach has been used successfully in test
campaigns for deep learning frameworks.
Objective: Little is known about the application of differential testing
beyond deep learning. In this article, we aim to close this gap for
classification algorithms.
Method: We conduct a case study using Scikit-learn, Weka, Spark MLlib, and
Caret in which we assess the potential of differential testing by considering
which algorithms are available in multiple frameworks, its feasibility by
identifying pairs of algorithms that should exhibit the same behavior, and its
effectiveness by executing tests for the identified pairs and analyzing the
deviations.
Results: While we found large potential for popular algorithms, feasibility
seems limited because it is often not possible to determine configurations
that are equivalent across frameworks. Executing the feasible tests revealed a
large number of deviations in both scores and predicted classes. Only a
lenient oracle based on the statistical significance of class differences
avoids a huge number of test failures.
Conclusions: The potential of differential testing beyond deep learning seems
limited for research into the quality of machine learning libraries.
Practitioners may still use the approach if they have deep knowledge of the
implementations, especially if a coarse oracle that only considers significant
differences between predicted classes is sufficient.
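To make the described workflow concrete, the following is a minimal sketch of the differential-testing idea from the abstract. It is illustrative only: two scikit-learn solvers ("lbfgs" and "liblinear") stand in for two framework implementations of the same algorithm with nominally identical configurations, and an exact McNemar-style test stands in for the lenient oracle based on statistical significance of classes; the paper's actual algorithm pairs, tolerances, and oracle definitions may differ.

```python
# Hedged sketch of differential testing for classification (not the paper's
# actual setup): two solvers of the same algorithm act as two "implementations".
import numpy as np
from scipy.stats import binomtest
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Two implementations of the same algorithm with nominally equal configurations.
impl_a = LogisticRegression(solver="lbfgs", C=1.0, max_iter=1000).fit(X_tr, y_tr)
impl_b = LogisticRegression(solver="liblinear", C=1.0, max_iter=1000).fit(X_tr, y_tr)

pred_a, pred_b = impl_a.predict(X_te), impl_b.predict(X_te)
score_a = impl_a.predict_proba(X_te)[:, 1]
score_b = impl_b.predict_proba(X_te)[:, 1]

# Strict oracles: scores should agree within a tolerance, classes should match.
print(f"max score deviation:  {np.max(np.abs(score_a - score_b)):.4f}")
print(f"class deviation rate: {np.mean(pred_a != pred_b):.4f}")

# Lenient oracle (assumed here: exact McNemar test): only fail the test when
# the two implementations disagree in a statistically significant way.
b = int(np.sum((pred_a == y_te) & (pred_b != y_te)))
c = int(np.sum((pred_a != y_te) & (pred_b == y_te)))
p_value = binomtest(b, b + c, 0.5).pvalue if b + c > 0 else 1.0
print("lenient oracle:", "PASS" if p_value >= 0.05 else "FAIL", f"(p={p_value:.3f})")
```

The strict oracles compare scores and predicted classes directly, while the lenient oracle tolerates individual deviations and only flags differences that are statistically significant, mirroring the coarse oracle the conclusions recommend.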
Related papers
- Computability of Classification and Deep Learning: From Theoretical Limits to Practical Feasibility through Quantization [53.15874572081944]
We study computability in the deep learning framework from two perspectives.
We show algorithmic limitations in training deep neural networks even in cases where the underlying problem is well-behaved.
Finally, we show that in quantized versions of classification and deep network training, computability restrictions do not arise or can be overcome to a certain degree.
arXiv Detail & Related papers (2024-08-12T15:02:26Z) - Towards Explainable Test Case Prioritisation with Learning-to-Rank Models [6.289767078502329]
Test case prioritisation (TCP) is a critical task in regression testing to ensure quality as software evolves.
We present and discuss scenarios that require different explanations and how the particularities of TCP could influence them.
arXiv Detail & Related papers (2024-05-22T16:11:45Z) - Can Tree Based Approaches Surpass Deep Learning in Anomaly Detection? A
Benchmarking Study [0.6291443816903801]
This paper evaluates a diverse array of machine learning-based anomaly detection algorithms.
The paper contributes significantly by conducting an unbiased comparison of various anomaly detection algorithms.
arXiv Detail & Related papers (2024-02-11T19:12:51Z) - Multi-Dimensional Ability Diagnosis for Machine Learning Algorithms [88.93372675846123]
We propose a task-agnostic evaluation framework Camilla for evaluating machine learning algorithms.
We use cognitive diagnosis assumptions and neural networks to learn the complex interactions among algorithms, samples and the skills of each sample.
In our experiments, Camilla outperforms state-of-the-art baselines in metric reliability, rank consistency, and rank stability.
arXiv Detail & Related papers (2023-07-14T03:15:56Z) - Active Learning Principles for In-Context Learning with Large Language
Models [65.09970281795769]
This paper investigates how Active Learning algorithms can serve as effective demonstration selection methods for in-context learning.
We show that in-context example selection through AL prioritizes high-quality examples that exhibit low uncertainty and bear similarity to the test examples.
arXiv Detail & Related papers (2023-05-23T17:16:04Z) - Towards Diverse Evaluation of Class Incremental Learning: A Representation Learning Perspective [67.45111837188685]
Class incremental learning (CIL) algorithms aim to continually learn new object classes from incrementally arriving data.
We experimentally analyze neural network models trained by CIL algorithms using various evaluation protocols in representation learning.
arXiv Detail & Related papers (2022-06-16T11:44:11Z) - Discovering Boundary Values of Feature-based Machine Learning
Classifiers through Exploratory Datamorphic Testing [7.8729820663730035]
This paper proposes a set of testing strategies for testing machine learning applications in the framework of the datamorphism testing methodology.
Three variants of exploratory strategies are presented with the algorithms implemented in the automated datamorphic testing tool Morphy.
Their capability and cost of discovering borders between classes are evaluated via a set of controlled experiments with manually designed subjects and a set of case studies with real machine learning models.
arXiv Detail & Related papers (2021-10-01T11:47:56Z) - Uncertainty Baselines: Benchmarks for Uncertainty & Robustness in Deep
Learning [66.59455427102152]
We introduce Uncertainty Baselines: high-quality implementations of standard and state-of-the-art deep learning methods on a variety of tasks.
Each baseline is a self-contained experiment pipeline with easily reusable and extendable components.
We provide model checkpoints, experiment outputs as Python notebooks, and leaderboards for comparing results.
arXiv Detail & Related papers (2021-06-07T23:57:32Z) - Theoretical Insights Into Multiclass Classification: A High-dimensional
Asymptotic View [82.80085730891126]
We provide the first modern precise analysis of linear multiclass classification.
Our analysis reveals that the classification accuracy is highly distribution-dependent.
The insights gained may pave the way for a precise understanding of other classification algorithms.
arXiv Detail & Related papers (2020-11-16T05:17:29Z) - Probabilistic Diagnostic Tests for Degradation Problems in Supervised
Learning [0.0]
Problems such as class imbalance, overlapping, small-disjuncts, noisy labels, and sparseness limit accuracy in classification algorithms.
A probabilistic diagnostic model based on identifying signs and symptoms of each problem is presented.
Behavior and performance of several supervised algorithms are studied when training sets have such problems.
arXiv Detail & Related papers (2020-04-06T20:32:35Z)