A Model-Agnostic Algorithm for Bayes Error Determination in Binary
Classification
- URL: http://arxiv.org/abs/2107.11609v1
- Date: Sat, 24 Jul 2021 13:55:31 GMT
- Authors: Umberto Michelucci, Michela Sperti, Dario Piga, Francesca Venturini,
Marco A. Deriu
- Abstract summary: The ILD algorithm is a novel technique to determine the best possible performance, measured in terms of AUC (area under the ROC curve) and accuracy, that can be obtained from a given dataset.
This limit, namely the Bayes error, is completely independent of any model used and describes an intrinsic property of the dataset.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the intrinsic limit determination algorithm (ILD
Algorithm), a novel technique to determine the best possible performance,
measured in terms of the AUC (area under the ROC curve) and accuracy, that can
be obtained from a specific dataset in a binary classification problem with
categorical features, regardless of the model used. This limit, namely the
Bayes error, is completely independent of any model used and describes an
intrinsic property of the dataset. The ILD algorithm thus provides important
information regarding the prediction limits of any binary classification
algorithm when applied to the considered dataset. In this paper the algorithm
is described in detail, its full mathematical framework is presented, and
pseudocode is given to facilitate its implementation. Finally, an example with
a real dataset is given.
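The core idea can be sketched in a few lines. With purely categorical features, the Bayes-optimal classifier predicts the majority class within each distinct feature combination ("cell"), so the lowest achievable error is the total minority-class mass across cells. The following is a minimal plug-in illustration of that principle on an empirical distribution, not the authors' exact ILD pseudocode:

```python
from collections import Counter

def empirical_bayes_error(X, y):
    """Plug-in Bayes error estimate for categorical features.

    Groups samples by their exact feature combination; within each cell
    the best possible classifier predicts the majority label, so the
    minority-label count in each cell is unavoidable error.
    """
    counts = Counter()                       # counts of (cell, label) pairs
    for x, label in zip(X, y):
        counts[(tuple(x), label)] += 1
    per_cell = {}                            # cell -> {label: count}
    for (cell, label), n in counts.items():
        per_cell.setdefault(cell, {})[label] = n
    # minority mass in each ambiguous cell is unavoidable error
    errors = sum(min(c.values()) for c in per_cell.values() if len(c) > 1)
    return errors / len(y)

# The cell (0,) is ambiguous (both labels occur), so no model can
# classify those three rows perfectly.
X = [(0,), (0,), (0,), (1,), (1,)]
y = [0, 0, 1, 1, 1]
print(empirical_bayes_error(X, y))  # -> 0.2  (1 minority row out of 5)
```

On a finite sample this is only an estimate of the population Bayes error; the paper's contribution lies in the framework around this limit and in relating it to AUC as well as accuracy.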
Related papers
- Encoding of data sets and algorithms [0.0]
In many high-impact applications, it is important to ensure the quality of output of a machine learning algorithm.
We have initiated a mathematically rigorous theory to decide which models are close to each other in terms of certain metrics.
A given threshold metric acting on this grid will express the nearness (or statistical distance) from each algorithm and data set of interest to any given application.
arXiv Detail & Related papers (2023-03-02T05:29:27Z) - Learning to Bound Counterfactual Inference in Structural Causal Models
from Observational and Randomised Data [64.96984404868411]
We derive a likelihood characterisation for the overall data that leads us to extend a previous EM-based algorithm.
The new algorithm learns to approximate the (unidentifiability) region of model parameters from such mixed data sources.
It delivers interval approximations to counterfactual results, which collapse to points in the identifiable case.
arXiv Detail & Related papers (2022-12-06T12:42:11Z) - Dataset Complexity Assessment Based on Cumulative Maximum Scaled Area
Under Laplacian Spectrum [38.65823547986758]
It is meaningful to predict classification performance by assessing the complexity of datasets effectively before training DCNN models.
This paper proposes a novel method called cumulative maximum scaled Area Under Laplacian Spectrum (cmsAULS)
arXiv Detail & Related papers (2022-09-29T13:02:04Z) - Riemannian classification of EEG signals with missing values [67.90148548467762]
This paper proposes two strategies to handle missing data for the classification of electroencephalograms.
The first approach estimates the covariance from imputed data with the $k$-nearest neighbors algorithm; the second relies on the observed data by leveraging the observed-data likelihood within an expectation-maximization algorithm.
As the results show, the proposed strategies perform better than classification based on the observed data alone and maintain high accuracy even as the missing-data ratio increases.
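The first strategy can be sketched generically as follows. This is plain k-nearest-neighbour mean imputation with a Euclidean distance over the coordinates observed in both rows; it is an illustrative sketch of the imputation step only, not the paper's covariance-estimation pipeline:

```python
import math

def knn_impute(rows, k=2):
    """Fill None entries with the mean of the k nearest complete rows.

    Distance is Euclidean over coordinates present in both rows, scaled
    by the number of shared coordinates so rows with different amounts
    of missingness remain comparable.
    """
    complete = [r for r in rows if None not in r]
    out = []
    for r in rows:
        if None not in r:
            out.append(list(r))
            continue
        def dist(c):
            shared = [(a, b) for a, b in zip(r, c) if a is not None]
            return math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
        neighbours = sorted(complete, key=dist)[:k]
        out.append([v if v is not None
                    else sum(c[i] for c in neighbours) / len(neighbours)
                    for i, v in enumerate(r)])
    return out

data = [[1.0, 2.0], [1.1, 2.1], [5.0, 6.0], [1.05, None]]
print(knn_impute(data, k=2)[-1])  # gap filled from the two nearest rows
```

The EM alternative in the paper avoids imputation altogether by maximizing the observed-data likelihood, which is generally preferable when the missingness is substantial.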
arXiv Detail & Related papers (2021-10-19T14:24:50Z) - Evaluating State-of-the-Art Classification Models Against Bayes
Optimality [106.50867011164584]
We show that we can compute the exact Bayes error of generative models learned using normalizing flows.
We use our approach to conduct a thorough investigation of state-of-the-art classification models.
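The reason generative models make this possible is that they expose tractable class-conditional likelihoods, and the Bayes error is the integral of the pointwise minimum of the prior-weighted densities. A minimal one-dimensional numerical sketch with two assumed Gaussian densities (standing in for the paper's normalizing-flow models):

```python
import math

def bayes_error_1d(p0, p1, prior0=0.5, lo=-10.0, hi=10.0, n=200_000):
    """Bayes error = integral of min(prior0*p0(x), prior1*p1(x)) dx,
    approximated by a midpoint Riemann sum on [lo, hi]."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        total += min(prior0 * p0(x), (1 - prior0) * p1(x)) * h
    return total

def gauss(mu, sigma):
    """Density of a 1-D Gaussian with mean mu and std sigma."""
    return lambda x: math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Unit-variance Gaussians at 0 and 2 with equal priors: the exact
# Bayes error is Phi(-1), approximately 0.1587.
err = bayes_error_1d(gauss(0, 1), gauss(2, 1))
print(round(err, 4))  # -> 0.1587
```

In high dimensions the integral is intractable in closed form, which is why the paper evaluates it via the learned flow densities rather than by quadrature.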
arXiv Detail & Related papers (2021-06-07T06:21:20Z) - Estimating leverage scores via rank revealing methods and randomization [50.591267188664666]
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank.
Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms.
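For context, the exact leverage scores, i.e. the diagonal of the hat matrix A(AᵀA)⁻¹Aᵀ, can be read off a thin QR factorization as the squared row norms of Q; the paper's contribution is approximating them cheaply at scale. A small exact-computation sketch using modified Gram-Schmidt, assuming full column rank (not the authors' randomized estimator):

```python
import math

def leverage_scores(A):
    """Exact leverage scores of a full-column-rank matrix A (list of rows):
    orthonormalize the columns with modified Gram-Schmidt, then each
    leverage score is the squared norm of the corresponding row of Q."""
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    q = []                                   # orthonormal basis of col(A)
    for c in cols:
        v = c[:]
        for u in q:                          # subtract projections onto earlier basis vectors
            dot = sum(vi * ui for vi, ui in zip(v, u))
            v = [vi - dot * ui for vi, ui in zip(v, u)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        q.append([vi / norm for vi in v])
    return [sum(u[i] ** 2 for u in q) for i in range(m)]

# Regression design [1, x] with x = 0..3: the extreme rows have the
# highest leverage, and the scores always sum to the rank (here 2).
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
scores = leverage_scores(A)
print([round(s, 2) for s in scores])  # -> [0.7, 0.3, 0.3, 0.7]
```

This exact route costs O(mn²), which is exactly what the paper's rank-revealing and randomized-sketching combination is designed to avoid on large or sparse inputs.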
arXiv Detail & Related papers (2021-05-23T19:21:55Z) - SECODA: Segmentation- and Combination-Based Detection of Anomalies [0.0]
SECODA is an unsupervised non-parametric anomaly detection algorithm for datasets containing continuous and categorical attributes.
The algorithm has a low memory footprint and its runtime performance scales linearly with the size of the dataset.
An evaluation with simulated and real-life datasets shows that this algorithm is able to identify many different types of anomalies.
arXiv Detail & Related papers (2020-08-16T10:03:14Z) - Finite-sample Analysis of Interpolating Linear Classifiers in the
Overparameterized Regime [16.1176305285103]
We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification.
We analyze this algorithm applied to random data including misclassification noise.
arXiv Detail & Related papers (2020-04-25T00:06:18Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z) - A General Method for Robust Learning from Batches [56.59844655107251]
We consider a general framework of robust learning from batches, and determine the limits of both classification and distribution estimation over arbitrary, including continuous, domains.
We derive the first robust computationally-efficient learning algorithms for piecewise-interval classification, and for piecewise-polynomial, monotone, log-concave, and Gaussian-mixture distribution estimation.
arXiv Detail & Related papers (2020-02-25T18:53:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.