Streaming algorithms for evaluating noisy judges on unlabeled data --
binary classification
- URL: http://arxiv.org/abs/2306.01726v3
- Date: Fri, 8 Sep 2023 14:56:36 GMT
- Title: Streaming algorithms for evaluating noisy judges on unlabeled data --
binary classification
- Authors: Andrés Corrada-Emmanuel
- Abstract summary: We search for nearly error-independent trios by using the algebraic failure modes to reject evaluation ensembles as too correlated.
The estimates produced by the surviving ensembles can sometimes be within 1% of the true values.
A Taylor expansion of the estimates produced when independence is assumed but the classifiers are, in fact, slightly correlated helps clarify how the independent evaluator has algebraic `blind spots'.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The evaluation of noisy binary classifiers on unlabeled data is treated as a
streaming task: given a data sketch of the decisions by an ensemble, estimate
the true prevalence of the labels as well as each classifier's accuracy on
them. Two fully algebraic evaluators are constructed to do this. Both are based
on the assumption that the classifiers make independent errors. The first is
based on majority voting. The second, the main contribution of the paper, is
guaranteed to be correct. But how do we know the classifiers are independent on
any given test? This principal/agent monitoring paradox is ameliorated by
exploiting the failures of the independent evaluator to return sensible
estimates. A search for nearly error independent trios is empirically carried
out on the \texttt{adult}, \texttt{mushroom}, and \texttt{two-norm} datasets by
using the algebraic failure modes to reject evaluation ensembles as too
correlated. The searches are refined by constructing a surface in evaluation
space that contains the true value point. The algebra of arbitrarily correlated
classifiers permits the selection of a polynomial subset free of any
correlation variables. Candidate evaluation ensembles are rejected if their
data sketches produce independent estimates too far from the constructed
surface. The estimates produced by the surviving ensembles can sometimes be
within 1\% of the true values. But handling even small amounts of correlation
remains a challenge. A Taylor expansion of the estimates produced when
independence is assumed but the classifiers are, in fact, slightly correlated
helps clarify how the independent evaluator has algebraic `blind spots'.
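A minimal, illustrative Python sketch of the streaming setup is given below. It builds the data sketch for a hypothetical trio of binary classifiers (the counts of the 2^3 = 8 voting patterns on an unlabeled test set, with made-up numbers) and applies the simpler of the two evaluators named in the abstract, the majority-voting one, to estimate the label prevalence and each classifier's accuracy. Treating the majority vote as ground truth is only a heuristic; the paper's main algebraic evaluator, which solves the system implied by error independence, is not reproduced here.

# Hypothetical data sketch for a trio of binary classifiers: counts of the
# 2^3 = 8 voting patterns observed on an unlabeled test set (made-up numbers).
sketch = {
    (1, 1, 1): 340, (1, 1, 0): 40, (1, 0, 1): 35, (0, 1, 1): 30,
    (1, 0, 0): 25, (0, 1, 0): 20, (0, 0, 1): 15, (0, 0, 0): 495,
}

def majority_vote_evaluator(sketch):
    """Estimate label prevalence and per-classifier accuracy by treating the
    majority vote of the trio as if it were the true label."""
    total = sum(sketch.values())
    # Prevalence of label 1 = fraction of items whose majority vote is 1.
    prevalence = sum(n for votes, n in sketch.items() if sum(votes) >= 2) / total
    # A classifier counts as correct on an item when it agrees with the majority.
    accuracies = []
    for i in range(3):
        agree = sum(n for votes, n in sketch.items()
                    if votes[i] == (1 if sum(votes) >= 2 else 0))
        accuracies.append(agree / total)
    return prevalence, accuracies

prevalence, accuracies = majority_vote_evaluator(sketch)
print(f"estimated prevalence of label 1: {prevalence:.3f}")
print("estimated accuracies:", [f"{a:.3f}" for a in accuracies])

Because only the eight pattern counts are needed, the sketch can be updated one item at a time, which is what makes the evaluation a streaming task.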
Related papers
- Correcting Underrepresentation and Intersectional Bias for Classification [49.1574468325115]
We consider the problem of learning from data corrupted by underrepresentation bias.
We show that with a small amount of unbiased data, we can efficiently estimate the group-wise drop-out rates.
We show that our algorithm permits efficient learning for model classes of finite VC dimension.
arXiv Detail & Related papers (2023-06-19T18:25:44Z) - Counterfactually Comparing Abstaining Classifiers [37.43975777164451]
Abstaining classifiers have the option to abstain from making predictions on inputs that they are unsure about.
We introduce a novel approach to evaluating and comparing abstaining classifiers by treating abstentions as missing data.
arXiv Detail & Related papers (2023-05-17T20:46:57Z) - Label-Noise Learning with Intrinsically Long-Tailed Data [65.41318436799993]
We propose a learning framework for label-noise learning with intrinsically long-tailed data.
Specifically, we propose two-stage bi-dimensional sample selection (TABASCO) to better separate clean samples from noisy samples.
arXiv Detail & Related papers (2022-08-21T07:47:05Z) - Learning from Multiple Unlabeled Datasets with Partial Risk
Regularization [80.54710259664698]
In this paper, we aim to learn an accurate classifier without any class labels.
We first derive an unbiased estimator of the classification risk that can be estimated from the given unlabeled sets.
We then find that the classifier obtained as such tends to cause overfitting as its empirical risks go negative during training.
Experiments demonstrate that our method effectively mitigates overfitting and outperforms state-of-the-art methods for learning from multiple unlabeled sets.
arXiv Detail & Related papers (2022-07-04T16:22:44Z) - CARMS: Categorical-Antithetic-REINFORCE Multi-Sample Gradient Estimator [60.799183326613395]
We propose an unbiased estimator for categorical random variables based on multiple mutually negatively correlated (jointly antithetic) samples.
CARMS combines REINFORCE with copula based sampling to avoid duplicate samples and reduce its variance, while keeping the estimator unbiased using importance sampling.
We evaluate CARMS on several benchmark datasets on a generative modeling task, as well as a structured output prediction task, and find it to outperform competing methods including a strong self-control baseline.
arXiv Detail & Related papers (2021-10-26T20:14:30Z) - Specialists Outperform Generalists in Ensemble Classification [15.315432841707736]
In this paper, we address the question of whether we can determine the accuracy of the ensemble.
We explicitly construct the individual classifiers that attain the upper and lower bounds: specialists and generalists.
arXiv Detail & Related papers (2021-07-09T12:16:10Z) - Visualizing Classifier Adjacency Relations: A Case Study in Speaker
Verification and Voice Anti-Spoofing [72.4445825335561]
We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers.
Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores.
While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems.
arXiv Detail & Related papers (2021-06-11T13:03:33Z) - Double Perturbation: On the Robustness of Robustness and Counterfactual
Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z) - Evaluating Fairness of Machine Learning Models Under Uncertain and
Incomplete Information [25.739240011015923]
We show that the test accuracy of the attribute classifier is not always correlated with its effectiveness in bias estimation for a downstream model.
Our analysis has surprising and counter-intuitive implications where in certain regimes one might want to distribute the error of the attribute classifier as unevenly as possible.
arXiv Detail & Related papers (2021-02-16T19:02:55Z) - Verifying Individual Fairness in Machine Learning Models [4.29921861868687]
We consider the problem of whether a given decision model, working with structured data, has individual fairness.
Our objective is to construct verifiers for proving individual fairness of a given model, and we do so by considering appropriate relaxations of the problem.
arXiv Detail & Related papers (2020-06-21T08:37:54Z) - Classifier-independent Lower-Bounds for Adversarial Robustness [13.247278149124757]
We theoretically analyse the limits of robustness to test-time adversarial and noisy examples in classification.
We use optimal transport theory to derive variational formulae for the Bayes-optimal error a classifier can make on a given classification problem.
We derive explicit lower-bounds on the Bayes-optimal error in the case of the popular distance-based attacks.
arXiv Detail & Related papers (2020-06-17T16:46:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.