Minimal Assumptions for Optimal Serology Classification: Theory and
Implications for Multidimensional Settings and Impure Training Data
- URL: http://arxiv.org/abs/2309.00645v1
- Date: Wed, 30 Aug 2023 13:26:49 GMT
- Title: Minimal Assumptions for Optimal Serology Classification: Theory and
Implications for Multidimensional Settings and Impure Training Data
- Authors: Paul N. Patrone, Raquel A. Binder, Catherine S. Forconi, Ann M.
Moormann, Anthony J. Kearsley
- Abstract summary: Minimizing error in prevalence estimates and diagnostic classifiers remains a challenging task in serology.
We propose a technique that uses empirical training data to classify samples and estimate prevalence in arbitrary dimension without direct access to the conditional PDFs.
We validate our methods in the context of synthetic data and a research-use SARS-CoV-2 enzyme-linked immunosorbent (ELISA) assay.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Minimizing error in prevalence estimates and diagnostic classifiers remains a
challenging task in serology. In theory, these problems can be reduced to
modeling class-conditional probability densities (PDFs) of measurement
outcomes, which control all downstream analyses. However, this task quickly
succumbs to the curse of dimensionality, even for assay outputs with only a few
dimensions (e.g. target antigens). To address this problem, we propose a
technique that uses empirical training data to classify samples and estimate
prevalence in arbitrary dimension without direct access to the conditional
PDFs. We motivate this method via a lemma that relates relative conditional
probabilities to minimum-error classification boundaries. This leads us to
formulate an optimization problem that: (i) embeds the data in a parameterized,
curved space; (ii) classifies samples based on their position relative to a
coordinate axis; and (iii) subsequently optimizes the space by minimizing the
empirical classification error of pure training data, for which the classes are
known. Interestingly, the solution to this problem requires use of a
homotopy-type method to stabilize the optimization. We then extend the analysis
to the case of impure training data, for which the classes are unknown. We find
that two impure datasets suffice for both prevalence estimation and
classification, provided they satisfy a linear independence property. Lastly,
we discuss how our analysis unifies discriminative and generative learning
techniques in a common framework based on ideas from set and measure theory.
Throughout, we validate our methods in the context of synthetic data and a
research-use SARS-CoV-2 enzyme-linked immunosorbent (ELISA) assay.
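The three-step optimization can be illustrated with a deliberately simplified, one-dimensional sketch, not the authors' implementation: classify samples by their position relative to a threshold, replace the nonsmooth empirical 0-1 error with a sigmoid surrogate, and stabilize the fit homotopy-style by solving a sequence of progressively sharper surrogates, warm-starting each from the previous solution. The function names, steepness schedule, and step size are illustrative assumptions.

```python
import math
import random

def smoothed_error(theta, xs, ys, k):
    # Smooth surrogate for the empirical 0-1 loss: a sample x with label
    # y (+1/-1) is misclassified when y * (x - theta) < 0; the sigmoid
    # relaxes this indicator, with k controlling how sharp the relaxation is.
    return sum(1.0 / (1.0 + math.exp(k * y * (x - theta)))
               for x, y in zip(xs, ys)) / len(xs)

def fit_threshold(xs, ys, steepness_schedule=(1.0, 4.0, 16.0, 64.0)):
    # Homotopy-style fit: minimize a sequence of progressively sharper
    # smoothed problems, warm-starting each from the previous optimum,
    # so the final solution approximates the minimum-error boundary.
    theta = sum(xs) / len(xs)
    for k in steepness_schedule:
        for _ in range(200):
            eps = 1e-4
            grad = (smoothed_error(theta + eps, xs, ys, k)
                    - smoothed_error(theta - eps, xs, ys, k)) / (2 * eps)
            theta -= 0.5 * grad
    return theta
```

On synthetic 1-D data with negative samples near 0 and positive samples near 2, the fitted threshold lands near the midpoint between the classes.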
Related papers
- Probabilistic Consistency in Machine Learning and Its Connection to Uncertainty Quantification [0.0]
We show that certain types of self-consistent ML models are equivalent to class-conditional probability distributions.
This information is sufficient for tasks such as constructing the multiclass Bayes-optimal classifier and estimating inherent uncertainty in the class assignments.
arXiv Detail & Related papers (2025-07-29T10:27:04Z)
- Analysis of Diagnostics (Part II): Prevalence, Linear Independence, and Unsupervised Learning [0.0]
Part I considered the context of supervised machine learning (ML).
Part II considers the extent to which these results can be extended to tasks in unsupervised learning.
arXiv Detail & Related papers (2024-08-28T13:39:57Z)
- Parametric Classification for Generalized Category Discovery: A Baseline Study [70.73212959385387]
Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples.
We investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem.
We propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks and shows strong robustness to unknown class numbers.
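Entropy regularisation can take several forms; one common choice, shown here as an assumption rather than the paper's exact loss, discourages collapse by maximizing the entropy of the batch-averaged class prediction:

```python
import math

def mean_prediction_entropy(prob_rows):
    # Entropy of the batch-averaged class distribution. Maximizing this
    # term pushes the classifier to spread predictions across classes
    # instead of collapsing onto a few, which combats unreliable
    # pseudo-labels on the unlabelled data.
    k = len(prob_rows[0])
    n = len(prob_rows)
    mean = [sum(row[j] for row in prob_rows) / n for j in range(k)]
    return -sum(p * math.log(p) for p in mean if p > 0)
```

A batch of uniform predictions attains the maximum entropy log(K), while a batch collapsed onto one class scores zero.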
arXiv Detail & Related papers (2022-11-21T18:47:11Z)
- Multi-Label Quantification [78.83284164605473]
Quantification, variously called "labelled prevalence estimation" or "learning to quantify", is the supervised learning task of generating predictors of the relative frequencies of the classes of interest in unlabelled data samples.
We propose methods for inferring estimators of class prevalence values that strive to leverage the dependencies among the classes of interest in order to predict their relative frequencies more accurately.
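A classical baseline for this task, adjusted classify-and-count, is sketched below as background rather than as the paper's method: the observed positive rate in a sample is p*TPR + (1-p)*FPR, so the prevalence p can be recovered by inverting that relation.

```python
def adjusted_count(positive_rate, tpr, fpr):
    # Adjusted classify-and-count: invert the mixture relation
    # positive_rate = p * tpr + (1 - p) * fpr for the prevalence p,
    # then clip to the valid range [0, 1].
    p = (positive_rate - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))
```

For example, with TPR 0.9, FPR 0.1, and an observed positive rate of 0.42, the estimated prevalence is 0.4.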
arXiv Detail & Related papers (2022-11-15T11:29:59Z)
- Statistical Theory for Imbalanced Binary Classification [8.93993657323783]
We show that optimal classification performance depends on certain properties of class imbalance that have not previously been formalized.
Specifically, we propose a novel sub-type of class imbalance, which we call Uniform Class Imbalance.
These results provide some of the first meaningful finite-sample statistical theory for imbalanced binary classification.
arXiv Detail & Related papers (2021-07-05T03:55:43Z)
- Constrained Classification and Policy Learning [0.0]
We study consistency of surrogate loss procedures under a constrained set of classifiers.
We show that hinge losses are the only surrogate losses that preserve consistency in second-best scenarios.
arXiv Detail & Related papers (2021-06-24T10:43:00Z)
- Learning Gaussian Mixtures with Generalised Linear Models: Precise Asymptotics in High-dimensions [79.35722941720734]
Generalised linear models for multi-class classification problems are one of the fundamental building blocks of modern machine learning tasks.
We prove exact asymptotics characterising the estimator obtained in high dimensions via empirical risk minimisation.
We discuss how our theory can be applied beyond the scope of synthetic data.
arXiv Detail & Related papers (2021-06-07T16:53:56Z)
- Deep Learning in current Neuroimaging: a multivariate approach with power and type I error control but arguable generalization ability [0.158310730488265]
A non-parametric framework is proposed that estimates the statistical significance of classifications using deep learning architectures.
A label permutation test is proposed in both studies using cross-validation (CV) and resubstitution with upper bound correction (RUB) as validation methods.
We found in the permutation test that CV and RUB methods offer a false positive rate close to the significance level and an acceptable statistical power.
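A minimal sketch of such a label permutation test follows; the names are illustrative and the paper's CV/RUB machinery is omitted. Shuffling the labels simulates the null hypothesis of no association between predictions and labels:

```python
import random

def accuracy(preds, labels):
    # Fraction of predictions that match the labels.
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def permutation_pvalue(preds, labels, n_perm=2000, seed=0):
    # Permutation test for classification significance: repeatedly
    # shuffle the labels and count how often the shuffled accuracy
    # matches or beats the observed one.
    rng = random.Random(seed)
    observed = accuracy(preds, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if accuracy(preds, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction avoids p = 0
```

Predictions that match the labels perfectly yield a small p-value, while chance-level predictions yield a large one.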
arXiv Detail & Related papers (2021-03-30T21:15:39Z)
- Theoretical Insights Into Multiclass Classification: A High-dimensional Asymptotic View [82.80085730891126]
We provide the first precise high-dimensional analysis of linear multiclass classification.
Our analysis reveals that the classification accuracy is highly distribution-dependent.
The insights gained may pave the way for a precise understanding of other classification algorithms.
arXiv Detail & Related papers (2020-11-16T05:17:29Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
- Learning from Aggregate Observations [82.44304647051243]
We study the problem of learning from aggregate observations where supervision signals are given to sets of instances.
We present a general probabilistic framework that accommodates a variety of aggregate observations.
Simple maximum likelihood solutions can be applied to various differentiable models.
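For the simplest aggregate-observation setting, where every instance is an independent Bernoulli(p) draw and only the per-set counts of positives are observed, the maximum-likelihood solution is available in closed form; this is a toy instance of the framework, not the paper's general model.

```python
import math

def aggregate_loglik(p, set_sums, set_sizes):
    # Binomial log-likelihood of per-set positive counts when each
    # instance is an independent Bernoulli(p) draw and only set-level
    # sums are observed (aggregate supervision).
    ll = 0.0
    for s, n in zip(set_sums, set_sizes):
        ll += (math.lgamma(n + 1) - math.lgamma(s + 1) - math.lgamma(n - s + 1)
               + s * math.log(p) + (n - s) * math.log(1 - p))
    return ll

def mle_rate(set_sums, set_sizes):
    # The maximizer of the likelihood above is simply the pooled
    # fraction of positives across all sets.
    return sum(set_sums) / sum(set_sizes)
```

For instance, sets with positive counts [3, 5, 2] out of 10 instances each give the pooled estimate 10/30, which attains a higher log-likelihood than any other rate.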
arXiv Detail & Related papers (2020-04-14T06:18:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.