Minimal Assumptions for Optimal Serology Classification: Theory and
Implications for Multidimensional Settings and Impure Training Data
- URL: http://arxiv.org/abs/2309.00645v1
- Date: Wed, 30 Aug 2023 13:26:49 GMT
- Authors: Paul N. Patrone, Raquel A. Binder, Catherine S. Forconi, Ann M.
Moormann, Anthony J. Kearsley
- Abstract summary: Minimizing error in prevalence estimates and diagnostic classifiers remains a challenging task in serology.
We propose a technique that uses empirical training data to classify samples and estimate prevalence in arbitrary dimension without direct access to the conditional PDFs.
We validate our methods in the context of synthetic data and a research-use SARS-CoV-2 enzyme-linked immunosorbent (ELISA) assay.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Minimizing error in prevalence estimates and diagnostic classifiers remains a
challenging task in serology. In theory, these problems can be reduced to
modeling class-conditional probability densities (PDFs) of measurement
outcomes, which control all downstream analyses. However, this task quickly
succumbs to the curse of dimensionality, even for assay outputs with only a few
dimensions (e.g. target antigens). To address this problem, we propose a
technique that uses empirical training data to classify samples and estimate
prevalence in arbitrary dimension without direct access to the conditional
PDFs. We motivate this method via a lemma that relates relative conditional
probabilities to minimum-error classification boundaries. This leads us to
formulate an optimization problem that: (i) embeds the data in a parameterized,
curved space; (ii) classifies samples based on their position relative to a
coordinate axis; and (iii) subsequently optimizes the space by minimizing the
empirical classification error of pure training data, for which the classes are
known. Interestingly, the solution to this problem requires use of a
homotopy-type method to stabilize the optimization. We then extend the analysis
to the case of impure training data, for which the classes are unknown. We find
that two impure datasets suffice for both prevalence estimation and
classification, provided they satisfy a linear independence property. Lastly,
we discuss how our analysis unifies discriminative and generative learning
techniques in a common framework based on ideas from set and measure theory.
Throughout, we validate our methods in the context of synthetic data and a
research-use SARS-CoV-2 enzyme-linked immunosorbent (ELISA) assay.
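The three-step optimization described in the abstract can be sketched numerically. The toy below is an illustrative reading, not the authors' actual method: a simple affine coordinate stands in for the paper's parameterized curved embedding, samples are classified by the sign of that coordinate (their position relative to an axis), and a sigmoid of increasing sharpness serves as the homotopy-type surrogate for the empirical 0/1 classification error. All variable names and distributions are assumptions for the sketch.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Synthetic 2-D "assay" data with two pure (known-class) training sets.
n = 200
neg = rng.normal([0.0, 0.0], 0.5, size=(n, 2))   # seronegative samples
pos = rng.normal([1.5, 1.5], 0.5, size=(n, 2))   # seropositive samples
X = np.vstack([neg, pos])
y = np.hstack([np.zeros(n), np.ones(n)])

def boundary_coord(theta, X):
    # Parameterized embedding coordinate; classification is by its sign.
    # (An affine map here; the paper's embedding is a curved space.)
    a, b, c = theta
    return a * X[:, 0] + b * X[:, 1] + c

def smoothed_error(theta, beta):
    # Homotopy-style surrogate: a sigmoid of sharpness beta approximates
    # the discontinuous 0/1 empirical classification error.
    z = boundary_coord(theta, X)
    p = 1.0 / (1.0 + np.exp(-beta * z))
    return np.mean((p - y) ** 2)

theta = np.array([1.0, 0.0, 0.0])
for beta in [1.0, 4.0, 16.0, 64.0]:   # continuation in the sharpness
    theta = minimize(smoothed_error, theta, args=(beta,)).x

pred = (boundary_coord(theta, X) > 0).astype(float)
error = np.mean(pred != y)
```

Gradually increasing `beta` plays the stabilizing role the abstract attributes to the homotopy method: each solve starts from the previous optimum, so the optimizer never faces the non-smooth empirical error directly.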
Related papers
- Weighted Missing Linear Discriminant Analysis: An Explainable Approach for Classification with Missing Data [1.4840867281815378]
We propose a novel approach to Linear Discriminant Analysis (LDA) under missing data.
We estimate parameters directly on missing data and use a weight matrix for missing values to penalize missing entries during classification.
Experimental results demonstrate that WLDA outperforms conventional methods by a significant margin.
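The weight-matrix idea can be illustrated with a minimal sketch. This is one plausible reading of the summary above, not the authors' exact estimator: class parameters are computed directly on incomplete data with NaN-aware reductions, and a per-sample diagonal weight matrix gives missing coordinates zero weight so they are penalized out of the discriminant score. The data, dimensions, and missingness rate are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 3-D data for two classes, with ~20% of entries missing (NaN).
X0 = rng.normal(0.0, 1.0, size=(100, 3))
X1 = rng.normal(2.0, 1.0, size=(100, 3))
for X in (X0, X1):
    X[rng.random(X.shape) < 0.2] = np.nan

# Estimate parameters directly on the incomplete data.
mu0, mu1 = np.nanmean(X0, axis=0), np.nanmean(X1, axis=0)
var = np.nanvar(np.vstack([X0 - mu0, X1 - mu1]), axis=0)  # pooled, diagonal

def wlda_score(x):
    # Per-sample diagonal weight matrix: observed coordinates get weight 1,
    # missing coordinates weight 0, removing them from the comparison.
    w = np.where(np.isnan(x), 0.0, 1.0)
    x = np.nan_to_num(x)
    d0 = np.sum(w * (x - mu0) ** 2 / var)   # weighted distance to class 0
    d1 = np.sum(w * (x - mu1) ** 2 / var)   # weighted distance to class 1
    return d0 - d1                          # positive -> class 1

preds0 = np.array([wlda_score(x) > 0 for x in X0])
preds1 = np.array([wlda_score(x) > 0 for x in X1])
acc = (np.sum(~preds0) + np.sum(preds1)) / 200
```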
arXiv Detail & Related papers (2024-06-30T14:21:32Z)
- Synergistic eigenanalysis of covariance and Hessian matrices for enhanced binary classification [75.90957645766676]
We present a novel approach that combines the eigenanalysis of a covariance matrix evaluated on a training set with a Hessian matrix evaluated on a deep learning model.
Our approach is substantiated by formal proofs that establish its capability to maximize between-class mean distance and minimize within-class variances.
arXiv Detail & Related papers (2024-02-14T16:10:42Z)
- A cost-sensitive constrained Lasso [2.8265531928694116]
We propose a novel version of the Lasso in which quadratic performance constraints are added to Lasso-based objective functions.
As a result, a constrained sparse regression model is defined by a nonlinear optimization problem.
This cost-sensitive constrained Lasso has a direct application in heterogeneous samples where data are collected from distinct sources.
arXiv Detail & Related papers (2024-01-31T17:36:21Z)
- Stabilizing Subject Transfer in EEG Classification with Divergence Estimation [17.924276728038304]
We propose several graphical models to describe an EEG classification task.
We identify statistical relationships that should hold true in an idealized training scenario.
We design regularization penalties to enforce these relationships in two stages.
arXiv Detail & Related papers (2023-10-12T23:06:52Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
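Random oversampling, the simplest of the resampling strategies mentioned, can be sketched in a few lines. The sizes and 10:1 imbalance ratio below are illustrative assumptions: minority-class examples are duplicated at random until both classes have the same number of instances.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy imbalanced dataset: 100 majority vs 10 minority examples.
X = rng.normal(size=(110, 2))
y = np.hstack([np.zeros(100), np.ones(10)])

# Random oversampling: resample minority indices with replacement
# until the class counts are balanced.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=100 - minority.size, replace=True)
X_bal = np.vstack([X, X[extra]])
y_bal = np.hstack([y, y[extra]])
```

Undersampling works symmetrically, discarding majority examples instead; which strategy is preferable is exactly the dataset-property question the paper above studies.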
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- Near-optimal inference in adaptive linear regression [60.08422051718195]
Even simple methods like least squares can exhibit non-normal behavior when data is collected in an adaptive manner.
We propose a family of online debiasing estimators to correct these distributional anomalies in least squares estimation.
We demonstrate the usefulness of our theory via applications to multi-armed bandit, autoregressive time series estimation, and active learning with exploration.
arXiv Detail & Related papers (2021-07-05T21:05:11Z)
- Binary Classification of Gaussian Mixtures: Abundance of Support Vectors, Benign Overfitting and Regularization [39.35822033674126]
We study binary linear classification under a generative Gaussian mixture model.
We derive novel non-asymptotic bounds on the classification error of the latter.
Our results extend to a noisy model with constant probability noise flips.
arXiv Detail & Related papers (2020-11-18T07:59:55Z)
- Theoretical Insights Into Multiclass Classification: A High-dimensional Asymptotic View [82.80085730891126]
We provide the first modern, precise analysis of linear multiclass classification.
Our analysis reveals that the classification accuracy is highly distribution-dependent.
The insights gained may pave the way for a precise understanding of other classification algorithms.
arXiv Detail & Related papers (2020-11-16T05:17:29Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
- Asymptotic Analysis of an Ensemble of Randomly Projected Linear Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.