Compressing Large Sample Data for Discriminant Analysis
- URL: http://arxiv.org/abs/2005.03858v1
- Date: Fri, 8 May 2020 05:09:08 GMT
- Title: Compressing Large Sample Data for Discriminant Analysis
- Authors: Alexander F. Lapanowski, Irina Gaynanova
- Abstract summary: We consider the computational issues due to large sample size within the discriminant analysis framework.
We propose a new compression approach for reducing the number of training samples for linear and quadratic discriminant analysis.
- Score: 78.12073412066698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-sample data have become prevalent as data acquisition has become
cheaper and easier. While a large sample size has theoretical advantages for many
statistical methods, it presents computational challenges. Sketching, or
compression, is a well-studied approach to address these issues in regression
settings, but considerably less is known about its performance in
classification settings. Here we consider the computational issues due to large
sample size within the discriminant analysis framework. We propose a new
compression approach for reducing the number of training samples for linear and
quadratic discriminant analysis, in contrast to existing compression methods
which focus on reducing the number of features. We support our approach with a
theoretical bound on the misclassification error rate compared to the Bayes
classifier. Empirical studies confirm the significant computational gains of
the proposed method and its superior predictive ability compared to random
sub-sampling.
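Since the abstract does not spell out the compression construction, the following is only a hypothetical numpy sketch of sample compression for LDA: class means are kept exact (they are cheap to compute in one pass), while each class's centered samples are compressed with a Gaussian sketch so that the pooled covariance is estimated from m pseudo-samples per class. All function names are invented for this illustration, not taken from the paper.

```python
# Hypothetical illustration, not the authors' construction: compress the
# number of samples per class before fitting LDA.
import numpy as np

rng = np.random.default_rng(0)

def compress_class(Xc, m):
    # S has i.i.d. N(0, 1/m) entries, so E[(S @ Xc).T @ (S @ Xc)] = Xc.T @ Xc:
    # the m rows of S @ Xc act as pseudo-samples preserving the class scatter.
    S = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, Xc.shape[0]))
    return S @ Xc

def compressed_lda(X, y, m):
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    p, n = X.shape[1], len(y)
    means, scatter = {}, np.zeros((p, p))
    for c in classes:
        Xc = X[y == c]
        means[c] = Xc.mean(axis=0)            # exact class mean (one pass)
        Z = compress_class(Xc - means[c], m)  # m pseudo-samples per class
        scatter += Z.T @ Z
    prec = np.linalg.pinv(scatter / (n - len(classes)))  # pooled precision
    priors = {c: np.mean(y == c) for c in classes}

    def predict(Xnew):
        # Standard LDA discriminant scores, using the compressed covariance.
        scores = np.column_stack([
            Xnew @ prec @ means[c]
            - 0.5 * means[c] @ prec @ means[c]
            + np.log(priors[c])
            for c in classes
        ])
        return classes[np.argmax(scores, axis=1)]

    return predict
```

A dense Gaussian sketch is used here only for clarity; a structured sketch (e.g., a count sketch) would be needed for the compression step itself to be cheap.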
Related papers
- A replica analysis of under-bagging [3.1274367448459253]
Under-bagging (UB) is a popular ensemble learning method for training classifiers on imbalanced data.
Using bagging to reduce the increased variance caused by the smaller sample size after under-sampling is a natural approach.
It has recently been pointed out that, in generalized linear models, naive bagging (which ignores the class-imbalance structure) and ridge regularization can produce the same results.
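For reference, a bare-bones version of the UB procedure under analysis might look like the following (binary labels with class 1 assumed to be the minority; the helper names are ours, not the paper's):

```python
# Minimal under-bagging sketch: each ensemble member is trained on all
# minority samples plus an equal-sized random subset of the majority class.
import numpy as np
from sklearn.linear_model import LogisticRegression

def under_bagging(X, y, n_estimators=25, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    models = []
    for _ in range(n_estimators):
        sub = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, sub])
        models.append(LogisticRegression().fit(X[idx], y[idx]))

    def predict(Xnew):
        # Average member probabilities, then threshold at 1/2.
        proba = np.mean([m.predict_proba(Xnew)[:, 1] for m in models], axis=0)
        return (proba >= 0.5).astype(int)

    return predict
```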
arXiv Detail & Related papers (2024-04-15T13:31:31Z)
- Optimal Multi-Distribution Learning [88.3008613028333]
Multi-distribution learning seeks to learn a shared model that minimizes the worst-case risk across $k$ distinct data distributions.
We propose a novel algorithm that yields an $\varepsilon$-optimal randomized hypothesis with a sample complexity on the order of $(d+k)/\varepsilon^2$.
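The summary states the guarantee rather than the algorithm, so the following is only a generic minimax training sketch, not the paper's method: alternate gradient steps on a logistic model with multiplicative-weights updates over the $k$ distributions, so that training tracks the currently worst-case distribution.

```python
# Generic minimax sketch (illustrative; not the paper's algorithm).
import numpy as np

def worst_case_logistic(datasets, steps=500, lr=0.1, eta=0.05):
    # datasets: list of (X_i, y_i) pairs, y in {0, 1}, one per distribution.
    p = datasets[0][0].shape[1]
    w = np.zeros(p)
    lam = np.full(len(datasets), 1.0 / len(datasets))  # distribution weights

    def loss_and_grad(X, y):
        s = 1.0 / (1.0 + np.exp(-(X @ w)))
        loss = -np.mean(y * np.log(s + 1e-12) + (1 - y) * np.log(1 - s + 1e-12))
        return loss, X.T @ (s - y) / len(y)

    for _ in range(steps):
        pairs = [loss_and_grad(X, y) for X, y in datasets]
        losses = np.array([l for l, _ in pairs])
        # Gradient step on the lambda-weighted risk.
        w -= lr * sum(lam_i * g for lam_i, (_, g) in zip(lam, pairs))
        # Multiplicative weights: up-weight the high-loss distributions.
        lam *= np.exp(eta * losses)
        lam /= lam.sum()
    return w
```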
arXiv Detail & Related papers (2023-12-08T16:06:29Z)
- Collapsed Inference for Bayesian Deep Learning [36.1725075097107]
We introduce a novel collapsed inference scheme that performs Bayesian model averaging using collapsed samples.
A collapsed sample represents uncountably many models drawn from the approximate posterior.
Our proposed use of collapsed samples achieves a balance between scalability and accuracy.
arXiv Detail & Related papers (2023-06-16T08:34:42Z)
- Vector-Valued Least-Squares Regression under Output Regularity Assumptions [73.99064151691597]
We propose and analyse a reduced-rank method for solving least-squares regression problems with infinite dimensional output.
We derive learning bounds for our method and study the settings in which its statistical performance improves on that of the full-rank method.
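The paper treats infinite-dimensional outputs; as a finite-dimensional stand-in for the idea, here is the classical reduced-rank least-squares construction (a sketch under that simplifying assumption):

```python
# Classical reduced-rank regression: fit OLS, then project the fitted values
# onto their top-r right singular subspace to get a rank-r coefficient matrix.
import numpy as np

def reduced_rank_regression(X, Y, r):
    B_ols = np.linalg.pinv(X) @ Y                  # full-rank solution
    _, _, Vt = np.linalg.svd(X @ B_ols, full_matrices=False)
    Vr = Vt[:r].T                                  # top-r output directions
    return B_ols @ Vr @ Vr.T                       # rank-r coefficients
```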
arXiv Detail & Related papers (2022-11-16T15:07:00Z)
- Imbalanced Classification via a Tabular Translation GAN [4.864819846886142]
We present a model based on Generative Adversarial Networks which uses additional regularization losses to map majority samples to corresponding synthetic minority samples.
We show that the proposed method improves average precision when compared to alternative re-weighting and oversampling techniques.
arXiv Detail & Related papers (2022-04-19T06:02:53Z)
- Heavy-tailed Streaming Statistical Estimation [58.70341336199497]
We consider the task of heavy-tailed statistical estimation given streaming $p$-dimensional samples.
We design a clipped gradient descent algorithm and provide an improved analysis under a more nuanced condition on the noise of the stochastic gradients.
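A minimal illustration of the clipping idea, here for streaming mean estimation under a squared loss (the step size and clipping level are placeholders; the paper's conditions and tuning are more refined):

```python
# Clipped SGD sketch: clip each stochastic gradient to norm tau before the
# step, which tames heavy-tailed gradient noise in a single pass.
import numpy as np

def clipped_sgd_mean(stream, p, tau=5.0, lr=0.01):
    # theta minimizes E||x - theta||^2, so the per-sample gradient is
    # 2 * (theta - x).
    theta = np.zeros(p)
    for x in stream:
        g = 2.0 * (theta - x)
        norm = np.linalg.norm(g)
        if norm > tau:
            g *= tau / norm          # clip to the ball of radius tau
        theta -= lr * g
    return theta
```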
arXiv Detail & Related papers (2021-08-25T21:30:27Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
- A Compressive Classification Framework for High-Dimensional Data [12.284934135116515]
We propose a compressive classification framework for settings where the data dimensionality is significantly higher than the sample size.
The proposed method, referred to as compressive regularized discriminant analysis (CRDA), is based on linear discriminant analysis.
It selects significant features by applying joint-sparsity-promoting hard thresholding in the discriminant rule.
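The thresholding step can be pictured as follows (a sketch of joint-sparsity hard thresholding on a discriminant coefficient matrix; the regularized covariance estimation inside CRDA is omitted):

```python
# Joint-sparsity hard thresholding: keep the k feature rows with the largest
# L2 norms and zero the rest, so every class shares the same selected features.
import numpy as np

def hard_threshold_rows(B, k):
    # B: (p, n_classes) coefficient matrix.
    row_norms = np.linalg.norm(B, axis=1)
    keep = np.argsort(row_norms)[-k:]    # indices of the k largest rows
    B_sparse = np.zeros_like(B)
    B_sparse[keep] = B[keep]
    return B_sparse
```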
arXiv Detail & Related papers (2020-05-09T06:55:00Z)
- An Investigation of Why Overparameterization Exacerbates Spurious Correlations [98.3066727301239]
We identify two key properties of the training data that drive this behavior.
We show how the inductive bias of models towards "memorizing" fewer examples can cause overparameterization to hurt.
arXiv Detail & Related papers (2020-05-09T01:59:13Z)
- Minority Class Oversampling for Tabular Data with Deep Generative Models [4.976007156860967]
We study the ability of deep generative models to provide realistic samples that improve performance on imbalanced classification tasks via oversampling.
Our experiments show that the sampling method does not affect the quality of the synthetic samples, but runtime varies widely.
We also observe that the improvements in the performance metrics, while statistically significant, are often minor in absolute terms.
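For context, the simplest non-generative baseline such samplers are compared against is pairwise interpolation of minority samples (SMOTE-style); a minimal version:

```python
# SMOTE-style oversampling sketch: synthesize minority samples by linear
# interpolation between random pairs of existing minority samples.
import numpy as np

def interpolate_minority(X_min, n_new, seed=0):
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    t = rng.random((n_new, 1))
    return X_min[i] + t * (X_min[j] - X_min[i])
```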
arXiv Detail & Related papers (2020-05-07T21:35:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.