Robust Classification of High-Dimensional Data using Data-Adaptive
Energy Distance
- URL: http://arxiv.org/abs/2306.13985v1
- Date: Sat, 24 Jun 2023 14:39:44 GMT
- Authors: Jyotishka Ray Choudhury, Aytijhya Saha, Sarbojit Roy, Subhajit Dutta
- Abstract summary: Classification of high-dimensional low sample size (HDLSS) data poses a challenge in a variety of real-world situations.
This article presents the development and analysis of some classifiers that are specifically designed for HDLSS data.
It is shown that they yield perfect classification in the HDLSS regime, under some fairly general conditions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Classification of high-dimensional low sample size (HDLSS) data poses a
challenge in a variety of real-world situations, such as gene expression
studies, cancer research, and medical imaging. This article presents the
development and analysis of some classifiers that are specifically designed for
HDLSS data. These classifiers are free of tuning parameters and are robust, in
the sense that they are devoid of any moment conditions of the underlying data
distributions. It is shown that they yield perfect classification in the HDLSS
asymptotic regime, under some fairly general conditions. The comparative
performance of the proposed classifiers is also investigated. Our theoretical
results are supported by extensive simulation studies and real data analysis,
which demonstrate promising advantages of the proposed classification
techniques over several widely recognized methods.
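The abstract does not spell out the classifier's exact form, but an energy-distance-style discrimination rule can be sketched as follows. The function names and the specific (non-data-adaptive) score below are illustrative assumptions, not the paper's construction:

```python
import numpy as np

def energy_score(z, X):
    """Energy-distance-style score of a test point z against a class sample X.

    Smaller is better: the cross term rewards closeness to the class, while
    subtracting the within-class spread normalises for the class's own scale.
    """
    cross = np.mean(np.linalg.norm(X - z, axis=1))                           # E||z - X||
    within = np.mean(np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2))  # E||X - X'||
    return 2.0 * cross - within

def energy_classify(z, classes):
    """Assign z to the class with the smallest energy score."""
    scores = [energy_score(z, X) for X in classes]
    return int(np.argmin(scores))
```

Because the score depends only on pairwise distances, no moment conditions on the class distributions are needed, which matches the robustness claim in the abstract.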
Related papers
- Exploring Hierarchical Classification Performance for Time Series Data:
Dissimilarity Measures and Classifier Comparisons [0.0]
This study investigates the comparative performance of hierarchical classification (HC) and flat classification (FC) methodologies in time series data analysis.
Dissimilarity measures, including Jensen-Shannon Distance (JSD), Task Similarity Distance (TSD), and Based Distance (CBD), are leveraged.
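As a point of reference, the Jensen-Shannon Distance between two discrete distributions can be computed as below; this is a minimal NumPy sketch, and `js_distance` is an illustrative name rather than anything from the paper:

```python
import numpy as np

def js_distance(p, q, eps=1e-12):
    """Jensen-Shannon distance between two discrete distributions.

    The square root of the JS divergence (base-2 logs), a metric in [0, 1].
    """
    p = np.asarray(p, float)
    q = np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution
    kl = lambda a, b: np.sum(a * np.log2((a + eps) / (b + eps)))
    return np.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))
```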
arXiv Detail & Related papers (2024-02-07T21:46:26Z) - Plugin estimators for selective classification with out-of-distribution
detection [67.28226919253214]
Real-world classifiers can benefit from abstaining from predicting on samples where they have low confidence.
These settings have been the subject of extensive but disjoint study in the selective classification (SC) and out-of-distribution (OOD) detection literature.
Recent work on selective classification with OOD detection has argued for the unified study of these problems.
We propose new plugin estimators for SCOD that are theoretically grounded, effective, and generalise existing approaches.
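The abstention idea behind selective classification can be illustrated with a simple confidence-threshold rule; this is a sketch of the general mechanism, not the plugin estimators proposed in the paper:

```python
import numpy as np

def selective_predict(probs, threshold=0.7):
    """Predict the argmax class, or abstain (return -1) when the top
    class probability falls below the confidence threshold."""
    probs = np.asarray(probs, float)
    conf = probs.max(axis=-1)
    preds = probs.argmax(axis=-1)
    return np.where(conf >= threshold, preds, -1)
```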
arXiv Detail & Related papers (2023-01-29T07:45:17Z) - Parametric Classification for Generalized Category Discovery: A Baseline
Study [70.73212959385387]
Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples.
We investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem.
We propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks and shows strong robustness to unknown class numbers.
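One common form of entropy regularisation maximises the entropy of the batch-averaged prediction distribution; the sketch below assumes that variant, which may differ from the paper's exact regulariser:

```python
import numpy as np

def entropy_regulariser(probs, eps=1e-12):
    """Entropy of the batch-averaged prediction distribution.

    Maximising this term discourages the classifier from collapsing all
    (pseudo-labelled) samples onto a few classes.
    """
    mean_p = np.asarray(probs, float).mean(axis=0)
    return -np.sum(mean_p * np.log(mean_p + eps))
```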
arXiv Detail & Related papers (2022-11-21T18:47:11Z) - Cancer Subtyping by Improved Transcriptomic Features Using Vector
Quantized Variational Autoencoder [10.835673227875615]
We propose Vector Quantized Variational AutoEncoder (VQ-VAE) to tackle the data issues and extract informative latent features that are crucial to the quality of subsequent clustering.
VQ-VAE does not impose strict assumptions and hence its latent features are better representations of the input, capable of yielding superior clustering performance with any mainstream clustering method.
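The quantisation step at the core of VQ-VAE, mapping each latent vector to its nearest codebook entry, can be sketched as follows (inference-side lookup only, without the training losses):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each latent vector in z to its nearest codebook entry,
    the discretisation step at the heart of VQ-VAE."""
    # pairwise squared distances between latents and codebook entries
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    idx = d.argmin(axis=1)
    return codebook[idx], idx
```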
arXiv Detail & Related papers (2022-07-20T09:47:53Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
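Random oversampling, the simplest of these balancing strategies, can be sketched as follows; `random_oversample` is an illustrative helper, not code from the paper:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Balance a dataset by resampling every minority class with
    replacement until all classes match the majority-class count."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    parts_X, parts_y = [], []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        extra = rng.choice(idx, size=n_max - n, replace=True)  # duplicates
        keep = np.concatenate([idx, extra])
        parts_X.append(X[keep])
        parts_y.append(y[keep])
    return np.concatenate(parts_X), np.concatenate(parts_y)
```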
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
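Contrastive objectives of this kind are typically variants of the InfoNCE loss, which pulls each anchor toward its own positive and away from the other positives in the batch. A minimal NumPy version, not the paper's exact regulariser, is:

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE contrastive loss: each anchor should be most similar
    to its own positive among all positives in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature               # cosine-similarity logits
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))          # cross-entropy on the diagonal
```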
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Population structure-learned classifier for high-dimension
low-sample-size class-imbalanced problem [3.411873646414169]
A Population Structure-learned Classifier (PSC) is proposed.
PSC obtains better generalization performance on imbalanced HDLSS (IHDLSS) data.
PSC is superior to state-of-the-art methods on IHDLSS problems.
arXiv Detail & Related papers (2020-09-10T08:33:39Z) - Solving Long-tailed Recognition with Deep Realistic Taxonomic Classifier [68.38233199030908]
Long-tail recognition tackles the naturally non-uniformly distributed data of real-world scenarios.
While modern classifiers perform well on populated classes, their performance degrades significantly on tail classes.
Deep-RTC is proposed as a new solution to the long-tail problem, combining realism with hierarchical predictions.
arXiv Detail & Related papers (2020-07-20T05:57:42Z) - The classification for High-dimension low-sample size data [3.411873646414169]
We propose a novel classification criterion on HDLSS, tolerance, which emphasizes similarity of within-class variance on the premise of class separability.
According to this criterion, a novel linear binary classifier is designed, denoted by No-separated Data Dispersion Maximum (NPDMD).
NPDMD offers several advantages over state-of-the-art classification methods.
arXiv Detail & Related papers (2020-06-21T07:04:16Z) - A Compressive Classification Framework for High-Dimensional Data [12.284934135116515]
We propose a compressive classification framework for settings where the data dimensionality is significantly higher than the sample size.
The proposed method, referred to as compressive regularized discriminant analysis (CRDA), is based on linear discriminant analysis.
It has the ability to select significant features by using joint-sparsity promoting hard thresholding in the discriminant rule.
arXiv Detail & Related papers (2020-05-09T06:55:00Z)
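The joint-sparsity hard thresholding idea behind CRDA can be illustrated as follows: keep only the feature rows of the coefficient matrix with the largest norms and zero the rest. This sketches the mechanism, not the exact CRDA rule:

```python
import numpy as np

def hard_threshold_rows(B, k):
    """Joint-sparsity hard thresholding: keep the k feature rows of a
    coefficient matrix with the largest L2 row norms, zero the rest.

    Zeroing a whole row drops that feature from every discriminant
    direction at once, which is what performs the feature selection.
    """
    norms = np.linalg.norm(B, axis=1)
    keep = np.argsort(norms)[-k:]  # indices of the k largest rows
    out = np.zeros_like(B)
    out[keep] = B[keep]
    return out
```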
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.