Feature Selection for Imbalanced Data with Deep Sparse Autoencoders
Ensemble
- URL: http://arxiv.org/abs/2103.11678v1
- Date: Mon, 22 Mar 2021 09:17:08 GMT
- Title: Feature Selection for Imbalanced Data with Deep Sparse Autoencoders
Ensemble
- Authors: Michela C. Massi, Francesca Ieva, Francesca Gasperoni and Anna Maria
Paganoni
- Abstract summary: Class imbalance is a common issue in many domain applications of learning algorithms.
We propose a filtering FS algorithm ranking feature importance on the basis of the Reconstruction Error of a Deep Sparse AutoEncoders Ensemble.
We empirically demonstrate the efficacy of our algorithm in several experiments on high-dimensional datasets of varying sample size.
- Score: 0.5352699766206808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Class imbalance is a common issue in many domain applications of learning
algorithms. Oftentimes, in the same domains it is much more relevant to
correctly classify and profile minority class observations. This need can be
addressed by Feature Selection (FS), that offers several further advantages,
s.a. decreasing computational costs, aiding inference and interpretability.
However, traditional FS techniques may become sub-optimal in the presence of
strongly imbalanced data. To achieve FS advantages in this setting, we propose
a filtering FS algorithm ranking feature importance on the basis of the
Reconstruction Error of a Deep Sparse AutoEncoders Ensemble (DSAEE). We use
each DSAE trained only on majority class to reconstruct both classes. From the
analysis of the aggregated Reconstruction Error, we determine the features
where the minority class presents a different distribution of values w.r.t. the
overrepresented one, thus identifying the most relevant features to
discriminate between the two. We empirically demonstrate the efficacy of our
algorithm in several experiments on high-dimensional datasets of varying sample
size, showcasing its capability to select relevant and generalizable features
to profile and classify minority class, outperforming other benchmark FS
methods. We also briefly present a real application in radiogenomics, where the
methodology was applied successfully.
Related papers
- Probabilistic Contrastive Learning for Long-Tailed Visual Recognition [78.70453964041718]
Longtailed distributions frequently emerge in real-world data, where a large number of minority categories contain a limited number of samples.
Recent investigations have revealed that supervised contrastive learning exhibits promising potential in alleviating the data imbalance.
We propose a novel probabilistic contrastive (ProCo) learning algorithm that estimates the data distribution of the samples from each class in the feature space.
arXiv Detail & Related papers (2024-03-11T13:44:49Z) - A new computationally efficient algorithm to solve Feature Selection for
Functional Data Classification in high-dimensional spaces [41.79972837288401]
This paper introduces a novel methodology for Feature Selection for Functional Classification, FSFC, that addresses the challenge of jointly performing feature selection and classification of functional data.
We employ functional principal components and develop a new adaptive version of the Dual Augmented Lagrangian algorithm.
The computational efficiency of FSFC enables handling high-dimensional scenarios where the number of features may considerably exceed the number of statistical units.
arXiv Detail & Related papers (2024-01-11T09:17:25Z) - An Upper Bound for the Distribution Overlap Index and Its Applications [18.481370450591317]
This paper proposes an easy-to-compute upper bound for the overlap index between two probability distributions.
The proposed bound shows its value in one-class classification and domain shift analysis.
Our work shows significant promise toward broadening the applications of overlap-based metrics.
arXiv Detail & Related papers (2022-12-16T20:02:03Z) - Explaining Cross-Domain Recognition with Interpretable Deep Classifier [100.63114424262234]
Interpretable Deep (IDC) learns the nearest source samples of a target sample as evidence upon which the classifier makes the decision.
Our IDC leads to a more explainable model with almost no accuracy degradation and effectively calibrates classification for optimum reject options.
arXiv Detail & Related papers (2022-11-15T15:58:56Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - Divide-and-Conquer Hard-thresholding Rules in High-dimensional
Imbalanced Classification [1.0312968200748118]
We study the impact of imbalance class sizes on the linear discriminant analysis (LDA) in high dimensions.
We show that due to data scarcity in one class, referred to as the minority class, the LDA ignores the minority class yielding a maximum misclassification rate.
We propose a new construction of a hard-conquering rule based on a divide-and-conquer technique that reduces the large difference between the misclassification rates.
arXiv Detail & Related papers (2021-11-05T07:44:28Z) - Learning with Multiclass AUC: Theory and Algorithms [141.63211412386283]
Area under the ROC curve (AUC) is a well-known ranking metric for problems such as imbalanced learning and recommender systems.
In this paper, we start an early trial to consider the problem of learning multiclass scoring functions via optimizing multiclass AUC metrics.
arXiv Detail & Related papers (2021-07-28T05:18:10Z) - No Fear of Heterogeneity: Classifier Calibration for Federated Learning
with Non-IID Data [78.69828864672978]
A central challenge in training classification models in the real-world federated system is learning with non-IID data.
We propose a novel and simple algorithm called Virtual Representations (CCVR), which adjusts the classifier using virtual representations sampled from an approximated ssian mixture model.
Experimental results demonstrate that CCVR state-of-the-art performance on popular federated learning benchmarks including CIFAR-10, CIFAR-100, and CINIC-10.
arXiv Detail & Related papers (2021-06-09T12:02:29Z) - SetConv: A New Approach for Learning from Imbalanced Data [29.366843553056594]
We propose a set convolution operation and an episodic training strategy to extract a single representative for each class.
We prove that our proposed algorithm is permutation-invariant despite the order of inputs.
arXiv Detail & Related papers (2021-04-03T22:33:30Z) - Unbiased Subdata Selection for Fair Classification: A Unified Framework
and Scalable Algorithms [0.8376091455761261]
We show that many classification models within this framework can be recast as mixed-integer convex programs.
We then show that in the proposed problem, when the classification outcomes, "unsolvable subdata selection," is strongly-solvable.
This motivates us to develop an iterative refining strategy (IRS) to solve the classification instances.
arXiv Detail & Related papers (2020-12-22T21:09:38Z) - Theoretical Insights Into Multiclass Classification: A High-dimensional
Asymptotic View [82.80085730891126]
We provide the first modernally precise analysis of linear multiclass classification.
Our analysis reveals that the classification accuracy is highly distribution-dependent.
The insights gained may pave the way for a precise understanding of other classification algorithms.
arXiv Detail & Related papers (2020-11-16T05:17:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.