Improving Omics-Based Classification: The Role of Feature Selection and Synthetic Data Generation
- URL: http://arxiv.org/abs/2505.03387v1
- Date: Tue, 06 May 2025 10:09:50 GMT
- Title: Improving Omics-Based Classification: The Role of Feature Selection and Synthetic Data Generation
- Authors: Diego Perazzolo, Pietro Fanton, Ilaria Barison, Marny Fedrigo, Annalisa Angelini, Chiara Castellani, Enrico Grisan,
- Abstract summary: This study presents a machine learning based classification framework that integrates feature selection with data augmentation techniques.<n>We show that the proposed pipeline yields cross validated perfomance on small dataset.
- Score: 0.18846515534317262
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Given the increasing complexity of omics datasets, a key challenge is not only improving classification performance but also enhancing the transparency and reliability of model decisions. Effective model performance and feature selection are fundamental for explainability and reliability. In many cases, high dimensional omics datasets suffer from limited number of samples due to clinical constraints, patient conditions, phenotypes rarity and others conditions. Current omics based classification models often suffer from narrow interpretability, making it difficult to discern meaningful insights where trust and reproducibility are critical. This study presents a machine learning based classification framework that integrates feature selection with data augmentation techniques to achieve high standard classification accuracy while ensuring better interpretability. Using the publicly available dataset (E MTAB 8026), we explore a bootstrap analysis in six binary classification scenarios to evaluate the proposed model's behaviour. We show that the proposed pipeline yields cross validated perfomance on small dataset that is conserved when the trained classifier is applied to a larger test set. Our findings emphasize the fundamental balance between accuracy and feature selection, highlighting the positive effect of introducing synthetic data for better generalization, even in scenarios with very limited samples availability.
Related papers
- Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z) - CALICO: Confident Active Learning with Integrated Calibration [11.978551396144532]
We propose an AL framework that self-calibrates the confidence used for sample selection during the training process.
We show improved classification performance compared to a softmax-based classifier with fewer labeled samples.
arXiv Detail & Related papers (2024-07-02T15:05:19Z) - Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot
Text Classification Tasks [75.42002070547267]
We propose a self evolution learning (SE) based mixup approach for data augmentation in text classification.
We introduce a novel instance specific label smoothing approach, which linearly interpolates the model's output and one hot labels of the original samples to generate new soft for label mixing up.
arXiv Detail & Related papers (2023-05-22T23:43:23Z) - Variational Classification [51.2541371924591]
We derive a variational objective to train the model, analogous to the evidence lower bound (ELBO) used to train variational auto-encoders.
Treating inputs to the softmax layer as samples of a latent variable, our abstracted perspective reveals a potential inconsistency.
We induce a chosen latent distribution, instead of the implicit assumption found in a standard softmax layer.
arXiv Detail & Related papers (2023-05-17T17:47:19Z) - Stabilizing and Improving Federated Learning with Non-IID Data and
Client Dropout [15.569507252445144]
Label distribution skew induced data heterogeniety has been shown to be a significant obstacle that limits the model performance in federated learning.
We propose a simple yet effective framework by introducing a prior-calibrated softmax function for computing the cross-entropy loss.
The improved model performance over existing baselines in the presence of non-IID data and client dropout is demonstrated.
arXiv Detail & Related papers (2023-03-11T05:17:59Z) - Boosting Differentiable Causal Discovery via Adaptive Sample Reweighting [62.23057729112182]
Differentiable score-based causal discovery methods learn a directed acyclic graph from observational data.
We propose a model-agnostic framework to boost causal discovery performance by dynamically learning the adaptive weights for the Reweighted Score function, ReScore.
arXiv Detail & Related papers (2023-03-06T14:49:59Z) - An Epistemic and Aleatoric Decomposition of Arbitrariness to Constrain the Set of Good Models [7.620967781722717]
Recent research reveals that machine learning (ML) models are highly sensitive to minor changes in their training procedure.<n>We show that stability decomposes into epistemic and aleatoric components, capturing the consistency and confidence in prediction.<n>We propose a model selection procedure that includes epistemic and aleatoric criteria alongside existing accuracy and fairness criteria, and show that it successfully narrows down a large set of good models.
arXiv Detail & Related papers (2023-02-09T09:35:36Z) - Selecting the suitable resampling strategy for imbalanced data
classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z) - Characterizing Fairness Over the Set of Good Models Under Selective
Labels [69.64662540443162]
We develop a framework for characterizing predictive fairness properties over the set of models that deliver similar overall performance.
We provide tractable algorithms to compute the range of attainable group-level predictive disparities.
We extend our framework to address the empirically relevant challenge of selectively labelled data.
arXiv Detail & Related papers (2021-01-02T02:11:37Z) - Utilizing stability criteria in choosing feature selection methods
yields reproducible results in microbiome data [0.9345224141195311]
We compare the performance of popular model prediction metric MSE and proposed criterion Stability in evaluating four widely used feature selection methods.
We conclude that Stability is a preferred feature selection criterion over MSE because it better quantifies the performance of the feature selection method.
arXiv Detail & Related papers (2020-11-30T22:23:26Z) - RENT -- Repeated Elastic Net Technique for Feature Selection [0.46180371154032895]
We present the Repeated Elastic Net Technique (RENT) for Feature Selection.
RENT uses an ensemble of generalized linear models with elastic net regularization, each trained on distinct subsets of the training data.
RENT provides valuable information for model interpretation concerning the identification of objects in the data that are difficult to predict during training.
arXiv Detail & Related papers (2020-09-27T07:55:52Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.