Two-step penalised logistic regression for multi-omic data with an
application to cardiometabolic syndrome
- URL: http://arxiv.org/abs/2008.00235v1
- Date: Sat, 1 Aug 2020 10:36:27 GMT
- Authors: Alessandra Cabassi, Denis Seyres, Mattia Frontini, Paul D. W. Kirk
- Abstract summary: We implement a two-step approach to multi-omic logistic regression in which variable selection is performed on each layer separately.
Our approach should be preferred if the goal is to select as many relevant predictors as possible.
Our proposed approach allows us to identify features that characterise cardiometabolic syndrome at the molecular level.
- Score: 62.997667081978825
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Building classification models that predict a binary class label on the basis
of high dimensional multi-omics datasets poses several challenges, due to the
typically widely differing characteristics of the data layers in terms of
number of predictors, type of data, and levels of noise. Previous research has
shown that applying classical logistic regression with elastic-net penalty to
these datasets can lead to poor results (Liu et al., 2018). We implement a
two-step approach to multi-omic logistic regression in which variable selection
is performed on each layer separately and a predictive model is then built
using the variables selected in the first step. Here, our approach is compared
to other methods that have been developed for the same purpose, and we adapt
existing software for multi-omic linear regression (Zhao and Zucknick, 2020) to
the logistic regression setting. Extensive simulation studies show that our
approach should be preferred if the goal is to select as many relevant
predictors as possible, as well as achieving prediction performances comparable
to those of the best competitors. Our motivating example is a cardiometabolic
syndrome dataset comprising eight 'omic data types for 2 extreme phenotype
groups (10 obese and 10 lipodystrophy individuals) and 185 blood donors. Our
proposed approach allows us to identify features that characterise
cardiometabolic syndrome at the molecular level. R code is available at
https://github.com/acabassi/logistic-regression-for-multi-omic-data.
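The two-step idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' R implementation: it assumes a plain L1 (lasso) penalty in both steps, fitted by proximal gradient descent, and it uses synthetic data. The function names (`fit_l1_logistic`, `two_step`), the penalty values, and the two toy layers are all hypothetical choices made for illustration.

```python
import math
import random


def soft_threshold(x, t):
    """Shrink x towards zero by t (proximal operator of the L1 penalty)."""
    return math.copysign(max(abs(x) - t, 0.0), x)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_l1_logistic(X, y, lam, step=0.1, n_iter=500):
    """L1-penalised logistic regression by proximal gradient descent.

    Returns (intercept, coefficients). The intercept is not penalised.
    """
    n = len(X)
    p = len(X[0]) if n else 0
    b0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        # Residuals: predicted probability minus observed label.
        resid = [
            sigmoid(b0 + sum(b * x for b, x in zip(beta, row))) - yi
            for row, yi in zip(X, y)
        ]
        b0 -= step * sum(resid) / n
        for j in range(p):
            grad_j = sum(r * row[j] for r, row in zip(resid, X)) / n
            beta[j] = soft_threshold(beta[j] - step * grad_j, step * lam)
    return b0, beta


def two_step(layers, y, lam_select=0.1, lam_final=0.01):
    """Step 1: select variables within each layer separately.
    Step 2: fit one predictive model on the union of selected variables.
    """
    selected = []  # (layer index, column index) pairs
    for k, X in enumerate(layers):
        _, beta = fit_l1_logistic(X, y, lam_select)
        selected += [(k, j) for j, b in enumerate(beta) if b != 0.0]
    X_sel = [[layers[k][i][j] for k, j in selected] for i in range(len(y))]
    intercept, coefs = fit_l1_logistic(X_sel, y, lam_final)
    return selected, intercept, coefs


# Synthetic example (assumption): two layers, one informative variable in each.
random.seed(0)
n = 60
layer_a = [[random.gauss(0, 1) for _ in range(5)] for _ in range(n)]
layer_b = [[random.gauss(0, 1) for _ in range(8)] for _ in range(n)]
y = [
    1 if 2.0 * a[0] - 2.0 * b[3] + random.gauss(0, 0.5) > 0 else 0
    for a, b in zip(layer_a, layer_b)
]

selected, intercept, coefs = two_step([layer_a, layer_b], y)
print("selected (layer, column):", selected)
print("final coefficients:", [round(c, 3) for c in coefs])
```

Because selection is run layer by layer, a small layer is not crowded out by a much larger or noisier one, which is the motivation the abstract gives for treating layers separately. In practice the penalty strengths would be tuned, for example by cross-validation, rather than fixed as they are here.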
Related papers
- Comparative Analysis of Data Preprocessing Methods, Feature Selection
Techniques and Machine Learning Models for Improved Classification and
Regression Performance on Imbalanced Genetic Data [0.0]
We investigated the effects of data preprocessing, feature selection techniques, and model selection on the performance of models trained on genetic datasets.
We found that outliers/skew in predictor or target variables did not pose a challenge to regression models.
We also found that class-imbalanced target variables and skewed predictors had little to no impact on classification performance.
arXiv Detail & Related papers (2024-02-22T21:41:27Z)
- A Novel Approach in Solving Stochastic Generalized Linear Regression via
Nonconvex Programming [1.6874375111244329]
This paper considers a generalized linear regression model as a problem with chance constraints.
The results of the proposed algorithm were over 1 to 2 percent better than the ordinary logistic regression model.
arXiv Detail & Related papers (2024-01-16T16:45:51Z)
- Bayesian predictive modeling of multi-source multi-way data [0.0]
We consider molecular data from multiple 'omics sources as predictors of early-life iron deficiency (ID) in a rhesus monkey model.
We use a linear model with a low-rank structure on the coefficients to capture multi-way dependence.
We show that our model performs as expected in terms of misclassification rates and correlation of estimated coefficients with true coefficients.
arXiv Detail & Related papers (2022-08-05T21:58:23Z)
- Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
- Dynamically-Scaled Deep Canonical Correlation Analysis [77.34726150561087]
Canonical Correlation Analysis (CCA) is a method for feature extraction of two views by finding maximally correlated linear projections of them.
We introduce a novel dynamic scaling method for training an input-dependent canonical correlation model.
arXiv Detail & Related papers (2022-03-23T12:52:49Z)
- X-model: Improving Data Efficiency in Deep Learning with A Minimax Model [78.55482897452417]
We aim at improving data efficiency for both classification and regression setups in deep learning.
To take the power of both worlds, we propose a novel X-model.
X-model plays a minimax game between the feature extractor and task-specific heads.
arXiv Detail & Related papers (2021-10-09T13:56:48Z)
- Flexible Model Aggregation for Quantile Regression [92.63075261170302]
Quantile regression is a fundamental problem in statistical learning motivated by a need to quantify uncertainty in predictions.
We investigate methods for aggregating any number of conditional quantile models.
All of the models we consider in this paper can be fit using modern deep learning toolkits.
arXiv Detail & Related papers (2021-02-26T23:21:16Z)
- The MELODIC family for simultaneous binary logistic regression in a
reduced space [0.5330240017302619]
We propose the MELODIC family for simultaneous binary logistic regression modeling.
The model may be interpreted in terms of logistic regression coefficients or in terms of a biplot.
Two applications are shown in detail: one relating personality characteristics to drug consumption profiles and one relating personality characteristics to depressive and anxiety disorders.
arXiv Detail & Related papers (2021-02-16T15:47:20Z)
- User-Dependent Neural Sequence Models for Continuous-Time Event Data [27.45413274751265]
Continuous-time event data are common in applications such as individual behavior data, financial transactions, and medical health records.
Recurrent neural networks that parameterize time-varying intensity functions are the current state-of-the-art for predictive modeling with such data.
In this paper, we extend the broad class of neural marked point process models to mixtures of latent embeddings.
arXiv Detail & Related papers (2020-11-06T08:32:57Z)
- Robust Finite Mixture Regression for Heterogeneous Targets [70.19798470463378]
We propose an FMR model that finds sample clusters and jointly models multiple incomplete mixed-type targets simultaneously.
We provide non-asymptotic oracle performance bounds for our model under a high-dimensional learning framework.
The results show that our model can achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-10-12T03:27:07Z)