Large-scale benchmark study of survival prediction methods using
multi-omics data
- URL: http://arxiv.org/abs/2003.03621v1
- Date: Sat, 7 Mar 2020 18:03:17 GMT
- Title: Large-scale benchmark study of survival prediction methods using
multi-omics data
- Authors: Moritz Herrmann, Philipp Probst, Roman Hornung, Vindi Jurinovic,
Anne-Laure Boulesteix
- Abstract summary: Questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time.
We aim to give some answers by means of a large-scale benchmark study using real data.
- Score: 2.204918347869259
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-omics data, that is, datasets containing different types of
high-dimensional molecular variables (often in addition to classical clinical
variables), are increasingly generated for the investigation of various
diseases. Nevertheless, questions remain regarding the usefulness of
multi-omics data for the prediction of disease outcomes such as survival time.
It is also unclear which methods are most appropriate to derive such prediction
models. We aim to give some answers to these questions by means of a
large-scale benchmark study using real data. Different prediction methods from
machine learning and statistics were applied on 18 multi-omics cancer datasets
from the database "The Cancer Genome Atlas", containing from 35 to 1,000
observations and from 60,000 to 100,000 variables. The considered outcome was
the (censored) survival time. Twelve methods based on boosting, penalized
regression and random forest were compared, comprising both methods that do and
that do not take the group structure of the omics variables into account. The
Kaplan-Meier estimate and a Cox model using only clinical variables were used
as reference methods. The methods were compared using several repetitions of
5-fold cross-validation. Uno's C-index and the integrated Brier score served as
performance metrics. The results show that, although multi-omics data can
improve the prediction performance, this is not generally the case. Only the
method block forest slightly outperformed the Cox model on average over all
datasets. Taking the multi-omics group structure into account improves the
predictive performance and protects variables in low-dimensional groups,
especially clinical variables, from being left out of the model. All analyses are
reproducible using freely available R code.
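As a rough illustration of the evaluation pipeline described in the abstract, the sketch below implements the Kaplan-Meier reference estimate and a concordance index in plain numpy on synthetic data. Note that the study uses Uno's C-index (an inverse-probability-of-censoring-weighted variant); Harrell's C is used here only as a simpler, self-contained stand-in, and the patient data are invented for the example.

```python
import numpy as np

def kaplan_meier(times, events):
    """Kaplan-Meier estimate of S(t) for right-censored survival data.
    times: observed times; events: 1 = event observed, 0 = censored."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                   # still under observation at t
        deaths = np.sum((times == t) & (events == 1))  # events occurring at t
        s *= 1.0 - deaths / at_risk                    # product-limit update
        surv.append(s)
    return event_times, np.array(surv)

def harrell_c(times, events, risk):
    """Harrell's concordance index: fraction of comparable pairs that the
    risk scores order correctly (higher risk -> earlier event)."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue                    # a comparable pair starts with an observed event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5   # ties in risk count as half-concordant
    return concordant / comparable

# Synthetic example: 6 patients, two of them censored.
times  = [1, 2, 3, 4, 5, 6]
events = [1, 0, 1, 1, 0, 1]
t, s = kaplan_meier(times, events)                     # survival drops at t = 1, 3, 4, 6
c = harrell_c(times, events, risk=[6, 5, 4, 3, 2, 1])  # perfectly ordered risk scores
```

In the benchmark itself, such metrics would be computed on held-out folds of repeated 5-fold cross-validation and compared against the clinical-only Cox model and the Kaplan-Meier baseline.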
Related papers
- Time-to-event prediction for grouped variables using Exclusive Lasso [0.0]
We propose utilizing Exclusive Lasso regularization in place of standard Lasso penalization.
We apply our methodology to a real-life cancer dataset, demonstrating enhanced survival prediction performance compared to the conventional Cox regression model.
arXiv Detail & Related papers (2025-04-02T09:07:05Z)
- Multi-CATE: Multi-Accurate Conditional Average Treatment Effect Estimation Robust to Unknown Covariate Shifts [12.289361708127876]
We use methodology for learning multi-accurate predictors to post-process CATE T-learners.
We show how this approach can combine (large) confounded observational and (smaller) randomized datasets.
arXiv Detail & Related papers (2024-05-28T14:12:25Z)
- Comparative Analysis of Data Preprocessing Methods, Feature Selection Techniques and Machine Learning Models for Improved Classification and Regression Performance on Imbalanced Genetic Data [0.0]
We investigated the effects of data preprocessing, feature selection techniques, and model selection on the performance of models trained on genetic datasets.
We found that outliers/skew in predictor or target variables did not pose a challenge to regression models.
We also found that class-imbalanced target variables and skewed predictors had little to no impact on classification performance.
arXiv Detail & Related papers (2024-02-22T21:41:27Z)
- Collinear datasets augmentation using Procrustes validation sets [0.0]
We propose a new method for augmentation of numeric and mixed datasets.
The method generates additional data points by utilizing cross-validation resampling and latent variable modeling.
It is particularly efficient for datasets with moderate to high degrees of collinearity.
arXiv Detail & Related papers (2023-12-08T09:07:11Z)
- Kernel Cox partially linear regression: building predictive models for cancer patients' survival [4.230753712933184]
We build a kernel Cox proportional hazards semi-parametric model and propose a novel regularized garrotized kernel machine (RegGKM) method to fit the model.
We use the kernel machine method to describe the complex relationship between survival and predictors, while automatically removing irrelevant parametric and non-parametric predictors.
Our results can help classify patients into groups with different death risks, facilitating treatment for better clinical outcomes.
arXiv Detail & Related papers (2023-10-11T04:27:54Z)
- The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation.
We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of a challenging problem in healthcare.
Within this framework, we train 15 predictive models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z)
- ecpc: An R-package for generic co-data models for high-dimensional prediction [0.0]
The R-package ecpc originally accommodated various, possibly multiple, co-data sources.
We present an extension to the method and software for generic co-data models.
We show how ridge penalties may be transformed to elastic net penalties with the R-package squeezy.
arXiv Detail & Related papers (2022-05-16T12:55:19Z)
- Combining Observational and Randomized Data for Estimating Heterogeneous Treatment Effects [82.20189909620899]
Estimating heterogeneous treatment effects is an important problem across many domains.
Currently, most existing works rely exclusively on observational data.
We propose to estimate heterogeneous treatment effects by combining large amounts of observational data and small amounts of randomized data.
arXiv Detail & Related papers (2022-02-25T18:59:54Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class.
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- MURAL: An Unsupervised Random Forest-Based Embedding for Electronic Health Record Data [59.26381272149325]
We present an unsupervised random forest for representing data with disparate variable types.
MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random.
We show that using our approach, we can visualize and classify data more accurately than competing approaches.
arXiv Detail & Related papers (2021-11-19T22:02:21Z)
- Flexible Model Aggregation for Quantile Regression [92.63075261170302]
Quantile regression is a fundamental problem in statistical learning motivated by a need to quantify uncertainty in predictions.
We investigate methods for aggregating any number of conditional quantile models.
All of the models we consider in this paper can be fit using modern deep learning toolkits.
arXiv Detail & Related papers (2021-02-26T23:21:16Z)
- Tracking disease outbreaks from sparse data with Bayesian inference [55.82986443159948]
The COVID-19 pandemic provides new motivation for estimating the empirical rate of transmission during an outbreak.
Standard methods struggle to accommodate the partial observability and sparse data common at finer scales.
We propose a Bayesian framework which accommodates partial observability in a principled manner.
arXiv Detail & Related papers (2020-09-12T20:37:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.