Label scarcity in biomedicine: Data-rich latent factor discovery
enhances phenotype prediction
- URL: http://arxiv.org/abs/2110.06135v1
- Date: Tue, 12 Oct 2021 16:25:50 GMT
- Title: Label scarcity in biomedicine: Data-rich latent factor discovery
enhances phenotype prediction
- Authors: Marc-Andre Schulz, Bertrand Thirion, Alexandre Gramfort, Gaël
Varoquaux, Danilo Bzdok
- Abstract summary: Low-dimensional embedding spaces can be derived from the UK Biobank population dataset to enhance data-scarce prediction of health indicators, lifestyle and demographic characteristics.
Performance gains from semi-supervised approaches will probably become an important ingredient for various medical data science applications.
- Score: 102.23901690661916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality data accumulation is now becoming ubiquitous in the health
domain. There is increasing opportunity to exploit rich data from normal
subjects to improve supervised estimators in specific diseases with notorious
data scarcity. We demonstrate that low-dimensional embedding spaces can be
derived from the UK Biobank population dataset and used to enhance data-scarce
prediction of health indicators, lifestyle and demographic characteristics.
Phenotype predictions facilitated by Variational Autoencoder manifolds
typically scaled better with increasing unlabeled data than dimensionality
reduction by PCA or Isomap. Performance gains from semi-supervised approaches
will probably become an important ingredient for various medical data science
applications.
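The workflow the abstract describes (learn a low-dimensional embedding from plentiful unlabeled data, then fit a predictor on a handful of labeled samples in that embedding) can be illustrated with a minimal sketch. This is not the authors' pipeline: it uses PCA (one of the baselines named in the abstract) rather than a Variational Autoencoder, the data are synthetic rather than UK Biobank measurements, and every function name, sample size, and noise level below is an assumption chosen for illustration.

```python
import math
import random


def make_sample(rng, direction, label=None, noise=0.2):
    """Draw one synthetic sample driven by a single latent factor z (assumed model)."""
    z = rng.gauss(0.0, 1.0)
    if label is not None:
        # Force the sign of z so that the requested class is produced.
        z = abs(z) if label == 1 else -abs(z)
    x = [z * w + rng.gauss(0.0, noise) for w in direction]
    return x, (1 if z > 0 else 0)


def top_component(rows, iters=100):
    """Estimate the leading principal axis by power iteration on the covariance."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(a * a for a in v))
        v = [a / norm for a in v]
    return mean, v


def project(x, mean, axis):
    """Embed a sample as its 1-D score along the learned axis."""
    return sum((xi - mi) * ai for xi, mi, ai in zip(x, mean, axis))


def fit_centroids(scores, labels):
    """1-D nearest-centroid classifier fitted on the scarce labeled scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(neg) / len(neg), sum(pos) / len(pos)


def predict(score, centroids):
    c0, c1 = centroids
    return 1 if abs(score - c1) < abs(score - c0) else 0


def run_demo(seed=0, n_unlabeled=500, n_labeled=10, n_test=200, dim=10):
    rng = random.Random(seed)
    direction = [1.0 / math.sqrt(dim)] * dim  # hidden unit-norm loading vector

    # Step 1: learn the embedding from unlabeled data only.
    unlabeled = [make_sample(rng, direction)[0] for _ in range(n_unlabeled)]
    mean, axis = top_component(unlabeled)

    # Step 2: fit a classifier on a few labeled samples, in embedding space.
    labeled = [make_sample(rng, direction, label=i % 2) for i in range(n_labeled)]
    centroids = fit_centroids([project(x, mean, axis) for x, _ in labeled],
                              [y for _, y in labeled])

    # Step 3: evaluate on held-out samples.
    test = [make_sample(rng, direction) for _ in range(n_test)]
    correct = sum(predict(project(x, mean, axis), centroids) == y for x, y in test)
    alignment = abs(sum(a * w for a, w in zip(axis, direction)))
    return correct / n_test, alignment


if __name__ == "__main__":
    acc, align = run_demo()
    print(f"test accuracy: {acc:.2f}, axis alignment: {align:.2f}")
```

The nearest-centroid step is deliberately simple so that the effect of the unsupervised embedding is easy to see; in practice any supervised estimator could be fitted on the embedded features, and a nonlinear embedding such as a VAE could replace the PCA step.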
Related papers
- MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data
Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
- Application of data engineering approaches to address challenges in
microbiome data for optimal medical decision-making [0.0]
The prototype employed in the study addresses the issues inherent to microbiome datasets and could be highly beneficial for providing personalized medicine.
arXiv Detail & Related papers (2023-06-30T05:36:39Z)
- An end-to-end framework for gene expression classification by
integrating a background knowledge graph: application to cancer prognosis
prediction [1.5484595752241122]
We proposed an end-to-end framework to handle secondary data to construct a classification model for primary data.
We applied this framework to cancer prognosis prediction using gene expression data and a biological network.
arXiv Detail & Related papers (2023-06-29T11:20:47Z)
- Generative models improve fairness of medical classifiers under
distribution shifts [49.10233060774818]
We show that learning realistic augmentations automatically from data is possible in a label-efficient manner using generative models.
We demonstrate that these learned augmentations can surpass heuristic ones by making models more robust and statistically fair in- and out-of-distribution.
arXiv Detail & Related papers (2023-04-18T18:15:38Z)
- Unsupervised EHR-based Phenotyping via Matrix and Tensor Decompositions [0.6875312133832078]
We provide a comprehensive review of low-rank approximation-based approaches for computational phenotyping.
Recent developments have adapted low-rank data approximation methods by incorporating different constraints and regularizations that facilitate interpretability further.
arXiv Detail & Related papers (2022-09-01T09:47:27Z)
- Cancer Subtyping by Improved Transcriptomic Features Using Vector
Quantized Variational Autoencoder [10.835673227875615]
We propose Vector Quantized Variational AutoEncoder (VQ-VAE) to tackle the data issues and extract informative latent features that are crucial to the quality of subsequent clustering.
VQ-VAE does not impose strict assumptions and hence its latent features are better representations of the input, capable of yielding superior clustering performance with any mainstream clustering method.
arXiv Detail & Related papers (2022-07-20T09:47:53Z)
- Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
- SumGNN: Multi-typed Drug Interaction Prediction via Efficient Knowledge
Graph Summarization [64.56399911605286]
We propose SumGNN: knowledge summarization graph neural network, which is enabled by a subgraph extraction module.
SumGNN outperforms the best baseline by up to 5.54%, and the performance gain is particularly significant in low data relation types.
arXiv Detail & Related papers (2020-10-04T00:14:57Z)
- Trajectories, bifurcations and pseudotime in large clinical datasets:
applications to myocardial infarction and diabetes data [94.37521840642141]
We suggest a semi-supervised methodology for the analysis of large clinical datasets, characterized by mixed data types and missing values.
The methodology is based on application of elastic principal graphs which can address simultaneously the tasks of dimensionality reduction, data visualization, clustering, feature selection and quantifying the geodesic distances (pseudotime) in partially ordered sequences of observations.
arXiv Detail & Related papers (2020-07-07T21:04:55Z)
- Teacher-Student Domain Adaptation for Biosensor Models [0.0]
We present an approach to domain adaptation, addressing the case where data from the source domain is abundant, labelled data from the target domain is limited or non-existent, and a small amount of paired source-target data is available.
The method is designed for developing deep learning models that detect the presence of medical conditions based on data from consumer-grade portable biosensors.
arXiv Detail & Related papers (2020-03-17T19:09:53Z)