Evaluating unsupervised disentangled representation learning for genomic
discovery and disease risk prediction
- URL: http://arxiv.org/abs/2307.08893v1
- Date: Mon, 17 Jul 2023 23:28:59 GMT
- Title: Evaluating unsupervised disentangled representation learning for genomic
discovery and disease risk prediction
- Authors: Taedong Yun
- Abstract summary: We consider multiple unsupervised learning methods for learning disentangled representations, namely autoencoders, VAE, beta-VAE, and FactorVAE.
We observed improvements in the number of genome-wide significant loci, heritability, and performance of polygenic risk scores for asthma and chronic obstructive pulmonary disease by using FactorVAE or beta-VAE.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-dimensional clinical data have become invaluable resources for genetic
studies, due to their accessibility in biobank-scale datasets and the
development of high performance modeling techniques especially using deep
learning. Recent work has shown that low dimensional embeddings of these
clinical data learned by variational autoencoders (VAE) can be used for
genome-wide association studies and polygenic risk prediction. In this work, we
consider multiple unsupervised learning methods for learning disentangled
representations, namely autoencoders, VAE, beta-VAE, and FactorVAE, in the
context of genetic association studies. Using spirograms from UK Biobank as a
running example, we observed improvements in the number of genome-wide
significant loci, heritability, and performance of polygenic risk scores for
asthma and chronic obstructive pulmonary disease by using FactorVAE or
beta-VAE, compared to standard VAE or non-variational autoencoders. FactorVAEs
performed effectively across multiple values of the regularization
hyperparameter, while beta-VAEs were much more sensitive to the hyperparameter
values.
Related papers
- Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models [70.64969663547703]
AdaCVD is an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on over half a million participants from the UK Biobank.<n>It addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data.
arXiv Detail & Related papers (2025-05-30T14:42:02Z) - SAVAE: Leveraging the variational Bayes autoencoder for survival
analysis [10.0060346233449]
We introduce SAVAE (Survival Analysis Variational Autoencoder), a novel approach based on Variational Autoencoders.
Savoe contributes significantly to the field by introducing a tailored ELBO formulation for survival analysis.
It offers a general method that consistently performs well on various metrics, demonstrating robustness and stability through different experiments.
arXiv Detail & Related papers (2023-12-22T12:36:50Z) - MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data
Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion.
It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space.
It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z) - Embracing assay heterogeneity with neural processes for markedly
improved bioactivity predictions [0.276240219662896]
Predicting the bioactivity of a ligand is one of the hardest and most important challenges in computer-aided drug discovery.
Despite years of data collection and curation efforts, bioactivity data remains sparse and heterogeneous.
We present a hierarchical meta-learning framework that exploits the information synergy across disparate assays.
arXiv Detail & Related papers (2023-08-17T16:26:58Z) - Machine Learning Methods for Cancer Classification Using Gene Expression
Data: A Review [77.34726150561087]
Cancer is the second major cause of death after cardiovascular diseases.
Gene expression can play a fundamental role in the early detection of cancer.
This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods.
arXiv Detail & Related papers (2023-01-28T15:03:03Z) - Cancer Subtyping by Improved Transcriptomic Features Using Vector
Quantized Variational Autoencoder [10.835673227875615]
We propose Vector Quantized Variational AutoEncoder (VQ-VAE) to tackle the data issues and extract informative latent features that are crucial to the quality of subsequent clustering.
VQ-VAE does not impose strict assumptions and hence its latent features are better representations of the input, capable of yielding superior clustering performance with any mainstream clustering method.
arXiv Detail & Related papers (2022-07-20T09:47:53Z) - Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence
Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio.
We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z) - Label scarcity in biomedicine: Data-rich latent factor discovery
enhances phenotype prediction [102.23901690661916]
Low-dimensional embedding spaces can be derived from the UK Biobank population dataset to enhance data-scarce prediction of health indicators, lifestyle and demographic characteristics.
Performances gains from semisupervison approaches will probably become an important ingredient for various medical data science applications.
arXiv Detail & Related papers (2021-10-12T16:25:50Z) - Bootstrapping Your Own Positive Sample: Contrastive Learning With
Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model.
We introduce two unique positive sampling strategies specifically tailored for EHR data.
Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z) - Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype
Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z) - Low-Rank Reorganization via Proportional Hazards Non-negative Matrix
Factorization Unveils Survival Associated Gene Clusters [9.773075235189525]
In this work, Cox proportional hazards regression is integrated with NMF by imposing survival constraints.
Using human cancer gene expression data, the proposed technique can unravel critical clusters of cancer genes.
The discovered gene clusters reflect rich biological implications and can help identify survival-related biomarkers.
arXiv Detail & Related papers (2020-08-09T17:59:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.