Related papers: Evaluating unsupervised disentangled representation learning for genomic discovery and disease risk prediction

Evaluating unsupervised disentangled representation learning for genomic discovery and disease risk prediction

URL: http://arxiv.org/abs/2307.08893v1
Date: Mon, 17 Jul 2023 23:28:59 GMT
Title: Evaluating unsupervised disentangled representation learning for genomic discovery and disease risk prediction
Authors: Taedong Yun
Abstract summary: We consider multiple unsupervised learning methods for learning disentangled representations, namely autoencoders, VAE, beta-VAE, and FactorVAE. We observed improvements in the number of genome-wide significant loci, heritability, and performance of polygenic risk scores for asthma and chronic obstructive pulmonary disease by using FactorVAE or beta-VAE.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: High-dimensional clinical data have become invaluable resources for genetic studies, due to their accessibility in biobank-scale datasets and the development of high performance modeling techniques especially using deep learning. Recent work has shown that low dimensional embeddings of these clinical data learned by variational autoencoders (VAE) can be used for genome-wide association studies and polygenic risk prediction. In this work, we consider multiple unsupervised learning methods for learning disentangled representations, namely autoencoders, VAE, beta-VAE, and FactorVAE, in the context of genetic association studies. Using spirograms from UK Biobank as a running example, we observed improvements in the number of genome-wide significant loci, heritability, and performance of polygenic risk scores for asthma and chronic obstructive pulmonary disease by using FactorVAE or beta-VAE, compared to standard VAE or non-variational autoencoders. FactorVAEs performed effectively across multiple values of the regularization hyperparameter, while beta-VAEs were much more sensitive to the hyperparameter values.

Related papers

Adaptable Cardiovascular Disease Risk Prediction from Heterogeneous Data using Large Language Models [70.64969663547703]
AdaCVD is an adaptable CVD risk prediction framework built on large language models extensively fine-tuned on over half a million participants from the UK Biobank.<n>It addresses key clinical challenges across three dimensions: it flexibly incorporates comprehensive yet variable patient information; it seamlessly integrates both structured data and unstructured text; and it rapidly adapts to new patient populations using minimal additional data.
arXiv Detail & Related papers (2025-05-30T14:42:02Z)
SAVAE: Leveraging the variational Bayes autoencoder for survival analysis [10.0060346233449]
We introduce SAVAE (Survival Analysis Variational Autoencoder), a novel approach based on Variational Autoencoders. Savoe contributes significantly to the field by introducing a tailored ELBO formulation for survival analysis. It offers a general method that consistently performs well on various metrics, demonstrating robustness and stability through different experiments.
arXiv Detail & Related papers (2023-12-22T12:36:50Z)
MedDiffusion: Boosting Health Risk Prediction via Diffusion-based Data Augmentation [58.93221876843639]
This paper introduces a novel, end-to-end diffusion-based risk prediction model, named MedDiffusion. It enhances risk prediction performance by creating synthetic patient data during training to enlarge sample space. It discerns hidden relationships between patient visits using a step-wise attention mechanism, enabling the model to automatically retain the most vital information for generating high-quality data.
arXiv Detail & Related papers (2023-10-04T01:36:30Z)
Embracing assay heterogeneity with neural processes for markedly improved bioactivity predictions [0.276240219662896]
Predicting the bioactivity of a ligand is one of the hardest and most important challenges in computer-aided drug discovery. Despite years of data collection and curation efforts, bioactivity data remains sparse and heterogeneous. We present a hierarchical meta-learning framework that exploits the information synergy across disparate assays.
arXiv Detail & Related papers (2023-08-17T16:26:58Z)
Machine Learning Methods for Cancer Classification Using Gene Expression Data: A Review [77.34726150561087]
Cancer is the second major cause of death after cardiovascular diseases. Gene expression can play a fundamental role in the early detection of cancer. This study reviews recent progress in gene expression analysis for cancer classification using machine learning methods.
arXiv Detail & Related papers (2023-01-28T15:03:03Z)
Cancer Subtyping by Improved Transcriptomic Features Using Vector Quantized Variational Autoencoder [10.835673227875615]
We propose Vector Quantized Variational AutoEncoder (VQ-VAE) to tackle the data issues and extract informative latent features that are crucial to the quality of subsequent clustering. VQ-VAE does not impose strict assumptions and hence its latent features are better representations of the input, capable of yielding superior clustering performance with any mainstream clustering method.
arXiv Detail & Related papers (2022-07-20T09:47:53Z)
Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification [109.81283748940696]
We introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences.
arXiv Detail & Related papers (2022-07-18T19:16:56Z)
Label scarcity in biomedicine: Data-rich latent factor discovery enhances phenotype prediction [102.23901690661916]
Low-dimensional embedding spaces can be derived from the UK Biobank population dataset to enhance data-scarce prediction of health indicators, lifestyle and demographic characteristics. Performances gains from semisupervison approaches will probably become an important ingredient for various medical data science applications.
arXiv Detail & Related papers (2021-10-12T16:25:50Z)
Bootstrapping Your Own Positive Sample: Contrastive Learning With Electronic Health Record Data [62.29031007761901]
This paper proposes a novel contrastive regularized clinical classification model. We introduce two unique positive sampling strategies specifically tailored for EHR data. Our framework yields highly competitive experimental results in predicting the mortality risk on real-world COVID-19 EHR data.
arXiv Detail & Related papers (2021-04-07T06:02:04Z)
Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction [55.94378672172967]
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients. We introduce meta learning techniques to develop a new model, which can extract the common experience or knowledge from interrelated clinical tasks. Our new model is built upon a carefully designed meta-learner, called Prototypical Network, that is a simple yet effective meta learning machine for few-shot image classification.
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
Low-Rank Reorganization via Proportional Hazards Non-negative Matrix Factorization Unveils Survival Associated Gene Clusters [9.773075235189525]
In this work, Cox proportional hazards regression is integrated with NMF by imposing survival constraints. Using human cancer gene expression data, the proposed technique can unravel critical clusters of cancer genes. The discovered gene clusters reflect rich biological implications and can help identify survival-related biomarkers.
arXiv Detail & Related papers (2020-08-09T17:59:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.