Related papers: A Matrix Variational Auto-Encoder for Variant Effect Prediction in Pharmacogenes

A Matrix Variational Auto-Encoder for Variant Effect Prediction in Pharmacogenes

URL: http://arxiv.org/abs/2507.02624v1
Date: Thu, 03 Jul 2025 13:50:18 GMT
Title: A Matrix Variational Auto-Encoder for Variant Effect Prediction in Pharmacogenes
Authors: Antoine Honoré, Borja Rodríguez Gálvez, Yoomi Park, Yitian Zhou, Volker M. Lauschke, Ming Xiao,
Abstract summary: We propose a transformer-based matrix variational auto-encoder (matVAE) with a structured prior.<n>We evaluate its performance on 33 Deep mutational scanning (DMS) datasets corresponding to 26 drug target and ADME proteins.<n>Our model trained on MSAs (matVAE-MSA) outperforms the state-of-the-art DeepSequence model in zero-shot prediction on DMS datasets.
Score: 9.316811342279955
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Variant effect predictors (VEPs) aim to assess the functional impact of protein variants, traditionally relying on multiple sequence alignments (MSAs). This approach assumes that naturally occurring variants are fit, an assumption challenged by pharmacogenomics, where some pharmacogenes experience low evolutionary pressure. Deep mutational scanning (DMS) datasets provide an alternative by offering quantitative fitness scores for variants. In this work, we propose a transformer-based matrix variational auto-encoder (matVAE) with a structured prior and evaluate its performance on 33 DMS datasets corresponding to 26 drug target and ADME proteins from the ProteinGym benchmark. Our model trained on MSAs (matVAE-MSA) outperforms the state-of-the-art DeepSequence model in zero-shot prediction on DMS datasets, despite using an order of magnitude fewer parameters and requiring less computation at inference time. We also compare matVAE-MSA to matENC-DMS, a model of similar capacity trained on DMS data, and find that the latter performs better on supervised prediction tasks. Additionally, incorporating AlphaFold-generated structures into our transformer model further improves performance, achieving results comparable to DeepSequence trained on MSAs and finetuned on DMS. These findings highlight the potential of DMS datasets to replace MSAs without significant loss in predictive performance, motivating further development of DMS datasets and exploration of their relationships to enhance variant effect prediction.

Related papers

PLAME: Leveraging Pretrained Language Models to Generate Enhanced Protein Multiple Sequence Alignments [53.55710514466851]
Protein structure prediction is essential for drug discovery and understanding biological functions.<n>Most folding models rely heavily on multiple sequence alignments (MSAs) to boost prediction performance.<n>We propose PLAME, a novel MSA design model that leverages evolutionary embeddings from pretrained protein language models.
arXiv Detail & Related papers (2025-06-17T04:11:30Z)
MaskSDM with Shapley values to improve flexibility, robustness, and explainability in species distribution modeling [3.428447509258587]
Species Distribution Models (SDMs) play a vital role in biodiversity research, conservation planning, and ecological niche modeling.<n>We introduce MaskSDM, a novel deep learning-based SDM that enables flexible predictor selection by employing a masked training strategy.<n>We evaluate MaskSDM on the global sPlotOpen dataset, modeling the distributions of 12,738 plant species.
arXiv Detail & Related papers (2025-03-17T11:02:28Z)
Generative Principal Component Regression via Variational Inference [2.4415762506639944]
One approach to designing appropriate manipulations is to target key features of predictive models. We develop a novel objective based on supervised variational autoencoders (SVAEs) that enforces such information is represented in the latent space. We show in simulations that gPCR dramatically improves target selection in manipulation as compared to standard PCR and SVAEs.
arXiv Detail & Related papers (2024-09-03T22:38:55Z)
Geodesic Optimization for Predictive Shift Adaptation on EEG data [53.58711912565724]
Domain adaptation methods struggle when distribution shifts occur simultaneously in $X$ and $y$. This paper proposes a novel method termed Geodesic Optimization for Predictive Shift Adaptation (GOPSA) to address test-time multi-source DA. GOPSA has the potential to combine the advantages of mixed-effects modeling with machine learning for biomedical applications of EEG.
arXiv Detail & Related papers (2024-07-04T12:15:42Z)
Fine-tuning Protein Language Models with Deep Mutational Scanning improves Variant Effect Prediction [3.2358123775807575]
Protein Language Models (PLMs) have emerged as performant and scalable tools for predicting the functional impact and clinical significance of protein-coding variants. We present a novel fine-tuning approach to improve the performance of PLMs with experimental maps of variant effects from Deep Mutational Scanning (DMS) These findings demonstrate that DMS is a promising source of sequence diversity and supervised training data for improving the performance of PLMs for variant effect prediction.
arXiv Detail & Related papers (2024-05-10T14:50:40Z)
AI enhanced data assimilation and uncertainty quantification applied to Geological Carbon Storage [0.0]
We introduce the Surrogate-based hybrid ESMDA (SH-ESMDA), an adaptation of the traditional Ensemble Smoother with Multiple Data Assimilation (ESMDA) We also introduce Surrogate-based Hybrid RML (SH-RML), a variational data assimilation approach that relies on the randomized maximum likelihood (RML) Our comparative analyses show that SH-RML offers better uncertainty compared to conventional ESMDA for the case study.
arXiv Detail & Related papers (2024-02-09T00:24:46Z)
Weakly supervised covariance matrices alignment through Stiefel matrices estimation for MEG applications [64.20396555814513]
This paper introduces a novel domain adaptation technique for time series data, called Mixing model Stiefel Adaptation (MSA) We exploit abundant unlabeled data in the target domain to ensure effective prediction by establishing pairwise correspondence with equivalent signal variances between domains. MSA outperforms recent methods in brain-age regression with task variations using magnetoencephalography (MEG) signals from the Cam-CAN dataset.
arXiv Detail & Related papers (2024-01-24T19:04:49Z)
The effect of data augmentation and 3D-CNN depth on Alzheimer's Disease detection [51.697248252191265]
This work summarizes and strictly observes best practices regarding data handling, experimental design, and model evaluation. We focus on Alzheimer's Disease (AD) detection, which serves as a paradigmatic example of challenging problem in healthcare. Within this framework, we train predictive 15 models, considering three different data augmentation strategies and five distinct 3D CNN architectures.
arXiv Detail & Related papers (2023-09-13T10:40:41Z)
ArSDM: Colonoscopy Images Synthesis with Adaptive Refinement Semantic Diffusion Models [69.9178140563928]
Colonoscopy analysis is essential for assisting clinical diagnosis and treatment. The scarcity of annotated data limits the effectiveness and generalization of existing methods. We propose an Adaptive Refinement Semantic Diffusion Model (ArSDM) to generate colonoscopy images that benefit the downstream tasks.
arXiv Detail & Related papers (2023-09-03T07:55:46Z)
Convolutional Monge Mapping Normalization for learning on sleep data [63.22081662149488]
We propose a new method called Convolutional Monge Mapping Normalization (CMMN) CMMN consists in filtering the signals in order to adapt their power spectrum density (PSD) to a Wasserstein barycenter estimated on training data. Numerical experiments on sleep EEG data show that CMMN leads to significant and consistent performance gains independent from the neural network architecture.
arXiv Detail & Related papers (2023-05-30T08:24:01Z)
Synthetic Health-related Longitudinal Data with Mixed-type Variables Generated using Diffusion Models [2.140861702387444]
This paper presents a novel approach to simulating electronic health records using diffusion probabilistic models (DPMs) We demonstrate the effectiveness of DPMs in synthesising longitudinal EHRs that capture mixed-type variables, including numeric, binary, and categorical variables.
arXiv Detail & Related papers (2023-03-22T03:15:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.