EnTao-GPM: DNA Foundation Model for Predicting the Germline Pathogenic Mutations
- URL: http://arxiv.org/abs/2507.21706v1
- Date: Tue, 29 Jul 2025 11:34:41 GMT
- Title: EnTao-GPM: DNA Foundation Model for Predicting the Germline Pathogenic Mutations
- Authors: Zekai Lin, Haoran Sun, Yucheng Guo, Yujie Yang, Yanwen Wang, Bozhen Hu, Chonghang Ye, Qirong Yang, Fan Zhong, Xiaoming Zhang, Lei Liu
- Abstract summary: Cross-species targeted pre-training on disease-relevant mammalian genomes (human, pig, mouse). Germline mutation specialization via fine-tuning on ClinVar and HGMD. Interpretable clinical framework integrating DNA sequence embeddings with LLM-based statistical explanations.
- Score: 16.32431932781823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distinguishing pathogenic mutations from benign polymorphisms remains a critical challenge in precision medicine. EnTao-GPM, developed by Fudan University and BioMap, addresses this through three innovations: (1) Cross-species targeted pre-training on disease-relevant mammalian genomes (human, pig, mouse), leveraging evolutionary conservation to enhance interpretation of pathogenic motifs, particularly in non-coding regions; (2) Germline mutation specialization via fine-tuning on ClinVar and HGMD, improving accuracy for both SNVs and non-SNVs; (3) Interpretable clinical framework integrating DNA sequence embeddings with LLM-based statistical explanations to provide actionable insights. Validated against ClinVar, EnTao-GPM demonstrates superior accuracy in mutation classification. It revolutionizes genetic testing by enabling faster, more accurate, and accessible interpretation for clinical diagnostics (e.g., variant assessment, risk identification, personalized treatment) and research, advancing personalized medicine.
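The abstract distinguishes SNVs from non-SNVs (such as insertions and deletions). As a minimal sketch of what that distinction means at the sequence level, the Python below builds the alternate sequence for a variant inside a short reference window. This is not the authors' pipeline: the helper names `apply_variant` and `is_snv`, the toy window, and the 0-based offset convention are all assumptions made for illustration.

```python
def apply_variant(window: str, offset: int, ref: str, alt: str) -> str:
    """Return `window` with the `ref` allele at 0-based `offset` replaced by `alt`.

    Hypothetical helper for illustration; not part of EnTao-GPM.
    """
    if window[offset:offset + len(ref)] != ref:
        raise ValueError("reference allele does not match the window")
    return window[:offset] + alt + window[offset + len(ref):]


def is_snv(ref: str, alt: str) -> bool:
    """A single-nucleotide variant swaps exactly one base for a different one."""
    return len(ref) == 1 and len(alt) == 1 and ref != alt


window = "ACGTACGTAC"  # toy reference window, not real genomic data

# SNV: T>G at offset 3 keeps the window length unchanged.
snv_seq = apply_variant(window, 3, "T", "G")

# Non-SNV: deletion TAC>T at offset 3 shortens the window by two bases.
del_seq = apply_variant(window, 3, "TAC", "T")

print(snv_seq, is_snv("T", "G"))
print(del_seq, is_snv("TAC", "T"))
```

A sequence model that scores variants would typically receive the reference and alternate sequences as a pair; non-SNVs change the sequence length, which is one reason they tend to be harder to handle than single-base substitutions.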
Related papers
- PathGene: Benchmarking Driver Gene Mutations and Exon Prediction Using Multicenter Lung Cancer Histopathology Image Dataset [3.716599571611912]
Accurately predicting gene mutations, mutation subtypes and their exons in lung cancer is critical for personalized treatment planning and prognostic assessment. We have assembled PathGene, which comprises histopathology images paired with next-generation sequencing reports. This multi-center dataset links whole-slide images to driver gene mutation status, mutation subtypes, exon, and tumor mutational burden (TMB) status.
arXiv Detail & Related papers (2025-05-30T11:51:11Z) - A Simple yet Effective DDG Predictor is An Unsupervised Antibody Optimizer and Explainer [53.85265022754878]
We propose a lightweight DDG predictor (Light-DDG) for fast mutation screening. We also release a large-scale dataset containing millions of mutation data for pre-training Light-DDG. For the target antibody, we propose a novel Mutation Explainer to learn mutation preferences.
arXiv Detail & Related papers (2025-02-10T09:26:57Z) - Survey and Improvement Strategies for Gene Prioritization with Large Language Models [61.24568051916653]
Large language models (LLMs) have performed well in medical exams, but their effectiveness in diagnosing rare genetic diseases has not been assessed. We used multi-agent and Human Phenotype Ontology (HPO) classification to categorize patients based on phenotypes and solvability levels. At baseline, GPT-4 outperformed other LLMs, achieving near 30% accuracy in ranking causal genes correctly.
arXiv Detail & Related papers (2025-01-30T23:03:03Z) - Multivariate Feature Selection and Autoencoder Embeddings of Ovarian Cancer Clinical and Genetic Data [2.973561339858947]
This study explores a data-driven approach to discovering novel clinical and genetic markers in ovarian cancer (OC). In the autoencoder analysis, a clearer pattern emerged when using clinical features and the combination of clinical and genetic data. Key clinical variables (such as type of surgery and neoadjuvant chemotherapy) and certain gene mutations showed strong relevance, along with low-risk genetic factors.
arXiv Detail & Related papers (2025-01-27T09:07:07Z) - Integrating Large Language Models for Genetic Variant Classification [12.244115429231888]
Large Language Models (LLMs) have emerged as transformative tools in genetics.
This study investigates the integration of state-of-the-art LLMs, including GPN-MSA, ESM1b, and AlphaMissense.
Our approach evaluates these integrated models using the well-annotated ProteinGym and ClinVar datasets.
arXiv Detail & Related papers (2024-11-07T13:45:56Z) - ALPHAGMUT: A Rationale-Guided Alpha Shape Graph Neural Network to Evaluate Mutation Effects [8.331322657310292]
In this study, we introduce a novel rationale-guided graph neural network AlphaGMut to evaluate mutation effects.
We compute structural-, topological-, biophysical-, and sequence properties of the mutation sites, which are assigned as node attributes in the graph.
We demonstrate that AlphaGMut outperforms state-of-the-art methods, including DeepMind's AlphaMissense, in many performance metrics.
arXiv Detail & Related papers (2024-06-13T14:22:12Z) - Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances.
BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules.
BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - PhyloTransformer: A Discriminative Model for Mutation Prediction Based on a Multi-head Self-attention Mechanism [10.468453827172477]
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused an ongoing pandemic infecting 219 million people as of 10/19/21, with a 3.6% mortality rate.
Here we developed PhyloTransformer, a Transformer-based discriminative model that engages a multi-head self-attention mechanism to model genetic mutations that may lead to viral reproductive advantage.
arXiv Detail & Related papers (2021-11-03T01:30:57Z) - A k-mer Based Approach for SARS-CoV-2 Variant Identification [55.78588835407174]
We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance.
We also show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
arXiv Detail & Related papers (2021-08-07T15:08:15Z)