A Phylogenetic Approach to Genomic Language Modeling
- URL: http://arxiv.org/abs/2503.03773v1
- Date: Tue, 04 Mar 2025 06:53:03 GMT
- Title: A Phylogenetic Approach to Genomic Language Modeling
- Authors: Carlos Albors, Jianan Canal Li, Gonzalo Benegas, Chengzhong Ye, Yun S. Song
- Abstract summary: We introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees. Our approach integrates an alignment into the loss function during training but does not require it for making predictions. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Genomic language models (gLMs) have shown mostly modest success in identifying evolutionarily constrained elements in mammalian genomes. To address this issue, we introduce a novel framework for training gLMs that explicitly models nucleotide evolution on phylogenetic trees using multispecies whole-genome alignments. Our approach integrates an alignment into the loss function during training but does not require it for making predictions, thereby enhancing the model's applicability. We applied this framework to train PhyloGPN, a model that excels at predicting functionally disruptive variants from a single sequence alone and demonstrates strong transfer learning capabilities.
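As a hedged illustration of how an alignment can enter a training loss, the sketch below computes a per-site likelihood on a toy three-species tree with Felsenstein's pruning algorithm under the Jukes-Cantor substitution model; the negative log of this likelihood could serve as the training target for one aligned column. The tree topology, branch lengths, and substitution model here are illustrative assumptions, not PhyloGPN's actual parameterization.

```python
import math

NUC = "ACGT"

def jc69(t):
    """4x4 Jukes-Cantor transition-probability matrix for branch length t."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * math.exp(-4.0 * t / 3.0)
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def site_likelihood(tree, leaves):
    """Felsenstein pruning: P(observed leaf nucleotides | tree) under JC69.

    tree: nested tuples (left, right, t_left, t_right); leaves are name strings.
    leaves: dict mapping leaf name -> observed nucleotide.
    Marginalizes over ancestral states with a uniform root distribution.
    """
    def partial(node):
        if isinstance(node, str):  # leaf: one-hot conditional likelihood
            return [1.0 if NUC[i] == leaves[node] else 0.0 for i in range(4)]
        left, right, tl, tr = node
        pl, pr = partial(left), partial(right)
        Pl, Pr = jc69(tl), jc69(tr)
        # conditional likelihood of the subtree given each state at this node
        return [sum(Pl[i][j] * pl[j] for j in range(4)) *
                sum(Pr[i][j] * pr[j] for j in range(4)) for i in range(4)]
    return sum(0.25 * p for p in partial(tree))  # uniform root prior

# A toy (human, (mouse, rat)) tree; a training loss for one aligned column
# would then be the negative log likelihood of that column:
tree = ("human", ("mouse", "rat", 0.1, 0.1), 0.3, 0.2)
col = {"human": "A", "mouse": "A", "rat": "G"}
loss = -math.log(site_likelihood(tree, col))
```

In a gLM training setup, the model's per-position nucleotide predictions would replace the fixed root distribution, so the alignment is consumed only by the loss and is unnecessary at inference time.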
Related papers
- Teaching pathology foundation models to accurately predict gene expression with parameter efficient knowledge transfer [1.5416321520529301]
Parameter-Efficient Knowledge Adaptation (PEKA) is a novel framework that integrates knowledge distillation and structure alignment losses for cross-modal knowledge transfer.
We evaluated PEKA for gene expression prediction using multiple spatial transcriptomics datasets.
arXiv Detail & Related papers (2025-04-09T17:24:41Z)
- UniGenX: Unified Generation of Sequence and Structure with Autoregressive Diffusion [61.690978792873196]
Existing approaches rely on either autoregressive sequence models or diffusion models.
We propose UniGenX, a unified framework that combines autoregressive next-token prediction with conditional diffusion models.
We validate the effectiveness of UniGenX on material and small molecule generation tasks.
arXiv Detail & Related papers (2025-03-09T16:43:07Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- PhyloGen: Language Model-Enhanced Phylogenetic Inference via Graph Structure Generation [50.80441546742053]
Phylogenetic trees elucidate evolutionary relationships among species. Traditional Markov Chain Monte Carlo methods face slow convergence and computational burdens. We propose PhyloGen, a novel method leveraging a pre-trained genomic language model.
arXiv Detail & Related papers (2024-12-25T08:33:05Z)
- Long-range gene expression prediction with token alignment of large language model [37.10820914895689]
We introduce Genetic sequence Token Alignment (GTA), which aligns genetic sequence features with natural language tokens.
GTA learns the regulatory grammar and allows us to further incorporate gene-specific human annotations as prompts.
GTA represents a powerful and novel cross-modal approach to gene expression prediction by utilizing a pretrained language model.
arXiv Detail & Related papers (2024-10-02T02:42:29Z)
- VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
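As a rough sketch of the idea (not VQDNA's actual implementation), vector quantization discretizes continuous embeddings by assigning each one to the index of its nearest codebook entry; the codebook and feature vectors below are toy assumptions.

```python
import math

def vq_tokenize(embeddings, codebook):
    """Map each embedding to the id of its nearest codebook vector (L2 distance).

    embeddings: list of feature vectors (e.g., per-window genome features).
    codebook: learned vocabulary of K vectors; fixed here for illustration.
    Returns the discrete token-id sequence ("pattern-aware" codes).
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return [min(range(len(codebook)), key=lambda k: dist(e, codebook[k]))
            for e in embeddings]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]            # K = 3 toy codes
tokens = vq_tokenize([[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]], codebook)
# tokens == [0, 1, 2]
```

During training the codebook itself would be learned jointly with the encoder (e.g., via a commitment loss), which is what makes the vocabulary adaptive rather than a fixed k-mer table.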
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
- Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present LINGO: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, LINGO strategically leverages natural language foundation models' contextual cues.
LINGO further accommodates numerous downstream fine-tuning tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
- Unsupervised language models for disease variant prediction [3.6942566104432886]
We find that a single protein LM trained on broad sequence datasets can score pathogenicity for any gene variant zero-shot.
We show that it achieves scoring performance comparable to the state of the art when evaluated on clinically labeled variants of disease-related genes.
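A common zero-shot scoring recipe (a sketch of the general approach, not necessarily this paper's exact formula) is the log-likelihood ratio between the alternate and reference residue under the language model; the per-residue probabilities below are invented toy numbers, not real model outputs.

```python
import math

def llr_score(p_ref, p_alt):
    """Zero-shot variant score: log P(alt) - log P(ref) under the LM.

    p_ref, p_alt: the model's probabilities for the reference and alternate
    residue at the mutated position. More negative = the substitution is
    more "surprising" to the model = more likely to be damaging.
    """
    return math.log(p_alt) - math.log(p_ref)

# Hypothetical probabilities at one position from a protein LM (toy numbers):
probs = {"L": 0.62, "M": 0.21, "W": 0.003}
benign = llr_score(probs["L"], probs["M"])    # conservative substitution
damaging = llr_score(probs["L"], probs["W"])  # rare, disruptive substitution
```

Because the score needs only the model's output distribution at the mutated position, no labeled pathogenicity data is required, which is what makes the evaluation zero-shot.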
arXiv Detail & Related papers (2022-12-07T22:28:13Z)
- EvoVGM: A Deep Variational Generative Model for Evolutionary Parameter Estimation [0.0]
We propose a deep variational Bayesian generative model that jointly approximates the true posterior of local biological evolutionary parameters.
We show the consistency and effectiveness of the method on synthetic sequence alignments simulated with several evolutionary scenarios and on a real virus sequence alignment.
arXiv Detail & Related papers (2022-05-25T20:08:10Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.