Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity
- URL: http://arxiv.org/abs/2405.05998v2
- Date: Tue, 28 May 2024 10:59:16 GMT
- Title: Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity
- Authors: Zhufeng Li, Sandeep S Cranganore, Nicholas Youngblut, Niki Kilbertus
- Abstract summary: We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences.
We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats.
- Score: 3.972930262155919
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Leveraging the vast genetic diversity within microbiomes offers unparalleled insights into complex phenotypes, yet the task of accurately predicting and understanding such traits from genomic data remains challenging. We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. Based on our model, we develop attribution techniques to elucidate gene interaction effects that drive microbial adaptation to diverse environments. We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats. We not only demonstrate solid predictive performance, but also how sequence-level information of entire genomes allows us to identify gene associations underlying complex phenotypes. Our attribution recovers known important interaction networks and proposes new candidates for experimental follow up.
Related papers
- Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics [0.8437187555622164]
We set up problems surrounding phenotype prediction from bacterial whole-genome datasets and extend those to learning causal effects.
We discuss challenges that impact the reliability of a machine's decision-making when faced with datasets of this nature.
arXiv Detail & Related papers (2025-02-11T18:25:14Z)
- GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.
The model adheres to the central dogma of molecular biology, accurately generating protein-coding sequences.
It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of promoter sequences.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
- Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling.
We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z)
- FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics [46.189419603576084]
FGBERT is a novel metagenomic pre-trained model that employs a protein-based gene representation as a context-aware tokenizer.
It demonstrates superior performance on metagenomic datasets at four levels, spanning gene, functional, bacterial, and environmental levels.
arXiv Detail & Related papers (2024-02-24T13:13:17Z)
- evolSOM: an R Package for evolutionary conservation analysis with SOMs [0.4972323953932129]
We introduce evolSOM, a novel R package that utilizes Self-Organizing Maps (SOMs) to explore and visualize the conservation of biological variables.
The package automatically calculates and graphically presents displacements, enabling efficient comparison and revealing conserved and displaced variables.
Illustratively, we employed evolSOM to study the displacement of genes and phenotypic traits, successfully identifying potential drivers of phenotypic differentiation in grass leaves.
arXiv Detail & Related papers (2024-02-09T20:33:48Z)
- Causal machine learning for single-cell genomics [94.28105176231739]
We discuss the application of machine learning techniques to single-cell genomics and their challenges.
We first present the model that underlies most current causal approaches to single-cell biology.
We then identify open problems in the application of causal approaches to single-cell data.
arXiv Detail & Related papers (2023-10-23T13:35:24Z)
- Graph Neural Networks for Microbial Genome Recovery [64.91162205624848]
We propose to use Graph Neural Networks (GNNs) to leverage the assembly graph when learning contig representations for metagenomic binning.
Our method, VaeG-Bin, combines variational autoencoders for learning latent representations of the individual contigs, with GNNs for refining these representations by taking into account the neighborhood structure of the contigs in the assembly graph.
arXiv Detail & Related papers (2022-04-26T12:49:51Z)
- Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT.
We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
- SimpleChrome: Encoding of Combinatorial Effects for Predicting Gene Expression [8.326669256957352]
We present SimpleChrome, a deep learning model that learns the histone modification representations of genes.
The features learned from the model allow us to better understand the latent effects of cross-gene interactions and direct gene regulation on the target gene expression.
arXiv Detail & Related papers (2020-12-15T23:30:36Z)
- A Cross-Level Information Transmission Network for Predicting Phenotype from New Genotype: Application to Cancer Precision Medicine [37.442717660492384]
We propose a novel Cross-LEvel Information Transmission network (CLEIT) framework.
Inspired by domain adaptation, CLEIT first learns the latent representation of the high-level domain, then uses it as a ground-truth embedding.
We demonstrate the effectiveness and performance boost of CLEIT in predicting anti-cancer drug sensitivity from somatic mutations.
arXiv Detail & Related papers (2020-10-09T22:01:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.