Related papers: Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity

URL: http://arxiv.org/abs/2405.05998v2
Date: Tue, 28 May 2024 10:59:16 GMT
Title: Whole Genome Transformer for Gene Interaction Effects in Microbiome Habitat Specificity
Authors: Zhufeng Li, Sandeep S Cranganore, Nicholas Youngblut, Niki Kilbertus
Abstract summary: We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats.
Score: 3.972930262155919
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Leveraging the vast genetic diversity within microbiomes offers unparalleled insights into complex phenotypes, yet the task of accurately predicting and understanding such traits from genomic data remains challenging. We propose a framework taking advantage of existing large models for gene vectorization to predict habitat specificity from entire microbial genome sequences. Based on our model, we develop attribution techniques to elucidate gene interaction effects that drive microbial adaptation to diverse environments. We train and validate our approach on a large dataset of high quality microbiome genomes from different habitats. We not only demonstrate solid predictive performance, but also how sequence-level information of entire genomes allows us to identify gene associations underlying complex phenotypes. Our attribution recovers known important interaction networks and proposes new candidates for experimental follow up.

Related papers

GRAPE: Heterogeneous Graph Representation Learning for Genetic Perturbation with Coding and Non-Coding Biotype [51.58774936662233]
Building gene regulatory networks (GRN) is essential to understand and predict the effects of genetic perturbations. In this work, we leverage pre-trained large language model and DNA sequence model to extract features from gene descriptions and DNA sequence data. We introduce gene biotype information for the first time in genetic perturbation, simulating the distinct roles of genes with different biotypes in regulating cellular processes.
arXiv Detail & Related papers (2025-05-06T03:35:24Z)
Whole-Genome Phenotype Prediction with Machine Learning: Open Problems in Bacterial Genomics [0.8437187555622164]
We set up problems surrounding phenotype prediction from bacterial whole-genome datasets and extend those to learning causal effects. We discuss challenges that impact the reliability of a machine's decision-making when faced with datasets of this nature.
arXiv Detail & Related papers (2025-02-11T18:25:14Z)
GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
G2PDiffusion: Cross-Species Genotype-to-Phenotype Prediction via Evolutionary Diffusion [108.94237816552024]
We propose the first genotype-to-phenotype diffusion model (G2PDiffusion) that generates morphological images from DNA. The model contains three novel components: 1) a MSA retrieval engine that identifies conserved and co-evolutionary patterns; 2) an environment-aware MSA conditional encoder that effectively models complex genotype-environment interactions; and 3) an adaptive phenomic alignment module to improve genotype-phenotype consistency.
arXiv Detail & Related papers (2025-02-07T06:16:31Z)
Weighted Diversified Sampling for Efficient Data-Driven Single-Cell Gene-Gene Interaction Discovery [56.622854875204645]
We present an innovative approach utilizing data-driven computational tools, leveraging an advanced Transformer model, to unearth gene-gene interactions. A novel weighted diversified sampling algorithm computes the diversity score of each data sample in just two passes of the dataset.
arXiv Detail & Related papers (2024-10-21T03:35:23Z)
Predicting Genetic Mutation from Whole Slide Images via Biomedical-Linguistic Knowledge Enhanced Multi-label Classification [119.13058298388101]
We develop a Biological-knowledge enhanced PathGenomic multi-label Transformer to improve genetic mutation prediction performances. BPGT first establishes a novel gene encoder that constructs gene priors by two carefully designed modules. BPGT then designs a label decoder that finally performs genetic mutation prediction by two tailored modules.
arXiv Detail & Related papers (2024-06-05T06:42:27Z)
Learning to Predict Mutation Effects of Protein-Protein Interactions by Microenvironment-aware Hierarchical Prompt Learning [78.38442423223832]
We develop a novel codebook pre-training task, namely masked microenvironment modeling. We demonstrate superior performance and training efficiency over state-of-the-art pre-training-based methods in mutation effect prediction.
arXiv Detail & Related papers (2024-05-16T03:53:21Z)
FGBERT: Function-Driven Pre-trained Gene Language Model for Metagenomics [35.47381119898764]
We introduce a protein-based gene representation as a context-aware and structure-relevant tokenizer. MGM and TEM-CL constitute our novel metagenomic language model FGBERT, pre-trained on 100 million metagenomic sequences.
arXiv Detail & Related papers (2024-02-24T13:13:17Z)
evolSOM: an R Package for evolutionary conservation analysis with SOMs [0.4972323953932129]
We introduce evolSOM, a novel R package that utilizes Self-Organizing Maps (SOMs) to explore and visualize the conservation of biological variables. The package automatically calculates and graphically presents displacements, enabling efficient comparison and revealing conserved and displaced variables. Illustratively, we employed evolSOM to study the displacement of genes and phenotypic traits, successfully identifying potential drivers of phenotypic differentiation in grass leaves.
arXiv Detail & Related papers (2024-02-09T20:33:48Z)
Causal machine learning for single-cell genomics [94.28105176231739]
We discuss the application of machine learning techniques to single-cell genomics and their challenges. We first present the model that underlies most of current causal approaches to single-cell biology. We then identify open problems in the application of causal approaches to single-cell data.
arXiv Detail & Related papers (2023-10-23T13:35:24Z)
Cancer-inspired Genomics Mapper Model for the Generation of Synthetic DNA Sequences with Desired Genomics Signatures [0.0]
Cancer-inspired genomics mapper model (CGMM) combines genetic algorithm (GA) and deep learning (DL) methods. We demonstrate that CGMM can generate synthetic genomes of selected phenotypes such as ancestry and cancer.
arXiv Detail & Related papers (2023-05-01T07:16:40Z)
Graph Neural Networks for Microbial Genome Recovery [64.91162205624848]
We propose to use Graph Neural Networks (GNNs) to leverage the assembly graph when learning contig representations for metagenomic binning. Our method, VaeG-Bin, combines variational autoencoders for learning latent representations of the individual contigs, with GNNs for refining these representations by taking into account the neighborhood structure of the contigs in the assembly graph.
arXiv Detail & Related papers (2022-04-26T12:49:51Z)
Multi-modal Self-supervised Pre-training for Regulatory Genome Across Cell Types [75.65676405302105]
We propose a simple yet effective approach for pre-training genome data in a multi-modal and self-supervised manner, which we call GeneBERT. We pre-train our model on the ATAC-seq dataset with 17 million genome sequences.
arXiv Detail & Related papers (2021-10-11T12:48:44Z)
SimpleChrome: Encoding of Combinatorial Effects for Predicting Gene Expression [8.326669256957352]
We present SimpleChrome, a deep learning model that learns the histone modification representations of genes. The features learned from the model allow us to better understand the latent effects of cross-gene interactions and direct gene regulation on the target gene expression.
arXiv Detail & Related papers (2020-12-15T23:30:36Z)
A Cross-Level Information Transmission Network for Predicting Phenotype from New Genotype: Application to Cancer Precision Medicine [37.442717660492384]
We propose a novel Cross-LEvel Information Transmission network (CLEIT) framework. Inspired by domain adaptation, CLEIT first learns the latent representation of high-level domain then uses it as ground-truth embedding. We demonstrate the effectiveness and performance boost of CLEIT in predicting anti-cancer drug sensitivity from somatic mutations.
arXiv Detail & Related papers (2020-10-09T22:01:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.