Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram
- URL: http://arxiv.org/abs/2601.22203v1
- Date: Thu, 29 Jan 2026 17:43:10 GMT
- Title: Beyond Conditional Computation: Retrieval-Augmented Genomic Foundation Models with Gengram
- Authors: Huinan Xu, Xuyang Feng, Junhong Chen, Junchen Liu, Kaiwen Deng, Kai Ding, Shengning Long, Jiaxue Shuai, Zhaorong Li, Shiping Liu, Guirong Xue, Zhan Xiao,
- Abstract summary: Gengram is a conditional memory module that introduces an explicit and highly efficient lookup primitive for multi-base motifs.<n>It is integrated into the backbone of state-of-the-art genomic foundation models.<n>By establishing structured motif memory as a modeling primitive, Gengram simultaneously boosts empirical performance and mechanistic interpretability.
- Score: 7.122805911264195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current genomic foundation models (GFMs) rely on extensive neural computation to implicitly approximate conserved biological motifs from single-nucleotide inputs. We propose Gengram, a conditional memory module that introduces an explicit and highly efficient lookup primitive for multi-base motifs via a genomic-specific hashing scheme, establishing genomic "syntax". Integrated into the backbone of state-of-the-art GFMs, Gengram achieves substantial gains (up to 14%) across several functional genomics tasks. The module demonstrates robust architectural generalization, while further inspection of Gengram's latent space reveals the emergence of meaningful representations that align closely with fundamental biological knowledge. By establishing structured motif memory as a modeling primitive, Gengram simultaneously boosts empirical performance and mechanistic interpretability, providing a scalable and biology-aligned pathway for the next generation of GFMs. The code is available at https://github.com/zhejianglab/Genos, and the model checkpoint is available at https://huggingface.co/ZhejiangLab/Gengram.
Related papers
- Gene regulatory network inference algorithm based on spectral signed directed graph convolution [11.166270329149205]
We propose MSGRNLink, a novel framework that explicitly models GRNs as signed directed graphs and employs magnetic signed Laplacian convolution.<n>In a bladder cancer case study, MSGRNLink predicted more known edges and edge signs than benchmark models, further validating its biological relevance.
arXiv Detail & Related papers (2025-12-12T00:54:53Z) - GraphTreeGen: Subtree-Centric Approach to Efficient and Supervised Graph Generation [6.138671548064356]
GraphTreeGen (GTG) is a subtree-centric generative framework for efficient, accurate connectome synthesis.<n>GTG decomposes each connectome into entropy-guided k-hop trees capturing informative local structure.<n>Its modular design enables extensions to connectome super-resolution and cross-modality synthesis.
arXiv Detail & Related papers (2025-08-13T11:02:38Z) - Scalable Graph Generative Modeling via Substructure Sequences [50.32639806800683]
We introduce Generative Graph Pattern Machine (G$2$PM), a generative Transformer pre-training framework for graphs.<n>G$2$PM represents graph instances (nodes, edges, or entire graphs) as sequences of substructures.<n>It employs generative pre-training over the sequences to learn generalizable and transferable representations.
arXiv Detail & Related papers (2025-05-22T02:16:34Z) - OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking [21.177773831820673]
Genomic Foundation Models (GFMs) have emerged as a transformative approach to decoding the genome.<n>As GFMs scale up and reshape the landscape of AI-driven genomics, the field faces an urgent need for rigorous and reproducible evaluation.<n>We present OmniGenBench, a modular benchmarking platform designed to unify the data, model, benchmarking, and interpretability layers across GFMs.
arXiv Detail & Related papers (2025-05-20T14:16:25Z) - DNAZEN: Enhanced Gene Sequence Representations via Mixed Granularities of Coding Units [18.113659670915474]
Genome modeling conventionally treats gene sequence as a language, reflecting its structured motifs and long-range dependencies analogous to linguistic units and organization principles.<n>We propose DNAZEN, an enhanced genomic representation framework designed to learn from various granularities in gene sequences.<n>A Transformer-based G-gram encoder is also proposed and the matched G-grams are fed into it to compute their representations and integrated into the encoder for basic unit.
arXiv Detail & Related papers (2025-05-04T18:02:28Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters.<n>Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks.<n>It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Beyond Message Passing: Neural Graph Pattern Machine [50.78679002846741]
We introduce the Neural Graph Pattern Machine (GPM), a novel framework that bypasses message passing by learning directly from graph substructures.<n>GPM efficiently extracts, encodes, and prioritizes task-relevant graph patterns, offering greater expressivity and improved ability to capture long-range dependencies.
arXiv Detail & Related papers (2025-01-30T20:37:47Z) - Genomic Interpreter: A Hierarchical Genomic Deep Neural Network with 1D
Shifted Window Transformer [4.059849656394191]
Genomic Interpreter is a novel architecture for genomic assay prediction.
Model can identify hierarchical dependencies in genomic sites.
Evaluated on a dataset containing 38,171 DNA segments of 17K pairs.
arXiv Detail & Related papers (2023-06-08T12:10:13Z) - Dist2Cycle: A Simplicial Neural Network for Homology Localization [66.15805004725809]
Simplicial complexes can be viewed as high dimensional generalizations of graphs that explicitly encode multi-way ordered relations.
We propose a graph convolutional model for learning functions parametrized by the $k$-homological features of simplicial complexes.
arXiv Detail & Related papers (2021-10-28T14:59:41Z) - Infinitely Wide Graph Convolutional Networks: Semi-supervised Learning
via Gaussian Processes [144.6048446370369]
Graph convolutional neural networks(GCNs) have recently demonstrated promising results on graph-based semi-supervised classification.
We propose a GP regression model via GCNs(GPGC) for graph-based semi-supervised learning.
We conduct extensive experiments to evaluate GPGC and demonstrate that it outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2020-02-26T10:02:32Z) - Embedding Graph Auto-Encoder for Graph Clustering [90.8576971748142]
Graph auto-encoder (GAE) models are based on semi-supervised graph convolution networks (GCN)
We design a specific GAE-based model for graph clustering to be consistent with the theory, namely Embedding Graph Auto-Encoder (EGAE)
EGAE consists of one encoder and dual decoders.
arXiv Detail & Related papers (2020-02-20T09:53:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.