PlantBiMoE: A Bidirectional Foundation Model with SparseMoE for Plant Genomes
- URL: http://arxiv.org/abs/2512.07113v1
- Date: Mon, 08 Dec 2025 02:51:46 GMT
- Title: PlantBiMoE: A Bidirectional Foundation Model with SparseMoE for Plant Genomes
- Authors: Kepeng Lin, Qizhe Zhang, Rui Wang, Xuehai Hu, Wei Xu
- Abstract summary: PlantBiMoE is a lightweight and expressive plant genome language model. It integrates a bidirectional Mamba and a Sparse Mixture-of-Experts framework.
- Score: 9.805758991551043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the underlying linguistic rules of plant genomes remains a fundamental challenge in computational biology. Recent advances such as AgroNT and PDLLMs have made notable progress, although they suffer from excessive parameter size and a limited ability to model the bidirectional nature of DNA strands, respectively. To address these limitations, we propose PlantBiMoE, a lightweight and expressive plant genome language model that integrates a bidirectional Mamba and a Sparse Mixture-of-Experts (SparseMoE) framework. The bidirectional Mamba enables the model to effectively capture structural dependencies across both the forward and reverse DNA strands, while SparseMoE significantly reduces the number of active parameters, improving computational efficiency without sacrificing modeling capacity. We evaluated the model on the Modified Plants Genome Benchmark (MPGB), an enhanced genomic benchmark that consolidates 31 datasets across 11 representative tasks, with input sequence lengths ranging from 50 to 6,000 bp. Experimental results show that PlantBiMoE achieves the best performance on 20 of the 31 datasets and the best average performance among the compared models. Together, these results demonstrate that PlantBiMoE effectively represents plant genomic sequences, serving as a robust computational tool for diverse genomic tasks and contributing to plant genomics, gene editing, and synthetic biology. The code is available at: https://github.com/HUST-Keep-Lin/PlantBiMoE
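The pairing described in the abstract, a bidirectional sequence mixer feeding a sparse mixture-of-experts layer, can be sketched in a few lines of PyTorch. This is a minimal sketch under stated assumptions, not the released PlantBiMoE code: a `Conv1d` stands in for the actual Mamba block, and the class names, expert count, and top-k routing recipe are illustrative.

```python
# Minimal sketch of a bidirectional mixer + SparseMoE block (hypothetical;
# not the PlantBiMoE release). A Conv1d stands in for the Mamba SSM block.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDirectionalMixer(nn.Module):
    """Mix the sequence in both directions and sum the two views --
    one simple way to model forward and reverse DNA strands."""
    def __init__(self, dim: int):
        super().__init__()
        self.fwd = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # Mamba stand-in
        self.bwd = nn.Conv1d(dim, dim, kernel_size=3, padding=1)

    def forward(self, x):                      # x: (batch, seq, dim)
        h = x.transpose(1, 2)                  # (batch, dim, seq)
        out = self.fwd(h) + self.bwd(h.flip(-1)).flip(-1)
        return out.transpose(1, 2)

class SparseMoE(nn.Module):
    """Top-k gated mixture of expert MLPs: only k experts fire per token,
    so active parameters stay small while total capacity grows."""
    def __init__(self, dim: int, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(n_experts)
        )
        self.k = k

    def forward(self, x):                      # x: (batch, seq, dim)
        weights, idx = self.router(x).topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the k picks
        out = torch.zeros_like(x)
        for slot in range(self.k):             # dense loop: clear, not fast
            for e, expert in enumerate(self.experts):
                mask = (idx[..., slot] == e).unsqueeze(-1)
                out = out + mask * weights[..., slot:slot + 1] * expert(x)
        return out

block = nn.Sequential(BiDirectionalMixer(64), SparseMoE(64))
tokens = torch.randn(2, 100, 64)   # e.g. a 100-bp window embedded to 64 dims
print(block(tokens).shape)         # torch.Size([2, 100, 64])
```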
Related papers
- MetagenBERT: a Transformer-based Architecture using Foundational genomic Large Language Models for novel Metagenome Representation [4.470992949474734]
We present MetagenBERT, a framework that produces end-to-end metagenome embeddings directly from raw DNA sequences, without taxonomic or functional annotations. We evaluate this approach on five benchmark gut microbiome datasets (Cirrhosis, T2D, Obesity, IBD, CRC). We additionally introduce MetagenBERT Glob Mcardis, a cross-cohort variant trained on the large, phenotypically diverse MetaCardis cohort and transferred to other datasets, retaining predictive signal, including for unseen phenotypes.
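One plausible reading of "end-to-end metagenome embeddings directly from raw DNA sequences" is: embed each read with a genomic LM, then pool into a single sample vector. The sketch below is an assumed recipe, not MetagenBERT's actual pipeline; `embed_read` is a hypothetical stand-in for the per-read encoder.

```python
# Hypothetical sketch: pool per-read embeddings into one annotation-free
# metagenome embedding (embed_read stands in for a genomic LM encoder).
import numpy as np

def embed_read(read: str, dim: int = 256) -> np.ndarray:
    """Stand-in encoder: a deterministic random vector, for shapes only."""
    rng = np.random.default_rng(sum(map(ord, read)))
    return rng.standard_normal(dim)

def embed_metagenome(reads: list[str]) -> np.ndarray:
    """Mean-pool read embeddings into a single sample-level representation."""
    return np.stack([embed_read(r) for r in reads]).mean(axis=0)

sample = ["ACGTACGTGGCA", "TTGACCAGTAGC", "GGCATTACGTTA"]
print(embed_metagenome(sample).shape)   # (256,)
```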
arXiv Detail & Related papers (2026-01-05T19:36:36Z) - PanFoMa: A Lightweight Foundation Model and Benchmark for Pan-Cancer [54.958921946378304]
We introduce PanFoMa, a lightweight hybrid neural network that combines the strengths of Transformers and state-space models. PanFoMa consists of a front-end local-context encoder with shared self-attention layers to capture complex, order-independent gene interactions. We also construct a large-scale pan-cancer single-cell benchmark, PanFoMaBench, containing over 3.5 million high-quality cells.
arXiv Detail & Related papers (2025-12-02T08:31:31Z) - The Quest for Generalizable Motion Generation: Data, Model, and Evaluation [66.57596758773309]
We present a framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation. First, we introduce ViMoGen-228K, a large-scale dataset comprising 228,000 high-quality motion samples. Second, we propose ViMoGen, a flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning. Third, we present MBench, a hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability.
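"Gated multimodal conditioning" generally means a learned gate blending two conditioning streams before they reach the backbone; the sketch below is a generic version of that idea, not ViMoGen's implementation, and all names are illustrative.

```python
# Generic gated fusion of two conditioning streams (illustrative only).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, cond_a, cond_b):            # both: (batch, dim)
        g = self.gate(torch.cat([cond_a, cond_b], dim=-1))
        return g * cond_a + (1 - g) * cond_b      # learned per-feature blend

fuse = GatedFusion(32)
print(fuse(torch.randn(4, 32), torch.randn(4, 32)).shape)  # torch.Size([4, 32])
```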
arXiv Detail & Related papers (2025-10-30T17:59:27Z) - Same model, better performance: the impact of shuffling on DNA Language Models benchmarking [0.0]
Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. We show that evaluating DNA LMs is a complex task that intersects genomics' domain-specific challenges and machine learning methodologies. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency.
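The proposed fix is simple enough to show directly: shuffle once, deterministically, before writing data to disk, so training order no longer depends on the loading hardware. A minimal sketch, assuming one record per line in a text file:

```python
# Pre-shuffle records once with a fixed seed so every downstream loader,
# on any hardware, sees the same order (minimal sketch).
import random

def preshuffle(src_path: str, dst_path: str, seed: int = 42) -> None:
    with open(src_path) as f:
        records = f.readlines()
    random.Random(seed).shuffle(records)   # deterministic, hardware-independent
    with open(dst_path, "w") as f:
        f.writelines(records)

# preshuffle("sequences.txt", "sequences.shuffled.txt")
```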
arXiv Detail & Related papers (2025-10-14T15:16:56Z) - JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model [7.8918969994977575]
Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types. We introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm. JanusDNA processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU.
arXiv Detail & Related papers (2025-05-22T20:10:55Z) - Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity [0.39945675027960637]
We introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling. GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines. We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness.
arXiv Detail & Related papers (2025-04-22T20:34:47Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
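Prompt-responsive generation with a causal genomic LM follows the standard decoder pattern; the sketch below uses the Hugging Face transformers API, with the checkpoint id and seed sequence as placeholder assumptions rather than the paper's actual release.

```python
# Hedged sketch: prompting a causal DNA LM to continue an enhancer-like seed.
# The checkpoint id below is a placeholder, not a verified model name.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "your-org/generative-dna-lm"   # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "ACGTGCATTGACCA"                   # illustrative seed sequence
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0]))
```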
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - GenBench: A Benchmarking Suite for Systematic Evaluation of Genomic Foundation Models [56.63218531256961]
We introduce GenBench, a benchmarking suite specifically tailored for evaluating the efficacy of Genomic Foundation Models.
GenBench offers a modular and expandable framework that encapsulates a variety of state-of-the-art methodologies.
We provide a nuanced analysis of how model architecture and dataset characteristics interact to shape task-specific performance.
arXiv Detail & Related papers (2024-06-01T08:01:05Z) - Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling [36.37643634126816]
We study long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA.
Here, we propose an architecture motivated by these challenges that builds on the long-range Mamba block.
We use MambaDNA as the basis of Caduceus, the first family of RC-equivariant bidirectional long-range DNA language models.
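Caduceus builds RC equivariance into the architecture itself; the cheap post-hoc analogue, shown below for contrast, simply averages a model's predictions over a sequence and its reverse complement. `model` here is any hypothetical per-sequence scorer.

```python
# Post-hoc RC invariance for a scalar predictor (the cheap analogue of
# Caduceus's built-in RC equivariance).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def rc_averaged_score(model, seq: str) -> float:
    """Average over both strands: the output is identical for a sequence
    and its reverse complement."""
    return 0.5 * (model(seq) + model(reverse_complement(seq)))

gc_content = lambda s: (s.count("G") + s.count("C")) / len(s)  # toy "model"
print(rc_averaged_score(gc_content, "ACGTTTGG"))
```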
arXiv Detail & Related papers (2024-03-05T01:42:51Z) - Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes.
Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues.
Lingo further accommodates numerous downstream fine-tuning tasks via an adaptive rank sampling method.
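Prefix fine-tuning prepends a small set of trainable vectors to a frozen model's input embeddings, so only the prefix is updated downstream. The sketch below shows the generic mechanism, not Lingo's implementation; names and sizes are illustrative.

```python
# Generic prefix-tuning sketch: trainable prefix vectors prepended to the
# (frozen) backbone's token embeddings.
import torch
import torch.nn as nn

class PrefixWrapper(nn.Module):
    def __init__(self, dim: int, prefix_len: int = 8):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, dim) * 0.02)

    def forward(self, token_embeds):           # (batch, seq, dim)
        batch = token_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)  # (batch, prefix+seq, dim)

wrap = PrefixWrapper(64)
print(wrap(torch.randn(2, 10, 64)).shape)   # torch.Size([2, 18, 64])
```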
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single-nucleotide level.
On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) performance on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
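"Single nucleotide resolution" means every base is its own token, so the vocabulary is tiny; a minimal illustrative encoder (not HyenaDNA's actual tokenizer) looks like this:

```python
# Character-level DNA tokenization: one token per nucleotide (illustrative).
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}   # N = unknown base

def encode(seq: str) -> list[int]:
    return [VOCAB.get(base, VOCAB["N"]) for base in seq.upper()]

print(encode("acgTn"))   # [0, 1, 2, 3, 4]
```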
arXiv Detail & Related papers (2023-06-27T20:46:34Z) - MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks thanks to their large parameter capacity, but that capacity also incurs huge computation costs.
We explore accelerating large-model inference through conditional computation, exploiting the sparse-activation phenomenon.
We propose transforming a large model into a mixture-of-experts (MoE) version of equal model size, namely MoEfication.
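The core move, partitioning an existing FFN's hidden neurons into expert groups and computing only the groups a router selects, can be sketched as follows. The contiguous split below is a naive stand-in for the paper's grouping construction, and all names are illustrative.

```python
# Sketch of MoEfication-style splitting: slice a trained FFN's hidden units
# into groups and route each token to the top-k groups only.
import torch
import torch.nn as nn

def split_ffn(w_in: torch.Tensor, w_out: torch.Tensor, n_experts: int):
    """w_in: (hidden, dim), w_out: (dim, hidden). Naive contiguous grouping."""
    return list(zip(w_in.chunk(n_experts, dim=0), w_out.chunk(n_experts, dim=1)))

def moe_forward(x, experts, router: nn.Linear, k: int = 2):
    """x: (tokens, dim). Compute only the k highest-scoring expert groups."""
    topk = router(x).topk(k, dim=-1).indices          # (tokens, k)
    out = torch.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        mask = (topk == e).any(dim=-1, keepdim=True).float()
        out = out + mask * (torch.relu(x @ w_in.T) @ w_out.T)
    return out

dim, hidden, n_experts = 16, 64, 4
experts = split_ffn(torch.randn(hidden, dim), torch.randn(dim, hidden), n_experts)
router = nn.Linear(dim, n_experts)
print(moe_forward(torch.randn(8, dim), experts, router).shape)  # torch.Size([8, 16])
```

Because ReLU acts elementwise, summing all expert groups reproduces the dense FFN exactly; activating only the top-k groups approximates it at a fraction of the compute.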
arXiv Detail & Related papers (2021-10-05T02:14:38Z)