MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
- URL: http://arxiv.org/abs/2511.14806v1
- Date: Mon, 17 Nov 2025 19:27:41 GMT
- Title: MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging
- Authors: Siyuan Li, Kai Yu, Anna Wang, Zicheng Liu, Chang Yu, Jingbo Zhou, Qirong Yang, Yucheng Guo, Xiaoming Zhang, Stan Z. Li
- Abstract summary: This paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks with fine-tuning or zero-shot evaluation.
- Score: 65.07273789940116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modeling genomic sequences faces two unsolved challenges: the information density varies widely across different regions, while there is no clearly defined minimum vocabulary unit. Relying on either four primitive bases or independently designed DNA tokenizers, existing approaches with naive masked language modeling pre-training often fail to adapt to the varying complexities of genomic sequences. Leveraging Token Merging techniques, this paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. As for network structures, the tokenization module automatically chunks adjacent bases into words by stacking multiple layers of the differentiable token merging blocks with local-window constraints, then a Latent Encoder captures the global context of these merged words by full-attention blocks. Symmetrically employing a Latent Decoder and a Local Decoder, MergeDNA learns with two pre-training tasks: Merged Token Reconstruction simultaneously trains the dynamic tokenization module and adaptively filters important tokens, while Adaptive Masked Token Modeling learns to predict these filtered tokens to capture informative contents. Extensive experiments show that MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks with fine-tuning or zero-shot evaluation, outperforming typical tokenization methods and large-scale DNA foundation models.
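The dynamic tokenizer described in the abstract chunks adjacent bases into words by stacking token-merging blocks with local-window constraints. The sketch below is only a rough illustration of local-window merging with hard (non-differentiable) merges and hypothetical names; the paper's module uses stacked, differentiable merging blocks trained jointly with the latent Transformers.

```python
import torch
import torch.nn.functional as F

def merge_one_pair_per_window(x: torch.Tensor, window: int = 8) -> torch.Tensor:
    """Toy local-window token merging (illustrative, not the authors' code):
    inside each window, average the single most similar pair of adjacent
    tokens, shrinking the sequence by one token per window."""
    merged = []
    for start in range(0, x.size(0), window):
        w = x[start:start + window]                       # (<=window, dim)
        if w.size(0) < 2:
            merged.append(w)
            continue
        sim = F.cosine_similarity(w[:-1], w[1:], dim=-1)  # adjacent-pair similarity
        i = int(sim.argmax())                             # most redundant adjacent pair
        pair_mean = w[i:i + 2].mean(dim=0, keepdim=True)  # merge by averaging
        merged.append(torch.cat([w[:i], pair_mean, w[i + 2:]], dim=0))
    return torch.cat(merged, dim=0)

# Toy usage: 32 base embeddings of dimension 16, windows of 8 bases.
base_embeddings = torch.randn(32, 16)
words = merge_one_pair_per_window(base_embeddings, window=8)
print(words.shape)  # torch.Size([28, 16]) -- one fewer token per window
```

In the paper's full pipeline, the merged words are then passed to a Latent Encoder with full attention, and the Latent and Local Decoders reconstruct the original bases for the two pre-training tasks.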
Related papers
- Adaptive Protein Tokenization [0.0]
Existing protein structure tokenizers create tokens by pooling information from local neighborhoods. We present a method for global tokenization of protein structures in which successive tokens contribute increasing levels of detail to a global representation. We show how adaptive tokens enable inference criteria based on information content, which boosts designability.
arXiv Detail & Related papers (2026-02-06T06:15:14Z)
- Flow Autoencoders are Effective Protein Tokenizers [0.0]
We present Kanzi, a flow-based tokenizer for tokenization and generation of protein structures. Kanzi consists of a diffusion autoencoder trained with a flow matching loss (a generic sketch of such a loss follows below). We find that these changes stabilize the training of parameter-efficient models that outperform existing tokenizers.
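For context, a flow matching loss regresses the velocity of a simple path between noise and data. The snippet below is a generic, hypothetical illustration with made-up tensor shapes and names; it is not Kanzi's architecture or objective.

```python
import torch
import torch.nn as nn

# Generic flow-matching loss sketch (illustrative only): interpolate between
# noise x0 and data x1, then regress the constant path velocity x1 - x0.
velocity_net = nn.Sequential(nn.Linear(16 + 1, 64), nn.SiLU(), nn.Linear(64, 16))

def flow_matching_loss(x1: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.size(0), 1)                    # uniform time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1                     # point on the linear path
    target_velocity = x1 - x0                        # ground-truth velocity
    pred = velocity_net(torch.cat([xt, t], dim=-1))  # predict velocity given (xt, t)
    return ((pred - target_velocity) ** 2).mean()

loss = flow_matching_loss(torch.randn(8, 16))        # toy batch of 8 vectors
loss.backward()                                      # gradients reach velocity_net
```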
arXiv Detail & Related papers (2025-09-30T23:29:39Z)
- LayerCake: Token-Aware Contrastive Decoding within Large Language Model Layers [53.43862310647276]
Large language models (LLMs) excel at natural language understanding and generation but remain vulnerable to factual errors. We introduce a token-aware, layer-localized contrastive decoding method that aligns specific token types with their most influential transformer layers to improve factual generation (a generic contrastive-decoding sketch follows below). Our method requires no additional training or model modification, and experiments demonstrate that it consistently improves factuality across multiple LLMs and various benchmarks.
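For background, layer-contrastive decoding generally boosts tokens that a late layer prefers over an earlier layer. The snippet below is a generic illustration on made-up logits; it does not implement LayerCake's token-aware layer selection.

```python
import torch

def contrastive_next_token(late_logits: torch.Tensor,
                           early_logits: torch.Tensor,
                           alpha: float = 1.0) -> int:
    """Generic layer-contrastive decoding sketch (not LayerCake): amplify what
    the late layer knows relative to the early layer, then pick the argmax."""
    late_lp = late_logits.log_softmax(-1)
    early_lp = early_logits.log_softmax(-1)
    scores = late_lp + alpha * (late_lp - early_lp)   # contrast late vs. early layer
    return int(scores.argmax())

# Toy usage on random logits over a 100-token vocabulary.
late, early = torch.randn(100), torch.randn(100)
print(contrastive_next_token(late, early, alpha=0.5))
```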
arXiv Detail & Related papers (2025-07-06T14:35:43Z)
- Training-Free Tokenizer Transplantation via Orthogonal Matching Pursuit [45.18582668677648]
We present a training-free method to transplant tokenizers in large language models. We approximate each out-of-vocabulary token as a sparse linear combination of shared tokens (see the toy sketch below). We show that OMP achieves the best zero-shot preservation of the base model's performance.
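The core idea, writing a new token's embedding as a sparse combination of embeddings shared by both vocabularies, can be sketched with a greedy orthogonal matching pursuit loop. The code below is an illustration with hypothetical names and does not reproduce the paper's exact procedure.

```python
import numpy as np

def omp_sparse_coeffs(target: np.ndarray, dictionary: np.ndarray, k: int = 8) -> np.ndarray:
    """Greedy OMP sketch: approximate `target` (d,) with at most k rows of
    `dictionary` (n, d). Illustrative helper, not the paper's implementation."""
    selected = np.zeros(dictionary.shape[0], dtype=bool)
    residual = target.copy()
    for _ in range(k):
        scores = np.abs(dictionary @ residual)      # correlation with residual
        scores[selected] = -np.inf                  # never reselect an atom
        selected[int(scores.argmax())] = True
        basis = dictionary[selected].T              # (d, n_selected)
        weights, *_ = np.linalg.lstsq(basis, target, rcond=None)
        residual = target - basis @ weights         # refit and update residual
    coeffs = np.zeros(dictionary.shape[0])
    coeffs[selected] = weights
    return coeffs

# Toy usage: approximate a "new" token embedding from 1000 shared 64-d embeddings.
shared_embeddings = np.random.randn(1000, 64)
new_token_embedding = np.random.randn(64)
coeffs = omp_sparse_coeffs(new_token_embedding, shared_embeddings, k=8)
print(int((coeffs != 0).sum()))                     # at most 8 nonzero weights
```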
arXiv Detail & Related papers (2025-06-07T00:51:27Z)
- Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision [26.107996342704915]
This paper presents the Ensemble Nucleotide Byte-level Encoder-Decoder (ENBED) foundation model, which analyzes DNA sequences at byte-level precision with an encoder-decoder Transformer architecture.
We use masked language modeling to pre-train the foundation model on reference genome sequences and apply it to downstream tasks (a minimal byte-level masking sketch follows below). In each of these tasks, we demonstrate significant improvements compared with existing state-of-the-art results.
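For intuition, byte-level modeling treats every nucleotide character as its own token, so masking operates directly on characters. The snippet below is a generic masking sketch with made-up names, not ENBED's actual data pipeline.

```python
import random

def mask_bytes(sequence: str, mask_rate: float = 0.15, mask_char: str = "#"):
    """Generic byte-level MLM masking sketch (illustrative, not ENBED's code):
    hide roughly mask_rate of the characters and record the hidden targets."""
    chars = list(sequence)
    targets = {}                                   # masked position -> original base
    for i, base in enumerate(chars):
        if random.random() < mask_rate:
            targets[i] = base
            chars[i] = mask_char
    return "".join(chars), targets

masked, targets = mask_bytes("ACGTACGTTTACGGAT", mask_rate=0.3)
print(masked)    # original bases replaced by '#' at the masked positions
print(targets)   # mapping from each masked position to its original base
```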
arXiv Detail & Related papers (2023-11-04T06:00:56Z)
- TimeMAE: Self-Supervised Representations of Time Series with Decoupled Masked Autoencoders [55.00904795497786]
We propose TimeMAE, a novel self-supervised paradigm for learning transferable time series representations based on transformer networks. TimeMAE learns enriched contextual representations of time series with a bidirectional encoding scheme.
To solve the discrepancy issue incurred by newly injected masked embeddings, we design a decoupled autoencoder architecture.
arXiv Detail & Related papers (2023-03-01T08:33:16Z)
- Pre-training Co-evolutionary Protein Representation via A Pairwise Masked Language Model [93.9943278892735]
A key problem in protein sequence representation learning is capturing the co-evolutionary information reflected by inter-residue co-variation in the sequences. We propose a novel method to capture this information directly by pre-training via a dedicated language model, the Pairwise Masked Language Model (PMLM); a toy pairwise-masking sketch follows below. Our results show that the proposed method effectively captures inter-residue correlations and improves contact prediction performance by up to 9% compared to the baseline.
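As a rough intuition, a pairwise masked language model hides two positions at once and treats the pair of residues as the joint target, so the training signal reflects residue co-variation. The snippet below only shows how such joint-mask examples could be built, with hypothetical names; it does not reproduce PMLM's actual objective.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def make_pairwise_masked_example(sequence: str, mask_char: str = "_"):
    """Toy pairwise-masking sketch (not PMLM's implementation): hide two
    positions together; the target is the residue *pair*, so a model trained
    on it must score pairs jointly rather than each position independently."""
    i, j = random.sample(range(len(sequence)), 2)
    masked = list(sequence)
    masked[i], masked[j] = mask_char, mask_char
    target_pair = (sequence[i], sequence[j])        # joint label for positions (i, j)
    return "".join(masked), (i, j), target_pair

# Toy usage on a random 12-residue sequence.
seq = "".join(random.choice(AMINO_ACIDS) for _ in range(12))
print(make_pairwise_masked_example(seq))
```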
arXiv Detail & Related papers (2021-10-29T04:01:32Z)
- Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs with a strong auto-regressive decoder tend to ignore the latent variables.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)