Related papers: Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods

URL: http://arxiv.org/abs/2507.18570v1
Date: Thu, 24 Jul 2025 16:45:23 GMT
Title: Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods
Authors: Ganesh Sapkota, Md Hasibur Rahman,
Abstract summary: Traditional k-mer tokenization is effective at capturing local DNA sequence structures but often faces challenges. We propose merging unique 6-mer tokens with selected BPE tokens generated through 600 BPE cycles. This hybrid approach ensures a balanced and context-aware vocabulary, enabling the model to capture both short and long patterns.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper presents a novel hybrid tokenization strategy that enhances the performance of DNA Language Models (DLMs) by combining 6-mer tokenization with Byte Pair Encoding (BPE-600). Traditional k-mer tokenization is effective at capturing local DNA sequence structures but often faces challenges, including uneven token distribution and a limited understanding of global sequence context. To address these limitations, we propose merging unique 6-mer tokens with optimally selected BPE tokens generated through 600 BPE cycles. This hybrid approach ensures a balanced and context-aware vocabulary, enabling the model to capture both short and long patterns within DNA sequences simultaneously. A foundational DLM trained on this hybrid vocabulary was evaluated using next-k-mer prediction as a fine-tuning task, demonstrating significantly improved performance. The model achieved prediction accuracies of 10.78% for 3-mers, 10.1% for 4-mers, and 4.12% for 5-mers, outperforming state-of-the-art models such as NT, DNABERT2, and GROVER. These results highlight the ability of the hybrid tokenization strategy to preserve both the local sequence structure and global contextual information in DNA modeling. This work underscores the importance of advanced tokenization methods in genomic language modeling and lays a robust foundation for future applications in downstream DNA sequence analysis and biological research.
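
Illustrative sketch (not the authors' code): the vocabulary construction described in the abstract can be approximated as the union of all 6-mers with tokens learned by a fixed number of BPE merge cycles. The function names (all_kmers, train_bpe, build_hybrid_vocab) are hypothetical, the BPE here is a plain character-level variant, and the selection rule (keep every BPE token that is not already a 6-mer) is an assumption; the abstract says only that the BPE tokens are "optimally selected" and does not give the criterion.

```python
from collections import Counter
from itertools import product


def all_kmers(k=6, alphabet="ACGT"):
    """Enumerate every possible k-mer over the alphabet (4**k strings for DNA)."""
    return {"".join(p) for p in product(alphabet, repeat=k)}


def train_bpe(sequences, num_merges):
    """Plain character-level BPE: return the merged tokens in the order learned."""
    corpus = [list(seq) for seq in sequences]
    merged_tokens = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        (left, right), count = pair_counts.most_common(1)[0]
        if count < 2:
            break  # nothing left that repeats, so stop early
        new_token = left + right
        merged_tokens.append(new_token)
        # Rewrite each sequence, replacing the chosen pair with the new token.
        for i, symbols in enumerate(corpus):
            out, j = [], 0
            while j < len(symbols):
                if j + 1 < len(symbols) and symbols[j] == left and symbols[j + 1] == right:
                    out.append(new_token)
                    j += 2
                else:
                    out.append(symbols[j])
                    j += 1
            corpus[i] = out
    return merged_tokens


def build_hybrid_vocab(sequences, k=6, num_merges=600):
    """Union of all k-mers and BPE tokens, mapped to integer ids."""
    kmers = all_kmers(k)
    tokens = sorted(kmers)
    seen = set(kmers)
    # Assumed selection rule: keep BPE tokens that are not already k-mers.
    for tok in train_bpe(sequences, num_merges):
        if tok not in seen:
            seen.add(tok)
            tokens.append(tok)
    return {tok: idx for idx, tok in enumerate(tokens)}


toy_corpus = ["ACGTACGTACGTACGT", "TTTTACGTACGTGGGG"]  # toy data, for illustration only
vocab = build_hybrid_vocab(toy_corpus, k=6, num_merges=10)
print(len(vocab))
```

With 6-mers the base vocabulary alone has 4^6 = 4096 entries, so the BPE tokens act as a comparatively small extension that covers shorter and longer recurring patterns.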

Related papers

MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging [65.07273789940116]
This paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks with fine-tuning or zero-shot evaluation.
arXiv Detail & Related papers (2025-11-17T19:27:41Z)
BMFM-DNA: A SNP-aware DNA foundation model to capture variant effects [14.172782866715844]
Large language models (LLMs) trained on text demonstrated remarkable results on natural language processing (NLP) tasks. DNA differs fundamentally from natural language, as it lacks clearly defined words or a consistent grammar. We pre-train foundation models that effectively integrate sequence variations, in particular Single Nucleotide Polymorphisms (SNPs). Our findings indicate that integrating sequence variations into DNALMs helps capture the biological functions as seen in improvements on all fine-tuning tasks.
arXiv Detail & Related papers (2025-06-26T13:56:32Z)
JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model [1.6128508494592848]
Large language models (LLMs) have revolutionized natural language processing and are increasingly applied to other sequential data types. We introduce JanusDNA, the first bidirectional DNA foundation model built upon a novel pretraining paradigm. JanusDNA processes up to 1 million base pairs at single nucleotide resolution on a single 80GB GPU.
arXiv Detail & Related papers (2025-05-22T20:10:55Z)
When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes [0.0]
We train independent BPE tokenizers with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all assemblies, while nearly 991,854 tokens are unique to a single genome. We discuss potential hybrid strategies and repeat-masking approaches to refine genomic tokenization.
arXiv Detail & Related papers (2025-05-13T19:27:58Z)
HybriDNA: A Hybrid Transformer-Mamba2 Long-Range DNA Language Model [70.69095062674944]
We propose HybriDNA, a decoder-only DNA language model that incorporates a hybrid Transformer-Mamba2 architecture. This hybrid design enables HybriDNA to efficiently process DNA sequences up to 131kb in length with single-nucleotide resolution. HybriDNA achieves state-of-the-art performance across 33 DNA understanding datasets curated from the BEND, GUE, and LRB benchmarks.
arXiv Detail & Related papers (2025-02-15T14:23:43Z)
GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
Model Decides How to Tokenize: Adaptive DNA Sequence Tokenization with MxDNA [44.630039477717624]
MxDNA is a novel framework where the model autonomously learns an effective DNA tokenization strategy through gradient descent. We show that MxDNA learns a unique tokenization strategy distinct from those of previous methods and captures genomic functionalities at a token level during self-supervised pretraining.
arXiv Detail & Related papers (2024-12-18T10:55:43Z)
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes. Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues. Lingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide level. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome [10.051595222470304]
We argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models. We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE). We introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints.
arXiv Detail & Related papers (2023-06-26T18:43:46Z)
