Related papers: When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes

URL: http://arxiv.org/abs/2505.08918v1
Date: Tue, 13 May 2025 19:27:58 GMT
Title: When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes
Authors: Marina Popova, Iaroslav Chelombitko, Aleksey Komissarov
Abstract summary: We train independent BPE tokenizers with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all assemblies, while nearly 991,854 tokens are unique to a single genome. We discuss potential hybrid strategies and repeat-masking approaches to refine genomic tokenization.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte Pair Encoding (BPE) to nine T2T primate genomes including three human assemblies by training independent BPE tokenizers with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all assemblies, while nearly 991,854 tokens are unique to a single genome, indicating a rapid decline in shared vocabulary with increasing assembly comparisons. Moreover, phylogenetic trees derived from token overlap failed to recapitulate established primate relationships, a discrepancy attributed to the disproportionate influence of species-specific high-copy repetitive elements. These findings underscore the dual nature of BPE tokenization: while it effectively compresses repetitive sequences, its sensitivity to high-copy elements limits its utility as a universal tool for comparative genomics. We discuss potential hybrid strategies and repeat-masking approaches to refine genomic tokenization, emphasizing the need for domain-specific adaptations in the development of large-scale genomic language models. The dnaBPE tool used in this study is open-source and available at https://github.com/aglabx/dnaBPE.
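
The shared and unique token counts reported in the abstract can be pictured as set operations over per-assembly vocabularies. The following is a minimal sketch under stated assumptions, not the dnaBPE implementation: the function name `vocabulary_overlap`, the assembly labels, and the toy vocabularies are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def vocabulary_overlap(vocabs):
    """Return (shared, unique) token sets.

    vocabs: dict mapping an assembly name to the set of tokens in its
    independently trained BPE vocabulary (hypothetical input format).
    """
    # Count in how many vocabularies each token appears.
    counts = Counter(tok for vocab in vocabs.values() for tok in vocab)
    n = len(vocabs)
    shared = {tok for tok, c in counts.items() if c == n}  # present in every assembly
    unique = {tok for tok, c in counts.items() if c == 1}  # present in exactly one
    return shared, unique


# Toy vocabularies, invented for illustration; not data from the paper.
toy = {
    "assembly_A": {"A", "C", "G", "T", "AT", "ATAT", "GGC"},
    "assembly_B": {"A", "C", "G", "T", "AT", "CACA"},
    "assembly_C": {"A", "C", "G", "T", "AT", "ATAT", "TTAGGG"},
}
shared, unique = vocabulary_overlap(toy)
print(sorted(shared))  # ['A', 'AT', 'C', 'G', 'T']
print(sorted(unique))  # ['CACA', 'GGC', 'TTAGGG']
```

In this toy case the long, repeat-like tokens are the ones that end up unique to a single vocabulary, which mirrors the abstract's point that species-specific high-copy elements dominate the non-shared part of the vocabulary.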

Related papers

Hybrid Tokenization Strategy for DNA Language Model using Byte Pair Encoding and K-MER Methods [0.0]
Traditional k-mer tokenization is effective at capturing local DNA sequence structures but often faces challenges. We propose merging unique 6mer tokens with selected BPE tokens generated through 600 BPE cycles. This hybrid approach ensures a balanced and context-aware vocabulary, enabling the model to capture both short and long patterns.
arXiv Detail & Related papers (2025-07-24T16:45:23Z)
Fast and Scalable Gene Embedding Search: A Comparative Study of FAISS and ScaNN [0.3015442485490762]
Large-scale similarity search is a foundational task in bioinformatics for detecting homology, functional similarity, and novelty among genomic and proteomic sequences. We explore embedding-based similarity search methods that learn latent representations capturing deeper structural and functional patterns beyond raw sequence alignment. Our results highlight both computational advantages (in memory and runtime efficiency) and improved retrieval quality, offering a promising alternative to traditional alignment-heavy tools.
arXiv Detail & Related papers (2025-07-22T19:28:54Z)
evoBPE: Evolutionary Protein Sequence Tokenization [3.4196611972116786]
Current subword tokenization techniques, primarily developed for natural language processing, often fail to represent protein sequences' complex structural and functional properties adequately. This study introduces evoBPE, a novel tokenization approach that integrates evolutionary mutation patterns into sequence segmentation. evoBPE opens new possibilities for machine learning applications in protein function prediction, structural modeling, and evolutionary analysis.
arXiv Detail & Related papers (2025-03-11T19:19:48Z)
GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z)
VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z)
Efficient and Scalable Fine-Tune of Language Models for Genome Understanding [49.606093223945734]
We present Lingo: Language prefix fIne-tuning for GenOmes. Unlike DNA foundation models, Lingo strategically leverages natural language foundation models' contextual cues. Lingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z)
Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition [0.0]
Byte pair encoding (BPE) emerges as an effective tokenization method for tackling the out-of-vocabulary (OOV) challenge. Recent research highlights the dependency of BPE subword tokenization's efficacy on the morphological nature of the language. Our study empirically identifies the optimal number of BPE tokens for Bengali, a language known for its morphological complexity.
arXiv Detail & Related papers (2024-01-28T00:41:21Z)
HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution [76.97231739317259]
We present HyenaDNA, a genomic foundation model pretrained on the human reference genome with context lengths of up to 1 million tokens at the single nucleotide-level. On fine-tuned benchmarks from the Nucleotide Transformer, HyenaDNA reaches state-of-the-art (SotA) on 12 of 18 datasets using a model with orders of magnitude fewer parameters and less pretraining data.
arXiv Detail & Related papers (2023-06-27T20:46:34Z)
DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome [10.051595222470304]
We argue that the computation and sample inefficiencies introduced by k-mer tokenization are primary obstacles in developing large genome foundational models. We provide conceptual and empirical insights into genome tokenization, building on which we propose to replace k-mer tokenization with Byte Pair Encoding (BPE). We introduce DNABERT-2, a refined genome foundation model that adapts an efficient tokenizer and employs multiple strategies to overcome input length constraints.
arXiv Detail & Related papers (2023-06-26T18:43:46Z)
SNP2Vec: Scalable Self-Supervised Pre-Training for Genome-Wide Association Study [48.75445626157713]
SNP2Vec is a scalable self-supervised pre-training approach for understanding SNP. We apply SNP2Vec to perform long-sequence genomics modeling. We evaluate the effectiveness of our approach on predicting Alzheimer's disease risk in a Chinese cohort.
arXiv Detail & Related papers (2022-04-14T01:53:58Z)
Deep metric learning improves lab of origin prediction of genetically engineered plasmids [63.05016513788047]
Genetic engineering attribution (GEA) is the ability to make sequence-lab associations. We propose a method, based on metric learning, that ranks the most likely labs-of-origin. We are able to extract key signatures in plasmid sequences for particular labs, allowing for an interpretable examination of the model's outputs.
arXiv Detail & Related papers (2021-11-24T16:29:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.