DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models
- URL: http://arxiv.org/abs/2601.05531v1
- Date: Fri, 09 Jan 2026 05:08:17 GMT
- Title: DNATokenizer: A GPU-First Byte-to-Identifier Tokenizer for High-Throughput DNA Language Models
- Authors: Eliatan Niktab, Hardip Patel
- Abstract summary: DNATok is a GPU-first tokenization system that replaces general-purpose string processing with byte lookup table (LUT)-based identifier streaming and an overlapped host-to-device (H2D)/compute pipeline. DNATok achieves 84-95x higher encoding throughput than optimized Hugging Face baselines and up to 1.9x higher H2D throughput.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization sits at the boundary between high-throughput genomic input and GPU compute, posing challenges in both algorithm design and system throughput. Overlapping k-mer tokenization can introduce information leakage under masked language modeling (MLM) and may degrade downstream accuracy. Single-nucleotide tokenization avoids leakage and preserves per-base fidelity, but it greatly increases sequence length for attention-based architectures. Non-overlapping k-mers and byte-pair encoding (BPE) provide compression and avoid leakage, at the cost of boundary sensitivity or reduced interpretability. Empirically, the choice of tokenization interacts strongly with model architecture and task requirements. At the system level, however, standard string tokenizers and host-bound vocabulary lookups dominate wall-clock time once inputs reach billions of bases, regardless of the tokenization algorithm. We present DNATok, a high-performance, GPU-first tokenization system that replaces general-purpose string processing with byte lookup table (LUT)-based identifier streaming and an overlapped host-to-device (H2D)/compute pipeline using pinned memory and architectural parallelism. DNATok is vocabulary-agnostic: it accelerates single-nucleotide, non-overlapping k-mer, and BPE tokenization, and integrates as a drop-in systems layer beneath genomic foundation models. DNATok achieves 84-95x higher encoding throughput than optimized Hugging Face baselines and up to 1.9x higher H2D throughput. End-to-end streaming reaches 1.27-1.84e8 tokens/s depending on configuration, effectively removing tokenization as a bottleneck for production-scale training and inference.
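The core idea described in the abstract, replacing string processing with a byte lookup table so that tokenization becomes a single gather over raw input bytes, can be illustrated with a minimal CPU sketch. This is not the paper's implementation (DNATok runs the LUT and H2D pipeline on the GPU); the vocabulary, token ids, and function names below are hypothetical, chosen only to show single-nucleotide LUT-based identifier streaming.

```python
import numpy as np

# Hypothetical 256-entry lookup table: every possible input byte maps
# directly to a token id, so encoding is one vectorized gather rather
# than per-character string processing.
PAD_ID, UNK_ID = 0, 1
lut = np.full(256, UNK_ID, dtype=np.int64)
for i, base in enumerate(b"ACGT"):
    lut[base] = i + 2            # A -> 2, C -> 3, G -> 4, T -> 5
for base in b"acgt":
    lut[base] = lut[base - 32]   # lowercase bases share the same ids

def tokenize(seq: bytes) -> np.ndarray:
    """Single-nucleotide tokenization as one LUT gather over raw bytes."""
    return lut[np.frombuffer(seq, dtype=np.uint8)]

ids = tokenize(b"ACGTn")   # 'n' is not in the LUT, so it maps to UNK_ID
```

Because the gather is branch-free and embarrassingly parallel, the same pattern ports directly to a GPU kernel, which is what makes a LUT-based design a natural fit for the overlapped H2D/compute pipeline the abstract describes.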
Related papers
- SCONE: A Practical, Constraint-Aware Plug-in for Latent Encoding in Learned DNA Storage [11.92900213512492]
The SCONE plug-in module collapses latent compression and DNA encoding into a single step. SCONE performs quaternary arithmetic coding directly on the latent space in DNA bases. The design preserves full reversibility and exploits the hyperprior model's learned priors without modification.
arXiv Detail & Related papers (2026-02-05T19:54:13Z) - ResTok: Learning Hierarchical Residuals in 1D Visual Tokenizers for Autoregressive Image Generation [64.84095852784714]
Residual Tokenizer (ResTok) is a 1D visual tokenizer that builds hierarchical residuals for both image tokens and latent tokens. We show that restoring hierarchical residual priors in visual tokenization significantly improves AR image generation, achieving a gFID of 2.34 on ImageNet-256 with only 9 sampling steps.
arXiv Detail & Related papers (2026-01-07T14:09:18Z) - Trainable Log-linear Sparse Attention for Efficient Diffusion Transformers [36.26426380985327]
Diffusion Transformers (DiTs) set the state of the art in visual generation, yet their quadratic self-attention cost limits scaling to long token sequences. Recent Top-K sparse attention approaches reduce the computation of DiTs by compressing tokens into block-wise representations. We introduce Log-linear Sparse Attention (LLSA), a trainable sparse attention mechanism for extremely long token sequences.
arXiv Detail & Related papers (2025-12-18T14:53:12Z) - GSPN-2: Efficient Parallel Sequence Modeling [101.33780567131716]
Generalized Spatial Propagation Network (GSPN) replaces quadratic self-attention with a line-scan propagation scheme. GSPN-2 establishes a new efficiency frontier for modeling global spatial context in vision applications.
arXiv Detail & Related papers (2025-11-28T07:26:45Z) - Near-Lossless Model Compression Enables Longer Context Inference in DNA Large Language Models [8.059385582452112]
FOCUS (Feature-Oriented Compression for Ultra-long Self-attention) is a progressive context-compression module that can be plugged into pretrained DNA LLMs. On held-out human chromosomes, FOCUS achieves near-lossless fidelity. Compared to a baseline without compression, FOCUS reduces KV-cache memory and converts effective inference scaling from O(N^2) to near-linear O(N).
arXiv Detail & Related papers (2025-11-18T17:29:39Z) - MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging [65.07273789940116]
This paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks under fine-tuning or zero-shot evaluation.
arXiv Detail & Related papers (2025-11-17T19:27:41Z) - Continuous Autoregressive Language Models [56.49239051750678]
We introduce Continuous Autoregressive Language Models (CALM). CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector. We develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling.
arXiv Detail & Related papers (2025-10-31T17:58:11Z) - REAR: Rethinking Visual Autoregressive Models via Generator-Tokenizer Consistency Regularization [130.46612643194973]
reAR is a simple training strategy that introduces a token-wise regularization objective. On ImageNet, it reduces gFID from 3.02 to 1.86 and improves IS to 316.9 using a standardization-based tokenizer. When applied to advanced tokenizers, it achieves a gFID of 1.42 with only 177M parameters, matching the performance of larger state-of-the-art diffusion models (675M).
arXiv Detail & Related papers (2025-10-06T02:48:13Z) - BlockBPE: Parallel BPE Tokenization [0.0]
BlockBPE is a parallel GPU implementation of byte-pair encoding (BPE). It achieves near-linear time complexity under realistic assumptions. On high-batch inference workloads, BlockBPE achieves up to 2x higher throughput than tiktoken and 2.5x over HuggingFace Tokenizers.
arXiv Detail & Related papers (2025-07-16T06:12:41Z) - zip2zip: Inference-Time Adaptive Tokenization via Online Compression [27.16551923444618]
zip2zip is a novel method for achieving context-adaptive tokenization in large language models. We show that an existing LLM can be uptrained for zip2zip in 10 GPU-hours via parameter-efficient finetuning. The resulting LLM performs test-time adaptation, learning to use hypertokens in unseen contexts and reducing input and output tokens by 15-40%.
arXiv Detail & Related papers (2025-06-01T17:03:02Z) - Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation [84.00166854547241]
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. We propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste.
arXiv Detail & Related papers (2025-05-24T21:30:29Z) - CODA: Repurposing Continuous VAEs for Discrete Tokenization [31.932323809073477]
CODA (COntinuous-to-Discrete Adaptation) is a framework that decouples compression and discretization. Our approach achieves a remarkable codebook utilization of 100% and notable reconstruction FID (rFID) of 0.43 and 1.34 for 8x and 16x compression on the ImageNet 256x256 benchmark.
arXiv Detail & Related papers (2025-03-22T12:59:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.