SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance
- URL: http://arxiv.org/abs/2508.11857v2
- Date: Mon, 25 Aug 2025 13:30:15 GMT
- Title: SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance
- Authors: Andrei-Valentin Tănase, Elena Pelican
- Abstract summary: Tokenization remains a fundamental yet underexplored bottleneck in natural language processing. We present SupraTok, a novel tokenization architecture that reimagines subword segmentation. Our approach achieves a 31% improvement in English tokenization efficiency.
- Score: 1.9336815376402718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tokenization remains a fundamental yet underexplored bottleneck in natural language processing, with strategies largely static despite remarkable progress in model architectures. We present SupraTok, a novel tokenization architecture that reimagines subword segmentation through three innovations: cross-boundary pattern learning that discovers multi-word semantic units, entropy-driven data curation that optimizes training corpus quality, and multi-phase curriculum learning for stable convergence. Our approach extends Byte-Pair Encoding by learning "superword" tokens, coherent multi-word expressions that preserve semantic unity while maximizing compression efficiency. SupraTok achieves 31% improvement in English tokenization efficiency (5.91 versus 4.51 characters per token) compared to OpenAI's o200k tokenizer and 30% improvement over Google's Gemma 3 tokenizer (256k vocabulary), while maintaining competitive performance across 38 languages. When integrated with a GPT-2 scale model (124M parameters) trained on 10 billion tokens from the FineWeb-Edu dataset, SupraTok yields 8.4% improvement on HellaSWAG and 9.5% on MMLU benchmarks without architectural modifications. While these results are promising at this scale, further validation at larger model scales is needed. These findings suggest that efficient tokenization can complement architectural innovations as a path to improved language model performance.
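The abstract's efficiency metric (characters per token) and the effect of cross-boundary "superword" tokens can be illustrated with a toy greedy tokenizer. This is a minimal sketch with a hand-made vocabulary, not the actual SupraTok algorithm or its learned merges; the phrase and vocabulary below are illustrative assumptions.

```python
def tokenize_greedy(text, vocab):
    """Greedy longest-match tokenization over a fixed vocabulary.

    Falls back to single characters when no vocabulary entry matches,
    mimicking the byte-level fallback of BPE-style tokenizers.
    """
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # single-character fallback
            i += 1
    return tokens

def chars_per_token(text, vocab):
    """The compression metric from the abstract: higher is more efficient."""
    return len(text) / len(tokenize_greedy(text, vocab))

text = "as a matter of fact"
# Ordinary subword vocabulary: segmentation stops at word boundaries.
subword_vocab = {"as", " a", " matter", " of", " fact"}
# A cross-boundary vocabulary additionally learns the whole idiom as one
# "superword" token, collapsing five tokens into one.
superword_vocab = subword_vocab | {"as a matter of fact"}

print(chars_per_token(text, subword_vocab))    # 19 chars / 5 tokens = 3.8
print(chars_per_token(text, superword_vocab))  # 19 chars / 1 token = 19.0
```

The gain here is exaggerated by construction; the abstract's reported 5.91 versus 4.51 characters per token reflects the same mechanism averaged over a realistic corpus and vocabulary budget.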
Related papers
- MergeDNA: Context-aware Genome Modeling with Dynamic Tokenization through Token Merging [65.07273789940116]
This paper introduces a hierarchical architecture that jointly optimizes a dynamic genomic tokenizer and latent Transformers with context-aware pre-training tasks. MergeDNA achieves superior performance on three popular DNA benchmarks and several multi-omics tasks with fine-tuning or zero-shot evaluation.
arXiv Detail & Related papers (2025-11-17T19:27:41Z) - IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs [5.068673710249497]
IndicSuperTokenizer is a tokenizer for Indic multilingual LLMs. It combines subword and multi-word tokenization with language-specific pre-tokenization. It improves the average fertility score by 39.5% over LLaMA4 and by 18% over Sutra.
arXiv Detail & Related papers (2025-11-05T06:57:42Z) - CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching [31.584937435966253]
We propose the Code-Switching Speech-to-Speech Benchmark (CS3-Bench). Experiments on 7 mainstream models demonstrate a relative performance drop of up to 66% in knowledge-intensive question answering. Our approach improves knowledge accuracy from 25.14% to 46.13% and open-ended understanding rate from 64.5% to 86.5%, while significantly reducing pronunciation errors in the secondary language.
arXiv Detail & Related papers (2025-10-09T07:34:23Z) - The Art of Breaking Words: Rethinking Multilingual Tokenizer Design [21.9940001977516]
Existing tokenizers exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. Our tokenizer achieves more than 40% improvement in average token-to-word ratio over state-of-the-art multilingual Indic models.
arXiv Detail & Related papers (2025-08-03T15:31:10Z) - Improving Contextual ASR via Multi-grained Fusion with Large Language Models [12.755830619473368]
We propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs). Our approach incorporates a late-fusion strategy that combines ASR's acoustic information with LLM's rich contextual knowledge, balancing fine-grained token precision with holistic phrase-level understanding. Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics.
arXiv Detail & Related papers (2025-07-16T13:59:32Z) - Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%.
arXiv Detail & Related papers (2025-06-23T18:02:26Z) - Enhancing LLM Language Adaption through Cross-lingual In-Context Pre-training [57.62126373849383]
Cross-lingual In-context Pre-training (CrossIC-PT) is a simple and scalable approach that enhances cross-lingual transfer. We construct CrossIC-PT samples by interleaving semantically related bilingual Wikipedia documents into a single context window. Experimental results demonstrate that CrossIC-PT improves multilingual performance on three models across six target languages.
arXiv Detail & Related papers (2025-04-29T07:24:25Z) - Learning Adaptive Parallel Reasoning with Language Models [70.1745752819628]
We propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance task success rate without requiring predefined reasoning structures.
arXiv Detail & Related papers (2025-04-21T22:29:02Z) - Towards Typologically Aware Rescoring to Mitigate Unfaithfulness in Lower-Resource Languages [9.426642998924724]
Multilingual large language models generate non-faithful output in resource-constrained languages. To mitigate unfaithfulness in such settings, we propose using computationally light auxiliary models to rescore the outputs of larger architectures. We show that monolingual 4-layer BERT models pretrained from scratch on less than 700 MB of data without fine-tuning are able to identify faithful summaries with a mean accuracy of 88.33%.
arXiv Detail & Related papers (2025-02-24T21:22:19Z) - Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark [0.29687381456163997]
Tokenization is a fundamental preprocessing step in NLP, directly impacting large language models' ability to capture syntactic, morphosyntactic, and semantic structures. This paper introduces a novel framework for evaluating tokenization strategies, addressing challenges in morphologically rich and low-resource languages.
arXiv Detail & Related papers (2025-02-10T21:47:49Z) - SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator [65.62084602011596]
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. We have identified a key pattern: certain seemingly meaningless separator tokens (i.e., punctuation) contribute disproportionately to attention scores compared to semantically meaningful tokens. We introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens.
arXiv Detail & Related papers (2024-12-16T18:58:57Z) - PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing [64.53242758625922]
PanGu-Σ is trained on a cluster of Ascend 910 AI processors using the MindSpore framework.
It provides state-of-the-art zero-shot performance on various Chinese NLP downstream tasks.
arXiv Detail & Related papers (2023-03-20T03:39:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.