Length-MAX Tokenizer for Language Models
- URL: http://arxiv.org/abs/2511.20849v1
- Date: Tue, 25 Nov 2025 20:56:56 GMT
- Title: Length-MAX Tokenizer for Language Models
- Authors: Dong Dong, Weijie Su
- Abstract summary: We introduce a new tokenizer for language models that minimizes the average tokens per character. The Length-MAX tokenizer achieves 99.62% vocabulary coverage, and the out-of-vocabulary rate remains low at 0.12% on test sets.
- Score: 2.243087516606811
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a new tokenizer for language models that minimizes the average tokens per character, thereby reducing the number of tokens needed to represent text during training and to generate text during inference. Our method, which we refer to as the Length-MAX tokenizer, obtains its vocabulary by casting a length-weighted objective maximization as a graph partitioning problem and developing a greedy approximation algorithm. On FineWeb and diverse domains, it yields 14--18\% fewer tokens than Byte Pair Encoding (BPE) across vocabulary sizes from 10K to 50K, and the reduction is 13.0\% when the size is 64K. Training GPT-2 models at 124M, 355M, and 1.3B parameters from scratch with five runs each shows 18.5\%, 17.2\%, and 18.5\% fewer steps, respectively, to reach a fixed validation loss, and 13.7\%, 12.7\%, and 13.7\% lower inference latency, together with a 16\% throughput gain at 124M, while consistently improving on downstream tasks including reducing LAMBADA perplexity by 11.7\% and enhancing HellaSwag accuracy by 4.3\%. Moreover, the Length-MAX tokenizer achieves 99.62\% vocabulary coverage and the out-of-vocabulary rate remains low at 0.12\% on test sets. These results demonstrate that optimizing for average token length, rather than frequency alone, offers an effective approach to more efficient language modeling without sacrificing -- and often improving -- downstream performance. The tokenizer is compatible with production systems and reduces embedding and KV-cache memory by 18\% at inference.
Related papers
- Extreme Model Compression for Edge Vision-Language Models: Sparse Temporal Token Fusion and Adaptive Neural Compression [0.0]
Two adaptive compression techniques are proposed to integrate algorithmic innovations with hardware-aware optimizations. On event-based vision tasks, STTF reduces the average token count by 84%. ANC cuts FLOPs by up to 90% in low-motion scenes.
arXiv Detail & Related papers (2025-11-23T15:43:00Z) - SupraTok: Cross-Boundary Tokenization for Enhanced Language Model Performance [1.9336815376402718]
Tokenization remains a fundamental yet underexplored bottleneck in natural language processing. We present SupraTok, a novel tokenization architecture that reimagines subword segmentation. Our approach achieves a 31% improvement in English tokenization efficiency.
arXiv Detail & Related papers (2025-08-16T00:54:20Z) - Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%.
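As a hedged illustration of that last point (the helper name is mine, not from the paper), "right-aligned digit grouping" segments a digit string into groups of three starting from the rightmost digit, so that place values line up consistently across numbers:

```python
def group_digits_right(num_str, k=3):
    """Split a digit string into groups of k, aligned from the right,
    e.g. '1234567' -> ['1', '234', '567']."""
    groups = []
    i = len(num_str)
    while i > 0:
        groups.append(num_str[max(0, i - k):i])
        i -= k
    return groups[::-1]

print(group_digits_right("1234567"))  # -> ['1', '234', '567']
```

Left-aligned grouping would instead yield ['123', '456', '7'], breaking the correspondence between token position and place value.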
arXiv Detail & Related papers (2025-06-23T18:02:26Z) - SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator [65.62084602011596]
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. We have identified a key pattern: certain seemingly meaningless separator tokens (i.e., punctuation) contribute disproportionately to attention scores compared to semantically meaningful tokens. We introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens.
arXiv Detail & Related papers (2024-12-16T18:58:57Z) - Retrofitting Large Language Models with Dynamic Tokenization [3.608780819053423]
Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. We propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text.
arXiv Detail & Related papers (2024-11-27T17:51:58Z) - Text Quality-Based Pruning for Efficient Training of Language Models [66.66259229732121]
We propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets.
By proposing the text quality metric, the paper establishes a framework to identify and eliminate low-quality text instances.
Experimental results over multiple models and datasets demonstrate the efficacy of this approach.
arXiv Detail & Related papers (2024-04-26T18:01:25Z) - BatchPrompt: Accomplish more with less [9.204837699571788]
BatchPrompt is an efficient way to batch data within the token limit.
To retain efficiency and overcome performance loss, we propose Batch Permutation and Ensembling.
This is the first work to technically improve the prompting efficiency of large language models.
arXiv Detail & Related papers (2023-09-01T10:44:36Z) - Efficient Speech Representation Learning with Low-Bit Quantization [32.75829498841329]
We apply and investigate recent quantization techniques on speech representation learning models.
With aggressive quantization to 1 bit, we achieve an 86.32% storage reduction (184.42 -> 25.23) and an 88% estimated runtime reduction (1.00 -> 0.12), at the cost of an increased word error rate (7.06 -> 15.96).
In comparison with DistilHuBERT, which also aims for model compression, the 2-bit configuration yielded slightly smaller storage (35.84 vs. 46.98), a better word error rate (12.68 vs. 13.37), and a more efficient estimated runtime (0.15 vs. 0.73).
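The reported percentages follow directly from the quoted before/after numbers (units are as given in the summary; a quick arithmetic check):

```python
# Recomputing the reported reductions from the quoted figures.
storage_before, storage_after = 184.42, 25.23
runtime_before, runtime_after = 1.00, 0.12

storage_reduction = (storage_before - storage_after) / storage_before
runtime_reduction = (runtime_before - runtime_after) / runtime_before

print(f"storage reduction: {storage_reduction:.2%}")  # ~86.32%
print(f"runtime reduction: {runtime_reduction:.2%}")  # 88.00%
```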
arXiv Detail & Related papers (2022-12-14T06:09:08Z) - Non-Parametric Adaptive Network Pruning [125.4414216272874]
We introduce non-parametric modeling to simplify the algorithm design.
Inspired by the face recognition community, we use a message passing algorithm to obtain an adaptive number of exemplars.
EPruner breaks the dependency on the training data in determining the "important" filters.
arXiv Detail & Related papers (2021-01-20T06:18:38Z) - Multilingual Speech Translation with Efficient Finetuning of Pretrained Models [82.22294901727933]
A minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot crosslingual and cross-modality transfer ability.
Our approach demonstrates strong zero-shot performance in a many-to-many multilingual model.
arXiv Detail & Related papers (2020-10-24T08:15:08Z) - BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization [57.14179747713731]
We introduce a training method for minimizing inference bitlength at any granularity while maintaining accuracy.
With ImageNet, the method produces average per-layer bitlengths of 4.13, 3.76, and 4.36 bits.
arXiv Detail & Related papers (2020-02-08T04:58:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.