TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
- URL: http://arxiv.org/abs/2512.20757v1
- Date: Tue, 23 Dec 2025 20:43:06 GMT
- Title: TokSuite: Measuring the Impact of Tokenizer Choice on Language Model Behavior
- Authors: Gül Sena Altıntaş, Malikeh Ehghaghi, Brian Lester, Fengyuan Liu, Wanru Zhao, Marco Ciccone, Colin Raffel
- Abstract summary: Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). TokSuite is a collection of models and a benchmark that supports research into tokenization's influence on LMs.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenizers provide the fundamental basis through which text is represented and processed by language models (LMs). Despite the importance of tokenization, its role in LM performance and behavior is poorly understood due to the challenge of measuring the impact of tokenization in isolation. To address this need, we present TokSuite, a collection of models and a benchmark that supports research into tokenization's influence on LMs. Specifically, we train fourteen models that use different tokenizers but are otherwise identical using the same architecture, dataset, training budget, and initialization. Additionally, we curate and release a new benchmark that specifically measures model performance subject to real-world perturbations that are likely to influence tokenization. Together, TokSuite allows robust decoupling of the influence of a model's tokenizer, supporting a series of novel findings that elucidate the respective benefits and shortcomings of a wide range of popular tokenizers.
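As a toy illustration of the kind of tokenizer-sensitive behavior TokSuite's perturbation benchmark measures (this sketch is our own, not TokSuite's actual tokenizers), a greedy longest-match tokenizer over a hypothetical vocabulary shows how a one-character perturbation can change the segmentation and token count of a word:

```python
# Hypothetical vocabulary; real subword tokenizers (BPE, Unigram, etc.)
# learn vocabularies from data, but the segmentation effect is analogous.
vocab = {"token", "izer", "tok"}

def tokenize(text, vocab):
    """Greedy longest-match segmentation; unmatched characters
    fall back to single-character tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

print(tokenize("tokenizer", vocab))  # ['token', 'izer']  -> 2 tokens
print(tokenize("tokanizer", vocab))  # ['tok', 'a', 'n', 'izer']  -> 4 tokens
```

A single typo doubles the token count and yields pieces the model may rarely have seen during training, which is exactly the sort of real-world perturbation the benchmark targets.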
Related papers
- How Different Tokenization Algorithms Impact LLMs and Transformer Models for Binary Code Analysis [0.0]
Despite its significance, tokenization in the context of assembly code remains an underexplored area. We explore preprocessing customization options and pre-tokenization rules tailored to the unique characteristics of assembly code. We compare tokenizers based on tokenization efficiency, vocabulary compression, and representational fidelity for assembly code.
arXiv Detail & Related papers (2025-11-05T19:45:26Z) - ssToken: Self-modulated and Semantic-aware Token Selection for LLM Fine-tuning [51.133569963553576]
ssToken is a Self-modulated and Semantic-aware Token Selection approach. We show that both self-modulated selection and semantic-aware selection alone outperform full-data fine-tuning.
arXiv Detail & Related papers (2025-10-21T03:21:04Z) - Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning [51.92313556418432]
Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs). We suggest categorizing tokens within each corpus into two parts, positive and negative tokens, based on whether they are useful for improving model performance. Experiments on well-established benchmarks find that this forgetting mechanism not only improves overall model performance but also facilitates more diverse model responses.
arXiv Detail & Related papers (2025-08-06T11:22:23Z) - Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%.
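"Right-aligned digit grouping" means chunking a number's digits from the right rather than the left, so that groups align with place value (thousands, millions, ...) across numbers of different lengths. A minimal sketch of such a grouping function (our own illustration; the paper's exact formatting may differ):

```python
def right_aligned_groups(digits: str, k: int = 3) -> list[str]:
    """Split a digit string into k-sized chunks counted from the right.

    Left-aligned chunking of "1234567" would give ['123', '456', '7'],
    misaligning place values; right-aligned chunking keeps the ones,
    thousands, and millions groups consistent.
    """
    groups = []
    i = len(digits)
    while i > 0:
        groups.append(digits[max(0, i - k):i])
        i -= k
    return groups[::-1]

print(right_aligned_groups("1234567"))  # ['1', '234', '567']
print(right_aligned_groups("567"))      # ['567']
```

Because "567" tokenizes the same way whether it appears alone or as the low-order group of "1234567", digit-level arithmetic patterns transfer across number lengths.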
arXiv Detail & Related papers (2025-06-23T18:02:26Z) - Beyond Text Compression: Evaluating Tokenizers Across Scales [4.0253589606301174]
We show that tokenizer choice has negligible effects on tasks in English but results in consistent performance differences in multilingual settings. We propose new intrinsic tokenizer metrics inspired by Zipf's law that correlate more strongly with downstream performance than text compression.
arXiv Detail & Related papers (2025-06-03T17:35:56Z) - Tokenization is Sensitive to Language Variation [14.568179478275255]
Tokenizers split texts into smaller units and might behave differently for less common linguistic forms. This might affect downstream LLM performance differently on two types of tasks. We find that the best tokenizer varies across the two task types and that the pre-tokenizer has the biggest overall impact on performance.
arXiv Detail & Related papers (2025-02-21T09:58:54Z) - Exact Byte-Level Probabilities from Tokenized Language Models for FIM-Tasks and Model Ensembles [23.134664392314264]
Tokenization is associated with many poorly understood shortcomings in language models (LMs). This work studies how tokenization impacts model performance by analyzing and comparing models with their byte-level counterparts. We introduce the Byte-Token Representation Lemma, a framework that establishes a mapping between the learned token distribution and its equivalent byte-level distribution.
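The core idea behind such a mapping can be sketched in a heavily simplified single-step setting (our own toy, not the paper's lemma, which handles full sequences and multi-token continuations): the probability that the next bytes start with a given prefix is the total probability of the tokens whose byte strings begin with that prefix.

```python
# Hypothetical single-step token distribution (assumed for illustration).
token_probs = {"cat": 0.5, "car": 0.25, "dog": 0.25}

def byte_prefix_prob(prefix: str, token_probs: dict[str, float]) -> float:
    """P(next bytes start with `prefix`), toy version: sum the
    probabilities of tokens beginning with the prefix. Tokens shorter
    than the prefix (covered by later tokens) are ignored here; the
    actual lemma accounts for those continuations."""
    return sum(p for tok, p in token_probs.items() if tok.startswith(prefix))

print(byte_prefix_prob("ca", token_probs))  # 0.75
print(byte_prefix_prob("d", token_probs))   # 0.25
```

Aggregating token mass by byte prefix is what lets tokenized models be compared to, or ensembled with, byte-level models on a common footing.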
arXiv Detail & Related papers (2024-10-11T23:30:42Z) - Can Perplexity Predict Fine-tuning Performance? An Investigation of Tokenization Effects on Sequential Language Models for Nepali [0.0]
SentencePiece tokenization consistently yields superior results on understanding-based tasks for Nepali. Our research specifically examines sequential transformer models, providing valuable insights for language model development in low-resource languages.
arXiv Detail & Related papers (2024-04-28T05:26:12Z) - Revisiting Demonstration Selection Strategies in In-Context Learning [66.11652803887284]
Large language models (LLMs) have shown an impressive ability to perform a wide range of tasks using in-context learning (ICL).
In this work, we first revisit the factors contributing to this variance from both data and model aspects, and find that the choice of demonstration is both data- and model-dependent.
We propose a data- and model-dependent demonstration selection method, TopK + ConE, based on the assumption that the performance of a demonstration positively correlates with its contribution to the model's understanding of the test samples.
arXiv Detail & Related papers (2024-01-22T16:25:27Z) - Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations. Our work sheds light on this learning process and deepens our understanding of the roles different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z) - Improving Input-label Mapping with Demonstration Replay for In-context Learning [67.57288926736923]
In-context learning (ICL) is an emerging capability of large autoregressive language models.
We propose a novel ICL method called Sliding Causal Attention (RdSca).
We show that our method significantly improves the input-label mapping in ICL demonstrations.
arXiv Detail & Related papers (2023-10-30T14:29:41Z) - An Information Extraction Study: Take In Mind the Tokenization! [18.20319269401045]
We study the impact of tokenization when extracting information from documents.
We present a comparative study and analysis of subword-based and character-based models.
Among the main outcomes: tokenization patterns can introduce inductive bias that results in state-of-the-art performance.
arXiv Detail & Related papers (2023-03-27T11:08:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.