Are you going to finish that? A Practical Study of the Partial Token Problem
- URL: http://arxiv.org/abs/2601.23223v2
- Date: Mon, 02 Feb 2026 21:48:06 GMT
- Title: Are you going to finish that? A Practical Study of the Partial Token Problem
- Authors: Hao Xu, Alisa Liu, Jonathan Hayase, Yejin Choi, Noah A. Smith
- Abstract summary: Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text. This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt in the middle of the expected next token. In this work, we identify three domains where token and "word" boundaries often do not line up.
- Score: 85.49816027251013
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text. This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt in the middle of the expected next token, leading to distorted next-token predictions. Although this issue has been studied using arbitrary character prefixes, its prevalence and severity in realistic prompts respecting word boundaries remain underexplored. In this work, we identify three domains where token and "word" boundaries often do not line up: languages that do not use whitespace, highly compounding languages, and code. In Chinese, for example, up to 25% of word boundaries do not line up with token boundaries, making even natural, word-complete prompts susceptible to this problem. We systematically construct semantically natural prompts ending with a partial token; in experiments, we find that they constitute a serious failure mode: frontier LMs consistently place three orders of magnitude less probability on the correct continuation compared to when the prompt is "backed off" to be token-aligned. This degradation does not diminish with scale and often worsens for larger models. Finally, we evaluate inference-time mitigations to the partial token problem and validate the effectiveness of recent exact solutions. Overall, we demonstrate the scale and severity of probability distortion caused by tokenization in realistic use cases, and provide practical recommendations for model inference providers.
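To make the mismatch concrete, here is a minimal sketch using the open-source tiktoken library; the prompt string and the trim-the-last-token back-off heuristic are our own illustration, not the paper's experimental setup.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

prompt = "The weather today is unpredict"  # user stopped mid-word, mid-token
tokens = enc.encode(prompt)

# The trailing "unpredict" is unlikely to be segmented the way it would be
# inside the completed word, so the model must continue from a token
# sequence it rarely, if ever, saw followed by "able" during training.
print([enc.decode([t]) for t in tokens])

# A simple "back-off": trim to the previous token boundary and let the model
# regenerate the partial token itself (exact methods avoid discarding the
# typed characters; this heuristic is only the baseline idea).
backed_off = enc.decode(tokens[:-1])
print(repr(backed_off))
```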
Related papers
- LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers [76.59130257385826]
Intermediate merge residues in BPE vocabularies are tokens that occur frequently during merge learning and are therefore retained in the final vocabulary, but that are mostly merged further and rarely emitted when the tokenizer is actually applied to a corpus. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Experiments show that LiteToken reduces token fragmentation, reduces parameters, and improves robustness to noisy or misspelled inputs, while preserving overall performance.
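A toy sketch of the phenomenon (our own construction, not LiteToken's procedure): run a miniature BPE encoder over a small corpus and flag vocabulary tokens that merge learning created but encoding never emits.

```python
from collections import Counter

def bpe_encode(word, merges):
    """Apply BPE merges greedily, in learned priority order."""
    parts = list(word)
    for a, b in merges:
        i = 0
        while i < len(parts) - 1:
            if parts[i] == a and parts[i + 1] == b:
                parts[i:i + 2] = [a + b]
            else:
                i += 1
    return parts

merges = [("l", "o"), ("lo", "w"), ("low", "e"), ("lowe", "r")]
vocab = {"l", "o", "w", "e", "r", "s", "t", "lo", "low", "lowe", "lower"}

corpus = ["lower", "lower", "low", "lowest"]
emitted = Counter(tok for word in corpus for tok in bpe_encode(word, merges))

# Multi-character tokens the merges built but encoding never outputs:
residues = {tok for tok in vocab if len(tok) > 1 and emitted[tok] == 0}
print(residues)  # {'lo'}: a stepping stone toward 'low', never emitted itself
```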
arXiv Detail & Related papers (2026-02-04T16:19:05Z)
- FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution [3.4666771782038652]
Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. We introduce FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning.
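A minimal sketch of the keep-only-important-tokens idea; the inverse-frequency scorer below is a crude stand-in for the model-based token attribution FrugalPrompt actually uses.

```python
from collections import Counter

def compress_prompt(tokens, score, keep_ratio=0.5):
    """Keep the highest-scoring tokens, preserving their original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(range(len(tokens)), key=lambda i: score(tokens[i]),
                  reverse=True)[:k]
    return [tokens[i] for i in sorted(keep)]

# Crude attribution stand-in: tokens common in filler text score low.
filler = Counter("please kindly the a an of to".split())
score = lambda tok: 1.0 / (1 + filler[tok])

prompt = "please kindly summarize the following quarterly earnings report"
print(" ".join(compress_prompt(prompt.split(), score, keep_ratio=0.5)))
```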
arXiv Detail & Related papers (2025-10-18T10:22:13Z)
- Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%.
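Of these interventions, right-aligned digit grouping is the easiest to state precisely; a small sketch (the paper's delimiter conventions may differ):

```python
def right_aligned_groups(digits: str, size: int = 3) -> list[str]:
    """Group digits from the right: '1234567' -> ['1', '234', '567'],
    not the left-aligned ['123', '456', '7']."""
    head = len(digits) % size or size
    return [digits[:head]] + [digits[head + i: head + i + size]
                              for i in range(0, len(digits) - head, size)]

print(right_aligned_groups("1234567"))  # ['1', '234', '567']
print(right_aligned_groups("123456"))   # ['123', '456']
```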
arXiv Detail & Related papers (2025-06-23T18:02:26Z)
- Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP). We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
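The naive version of that conversion is easy to sketch: marginalize the next-token distribution down to a next-byte distribution. The paper's method is exact and also accounts for bytes arriving mid-token; this toy only illustrates the marginalization step.

```python
from collections import defaultdict

def next_byte_distribution(token_probs: dict[bytes, float]) -> dict[int, float]:
    """Collapse a next-token distribution to a next-byte distribution by
    crediting each token's probability to its first byte."""
    mass = defaultdict(float)
    for tok, p in token_probs.items():
        if tok:
            mass[tok[0]] += p
    total = sum(mass.values())
    return {byte: m / total for byte, m in mass.items()}

toy = {b" the": 0.4, b" th": 0.1, b" a": 0.3, b"and": 0.2}
print(next_byte_distribution(toy))  # {32: 0.8, 97: 0.2}
```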
arXiv Detail & Related papers (2025-06-17T02:37:04Z)
- Causal Estimation of Tokenisation Bias [58.20086589761273]
We quantify the effect of including or excluding a subword in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters. We find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times.
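Schematically, the quantity contrasted is the probability of the same characters under the two vocabularies; a worked toy with invented numbers:

```python
import math

# If "hello" is a single vocabulary item, the span costs one factor; if not,
# it is spelled by chained subword factors. All probabilities are invented.
p_in_vocab  = 0.020          # p("hello" | ctx)
p_out_vocab = 0.050 * 0.300  # p("hel" | ctx) * p("lo" | ctx + "hel")

print(p_in_vocab / p_out_vocab)            # ratio of character probabilities
print(math.log(p_in_vocab / p_out_vocab))  # the same contrast in log-space
```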
arXiv Detail & Related papers (2025-06-03T17:59:47Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating a constraint on every token can be prohibitively expensive. Locally constrained decoding (LCD) can distort the global distribution over strings, because it samples tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
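The core move is easy to sketch: sample from the raw distribution and evaluate the constraint lazily, only on tokens actually drawn. The sketch below omits the paper's weighting correction, which is what keeps the overall procedure unbiased.

```python
import random

def rejection_sample(token_probs, allowed):
    """Draw from the unconstrained distribution; check the constraint only on
    sampled tokens, removing (renormalizing away) any that fail."""
    live = dict(token_probs)
    while live:
        toks, weights = zip(*live.items())
        tok = random.choices(toks, weights=weights)[0]  # one cheap draw
        if allowed(tok):          # constraint evaluated lazily, not per token
            return tok
        del live[tok]             # zero out the rejected token and retry
    raise ValueError("constraint rejects every token")

dist = {"cat": 0.5, "dog": 0.3, "42": 0.2}
print(rejection_sample(dist, allowed=str.isalpha))  # 'cat' or 'dog'
```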
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- Tokenization as Finite-State Transduction [24.19959327497118]
We introduce a finite-state framework which can efficiently encode all possible tokenizations of a regular language.
We show that Byte-Pair Encoding (BPE) and MaxMatch (WordPiece) fit within this framework.
An application of this is to guided generation, where the outputs of a language model are constrained to match some pattern.
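For a single string, the lattice the framework generalizes is easy to picture: every path through vocabulary edges from the start of the string to its end is one tokenization. A small enumeration sketch (our own illustration; the paper works with transducers over regular languages, not string-by-string enumeration):

```python
def all_tokenizations(s: str, vocab: set[str]) -> list[list[str]]:
    """Enumerate every segmentation of s into vocabulary tokens."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        if s[:i] in vocab:  # an edge from position 0 to position i
            results += [[s[:i]] + rest
                        for rest in all_tokenizations(s[i:], vocab)]
    return results

vocab = {"un", "do", "undo", "ing", "i", "n", "g"}
for toks in all_tokenizations("undoing", vocab):
    print(toks)  # ['un', 'do', 'ing'], ['undo', 'ing'], ...
```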
arXiv Detail & Related papers (2024-10-21T07:10:07Z)
- Tokenization Falling Short: On Subword Robustness in Large Language Models [12.193639356480851]
This study systematically investigates the challenges tokenization poses and their impact on language models. Our findings reveal that scaling model parameters can mitigate tokenization issues; our experiments further show that subword regularization such as BPE-dropout can also mitigate them.
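BPE-dropout itself is simple to sketch: while encoding, skip each applicable merge with probability p, so the same word receives different (still valid) segmentations across training steps.

```python
import random

def bpe_dropout_encode(word, merges, p=0.1):
    """BPE encoding where each merge opportunity is skipped with prob. p."""
    parts = list(word)
    for a, b in merges:  # merges in learned priority order
        i = 0
        while i < len(parts) - 1:
            if parts[i] == a and parts[i + 1] == b and random.random() >= p:
                parts[i:i + 2] = [a + b]   # apply the merge
            else:
                i += 1                     # no match, or merge dropped
    return parts

merges = [("t", "o"), ("to", "k"), ("tok", "e"), ("toke", "n")]
for _ in range(3):
    print(bpe_dropout_encode("token", merges, p=0.3))  # varies run to run
```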
arXiv Detail & Related papers (2024-06-17T16:05:32Z)
- Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities [15.073507986272027]
We argue that the most common method of aggregating subword probabilities into word probabilities poses a confound. This arises because tokens in the subword vocabularies of most language models have leading whitespace. We present a simple decoding technique that reallocates the probability of the trailing whitespace into that of the current word.
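A schematic toy of the confound and the reallocation (invented probabilities; the paper's exact procedure differs in detail):

```python
import math

# " lighthouse" might tokenize as [" light", "house"]: the word's first
# subtoken carries the *preceding* space, so a naive sum of subtoken
# log-probs charges the word for the boundary before it, never after it.
naive = math.log(0.010) + math.log(0.200)

# Reallocation: also credit the word with the probability that the *next*
# token marks a boundary (leading space or punctuation).
next_probs = {" keeper": 0.05, " of": 0.10, "s": 0.02, ",": 0.03}
p_boundary = sum(p for tok, p in next_probs.items() if tok[0] in " ,.;!?")

print(f"naive:       {naive:.3f}")
print(f"reallocated: {naive + math.log(p_boundary):.3f}")
```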
arXiv Detail & Related papers (2024-06-16T08:44:56Z)