Related papers: How Long Is a Piece of String? A Brief Empirical Analysis of Tokenizers

URL: http://arxiv.org/abs/2601.11518v1
Date: Fri, 16 Jan 2026 18:58:29 GMT
Title: How Long Is a Piece of String? A Brief Empirical Analysis of Tokenizers
Authors: Jonathan Roberts, Kai Han, Samuel Albanie
Abstract summary: Tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. Our analysis challenges commonly held intuitions about token lengths, finding them to be overly simplistic.
Score: 39.60188078597529
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Frontier LLMs are increasingly utilised across academia, society and industry. A commonly used unit for comparing models, their inputs and outputs, and estimating inference pricing is the token. In general, tokens are used as a stable currency, assumed to be broadly consistent across tokenizers and contexts, enabling direct comparisons. However, tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. We quantify this variation by providing a comprehensive empirical analysis of tokenization, exploring the compression of sequences to tokens across different distributions of textual data. Our analysis challenges commonly held heuristics about token lengths, finding them to be overly simplistic. We hope the insights of our study add clarity and intuition toward tokenization in contemporary LLMs.

Related papers

Are you going to finish that? A Practical Study of the Partial Token Problem [85.49816027251013]
Language models (LMs) are trained over sequences of tokens, whereas users interact with LMs via text. This mismatch gives rise to the partial token problem, which occurs when a user ends their prompt in the middle of the expected next token. In this work, we identify three domains where token and "word" boundaries often do not line up.
arXiv Detail & Related papers (2026-01-30T17:47:16Z)
SCOPE: Saliency-Coverage Oriented Token Pruning for Efficient Multimodel LLMs [59.415473779171315]
We propose a novel visual token pruning strategy called Saliency-Coverage Oriented token Pruning for Efficient MLLMs (SCOPE).
arXiv Detail & Related papers (2025-10-28T09:29:37Z)
FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution [3.4666771782038652]
Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. We introduce FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning.
arXiv Detail & Related papers (2025-10-18T10:22:13Z)
MARCOS: Deep Thinking by Markov Chain of Continuous Thoughts [82.46857666702924]
We present a new paradigm for reasoning in large language models (LLMs). Instead of autoregressively generating tokens, we model reasoning as a hidden Markov chain of continuous, high-dimensional "thoughts". For the first time, MARCOS achieves performance comparable to token-based CoT, even surpassing it by 4.7% on GSM8K with up to 15.7x speedup in inference.
arXiv Detail & Related papers (2025-09-29T16:44:22Z)
Predictive Auditing of Hidden Tokens in LLM APIs via Reasoning Length Estimation [7.928002407828304]
Commercial LLM services often conceal internal reasoning traces while still charging users for every generated token. PALACE estimates hidden reasoning token counts from prompt-answer pairs without access to internal traces. Experiments on math, coding, medical, and general reasoning benchmarks show that PALACE achieves low relative error and strong prediction accuracy.
arXiv Detail & Related papers (2025-07-29T19:50:55Z)
Sticking to the Mean: Detecting Sticky Tokens in Text Embedding Models [29.98662898456327]
"Sticky tokens" can undermine the reliability of embeddings in Transformer-based text embedding models. We show that sticky tokens disproportionately dominate the model's internal representations, raising concerns about tokenization robustness. Our findings show the need for better tokenization strategies and model design to mitigate the impact of sticky tokens in future text embedding applications.
arXiv Detail & Related papers (2025-07-24T08:13:16Z)
Causal Estimation of Tokenisation Bias [58.20086589761273]
We quantify the effect of including or not a subword in a tokeniser's vocabulary on the probability a trained model assigns to the corresponding characters. We find that tokenisation consistently affects models' outputs across scales, vocabularies, and tokenisers. Notably, a subword's presence in a small model's vocabulary may increase its characters' probability by up to 17 times.
arXiv Detail & Related papers (2025-06-03T17:59:47Z)
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z)
Forking Paths in Neural Text Generation [14.75166317633176]
We develop a novel approach to representing uncertainty dynamics across individual tokens of text generation. We use our method to analyze LLM responses on 7 different tasks across 4 domains. We find many examples of forking tokens, including surprising ones such as punctuation marks.
arXiv Detail & Related papers (2024-12-10T22:57:57Z)
A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners [58.15511660018742]
This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities. We develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems.
arXiv Detail & Related papers (2024-06-16T19:22:53Z)
Tokenization Is More Than Compression [14.939912120571728]
Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression. We introduce PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary.
arXiv Detail & Related papers (2024-02-28T14:52:15Z)

This list is automatically generated from the titles and abstracts of the papers on this site.