Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs
- URL: http://arxiv.org/abs/2512.21933v1
- Date: Fri, 26 Dec 2025 09:16:33 GMT
- Title: Broken Words, Broken Performance: Effect of Tokenization on Performance of LLMs
- Authors: Sachin Pawar, Manoj Apte, Kshitij Jadhav, Girish Keshav Palshikar, Nitin Ramrakhiyani
- Abstract summary: Tokenization is the first step in training any Large Language Model (LLM). We propose a set of penalty functions that compute a tokenization penalty for a given text for a specific LLM, indicating how "bad" the tokenization is.
- Score: 2.2574632480801484
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization is the first step in training any Large Language Model (LLM), where the text is split into a sequence of tokens as per the model's fixed vocabulary. This tokenization in LLMs is different from the traditional tokenization in NLP where the text is split into a sequence of "natural" words. In LLMs, a natural word may also be broken into multiple tokens due to limited vocabulary size of the LLMs (e.g., Mistral's tokenizer splits "martial" into "mart" and "ial"). In this paper, we hypothesize that such breaking of natural words negatively impacts LLM performance on various NLP tasks. To quantify this effect, we propose a set of penalty functions that compute a tokenization penalty for a given text for a specific LLM, indicating how "bad" the tokenization is. We establish statistical significance of our hypothesis on multiple NLP tasks for a set of different LLMs.
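The idea of a word-break tokenization penalty can be sketched with a toy greedy longest-match tokenizer (the paper's actual penalty functions and Mistral's real BPE merges are not reproduced here; the vocabulary, function names, and the simple broken-word fraction below are illustrative assumptions):

```python
# Toy vocabulary; a real LLM tokenizer has tens of thousands of entries.
VOCAB = {"mart", "ial", "arts", "the", "of", "law"}

def toy_tokenize(word):
    """Greedy longest-match segmentation over the toy vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:  # no match: fall back to a single character
            tokens.append(word[i])
            i += 1
    return tokens

def break_penalty(text):
    """One possible penalty: fraction of natural words split into >1 token."""
    words = text.split()
    broken = sum(1 for w in words if len(toy_tokenize(w)) > 1)
    return broken / len(words) if words else 0.0

print(toy_tokenize("martial"))                  # ['mart', 'ial']
print(break_penalty("the law of martial arts"))  # 0.2
```

Under this sketch, texts whose natural words survive tokenization intact score 0.0, and the score grows as more words are fragmented, matching the paper's intuition that such fragmentation correlates with degraded task performance.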
Related papers
- TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar [8.34539885321864]
We show that semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. We introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation.
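The whitespace sensitivity described above can be illustrated with a toy subword tokenizer (this is not TokDrift itself; the merge table and function name are invented for illustration):

```python
# Toy merge table mimicking how BPE merges can absorb surrounding whitespace.
MERGES = {" =", "= ", "x=", "=1"}

def toy_subword(code):
    """Merge adjacent character pairs when the pair is in the merge table."""
    tokens, i = [], 0
    while i < len(code):
        pair = code[i:i + 2]
        if pair in MERGES:
            tokens.append(pair)
            i += 2
        else:
            tokens.append(code[i])
            i += 1
    return tokens

a = toy_subword("x=1")    # ['x=', '1']
b = toy_subword("x = 1")  # ['x', ' =', ' ', '1']
print(a, b, a != b)
```

The two snippets are semantically identical Python assignments, yet the token sequences the model sees are entirely different, which is the kind of drift TokDrift's rewrite rules are designed to expose.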
arXiv Detail & Related papers (2025-10-16T17:59:45Z)
- Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP). We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z)
- TokAlign: Efficient Vocabulary Adaptation via Token Alignment [41.59130966729569]
Tokenization serves as a foundational step for Large Language Models (LLMs) to process text. In new domains or languages, an inefficient tokenizer slows down the training and generation of the LLM. We propose an efficient method named TokAlign to replace the vocabulary of an LLM based on token co-occurrences.
arXiv Detail & Related papers (2025-06-04T03:15:57Z)
- Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models [92.92512796044471]
We propose a new framework to explore the limit of unsupervised word segmentation with Large Language Models (LLMs). We employ current mainstream LLMs to perform word segmentation across multiple languages to assess LLMs' "comprehension". We introduce a novel unsupervised method, termed LLACA, which enables the construction of a dynamic $n$-gram model that adjusts based on contextual information.
arXiv Detail & Related papers (2025-05-26T07:48:15Z)
- Idiosyncrasies in Large Language Models [54.26923012617675]
We unveil and study idiosyncrasies in Large Language Models (LLMs). We find that fine-tuning text embedding models on LLM-generated texts yields excellent classification accuracy. We leverage LLMs as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies.
arXiv Detail & Related papers (2025-02-17T18:59:02Z)
- From Tokens to Words: On the Inner Lexicon of LLMs [7.148628740938674]
Natural language is composed of words, but modern large language models (LLMs) process sub-words as input. We present evidence that LLMs engage in an intrinsic detokenization process, where sub-word sequences are combined into coherent whole-word representations. Our findings suggest that LLMs maintain a latent vocabulary beyond the tokenizer's scope.
arXiv Detail & Related papers (2024-10-08T09:53:35Z)
- CUTE: Measuring LLMs' Understanding of Their Tokens [54.70665106141121]
Large Language Models (LLMs) show remarkable performance on a wide variety of tasks.
This raises the question: To what extent can LLMs learn orthographic information?
We propose a new benchmark, which features a collection of tasks designed to test the orthographic knowledge of LLMs.
arXiv Detail & Related papers (2024-09-23T18:27:03Z)
- Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs [20.1025293763531]
Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantically meaningful units like "north" or "east".
In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers.
arXiv Detail & Related papers (2024-06-28T17:54:47Z)
- Implicit Multimodal Alignment: On the Generalization of Frozen LLMs to Multimodal Inputs [63.29737699997859]
Large Language Models (LLMs) have demonstrated impressive performance on multimodal tasks, without any multimodal finetuning.
In this work, we expose frozen LLMs to image, video, audio and text inputs and analyse their internal representation.
arXiv Detail & Related papers (2024-05-26T21:31:59Z)
- IntactKV: Improving Large Language Model Quantization by Keeping Pivot Tokens Intact [46.32830393597601]
Large language models (LLMs) excel in natural language processing but demand intensive computation.
This paper unveils a previously overlooked type of outliers in LLMs.
We propose IntactKV to generate the KV cache of pivot tokens losslessly from the full-precision model.
arXiv Detail & Related papers (2024-03-02T16:05:26Z)
- Tokenization Matters: Navigating Data-Scarce Tokenization for Gender Inclusive Language Technologies [75.85462924188076]
Gender-inclusive NLP research has documented the harmful limitations of gender binary-centric large language models (LLMs).
We find that misgendering is significantly influenced by Byte-Pair Encoding (BPE) tokenization.
We propose two techniques: (1) pronoun tokenization parity, a method to enforce consistent tokenization across gendered pronouns, and (2) utilizing pre-existing LLM pronoun knowledge to improve neopronoun proficiency.
arXiv Detail & Related papers (2023-12-19T01:28:46Z)
- Transcormer: Transformer for Sentence Scoring with Sliding Language Modeling [95.9542389945259]
Sentence scoring aims at measuring the likelihood of a sentence and is widely used in many natural language processing scenarios.
We propose Transcormer -- a Transformer model with a novel sliding language modeling (SLM) for sentence scoring.
arXiv Detail & Related papers (2022-05-25T18:00:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.