LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers
- URL: http://arxiv.org/abs/2602.04706v1
- Date: Wed, 04 Feb 2026 16:19:05 GMT
- Title: LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers
- Authors: Yike Sun, Haotong Yang, Zhouchen Lin, Muhan Zhang
- Abstract summary: Intermediate merge residues in BPE vocabularies are tokens that are frequent during merge learning and are therefore retained in the final vocabulary, but that are mostly merged further and rarely emitted when the tokenizer is actually used on the corpus. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Experiments show that LiteToken reduces token fragmentation, reduces parameters, and improves robustness to noisy or misspelled inputs, while preserving overall performance.
- Score: 76.59130257385826
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization is fundamental to how language models represent and process text, yet the behavior of widely used BPE tokenizers has received far less study than model architectures and training. In this paper, we investigate intermediate merge residues in BPE vocabularies: tokens that are frequent during merge learning and are therefore retained in the final vocabulary, but that are mostly merged further and rarely emitted when the tokenizer is actually used on the corpus. Such low-frequency tokens not only waste vocabulary capacity but also increase vulnerability to adversarial or atypical inputs. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Because the affected tokens are rarely used, pretrained models can often accommodate the modified tokenizer without additional fine-tuning. Experiments show that LiteToken reduces token fragmentation, reduces parameters, and improves robustness to noisy or misspelled inputs, while preserving overall performance.
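To make the phenomenon concrete, the sketch below counts how often each vocabulary entry is actually emitted when a standard BPE tokenizer is run over a sample of text; vocabulary tokens that are never (or almost never) emitted are candidates for the intermediate merge residues described above. This is only an illustrative diagnostic under assumed choices (the GPT-2 encoding loaded via the tiktoken library, and a zero-emission threshold), not the LiteToken procedure itself.

```python
# Illustrative residue diagnostic (not the paper's LiteToken method):
# count how often each vocabulary token is emitted on a sample corpus and
# flag vocabulary entries that are (almost) never produced.
from collections import Counter

import tiktoken  # assumed stand-in tokenizer library; any BPE tokenizer works


def emission_counts(texts, encoding_name="gpt2"):
    """Count how many times each token id is emitted over the given texts."""
    enc = tiktoken.get_encoding(encoding_name)
    counts = Counter()
    for text in texts:
        # disallowed_special=() treats special-token strings as plain text
        counts.update(enc.encode(text, disallowed_special=()))
    return enc, counts


def residue_candidates(enc, counts, max_count=0):
    """Vocabulary tokens emitted at most `max_count` times on the sample."""
    rare = []
    for token_id in range(enc.n_vocab):
        if counts[token_id] <= max_count:
            try:
                rare.append((token_id, enc.decode_single_token_bytes(token_id)))
            except KeyError:
                pass  # ids that do not map to raw bytes (e.g. some special tokens)
    return rare


if __name__ == "__main__":
    sample = ["Tokenization is fundamental to how language models process text."]
    enc, counts = emission_counts(sample)
    print(len(residue_candidates(enc, counts)), "tokens never emitted on this tiny sample")
```

On a sample this small nearly every token is unseen, so in practice the counts would be taken over a corpus large enough to separate genuinely residual vocabulary entries from tokens that simply do not occur in a short excerpt.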
Related papers
- SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Modeling [6.185573921868495]
SemToken is a semantic-aware tokenization framework that reduces token redundancy and improves efficiency. It can be seamlessly integrated with modern language models and attention acceleration methods. Our findings suggest that semantic structure offers a promising new axis for optimizing tokenization and computation in large language models.
arXiv Detail & Related papers (2025-08-21T03:01:53Z) - Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations [83.93566096400723]
We find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization. Character-level segmentation improves string manipulation and code understanding tasks by up to +14%. Right-aligned digit grouping enhances large-number arithmetic by +33%.
arXiv Detail & Related papers (2025-06-23T18:02:26Z) - Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP). We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z) - Achieving Tokenizer Flexibility in Language Models through Heuristic Adaptation and Supertoken Learning [0.0]
Pretrained large language models (LLMs) are often constrained by their fixed tokenization schemes. The framework introduces two innovations: first, Tokenadapt, a model-agnostic tokenizer transplantation method, and second, a novel pre-tokenization scheme for multi-word Supertokens.
arXiv Detail & Related papers (2025-05-14T19:00:27Z) - Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z) - SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator [65.62084602011596]
Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. We have identified a key pattern: certain seemingly meaningless separator tokens (i.e., punctuation) contribute disproportionately to attention scores compared to semantically meaningful tokens. We introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens.
arXiv Detail & Related papers (2024-12-16T18:58:57Z) - LBPE: Long-token-first Tokenization to Improve Large Language Models [26.3619552256488]
Long tokens, rich in semantic information, have fewer occurrences in tokenized datasets compared to short tokens.
We propose LBPE, which prioritizes long tokens during the encoding process.
Experiments across diverse language modeling tasks demonstrate that LBPE consistently outperforms the original BPE.
arXiv Detail & Related papers (2024-11-08T12:03:36Z) - Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers [27.127016750061944]
We investigate incomplete tokens, i.e., undecodable tokens with stray bytes resulting from byte-level byte-pair encoding (BPE) tokenization. Our experiments show that improbable bigrams are significantly prone to hallucinatory behaviors.
arXiv Detail & Related papers (2024-10-31T07:19:44Z) - Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding (BPE) algorithm.
It can be used to train a high-quality tokenizer on a basic laptop; a minimal sketch of the merge loop such an implementation revolves around appears after this list.
arXiv Detail & Related papers (2024-08-05T09:37:21Z) - Scaffold-BPE: Enhancing Byte Pair Encoding for Large Language Models with Simple and Effective Scaffold Token Removal [58.29382184006158]
We propose Scaffold-BPE, which incorporates a dynamic scaffold token removal mechanism through parameter-free, computation-light, and easy-to-implement modifications to the original BPE method.
In extensive experiments across language modeling and even machine translation, Scaffold-BPE consistently outperforms the original BPE.
arXiv Detail & Related papers (2024-04-27T07:12:07Z)
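For background on where such intermediate tokens come from in the first place (and on what a pure-Python BPE implementation like the BatchBPE entry above boils down to), here is a minimal sketch of the classic BPE merge-learning loop. It is a generic textbook-style illustration with an invented toy corpus, not BatchBPE's or LiteToken's actual code: tokens created by early merges (such as "we" below) can later be absorbed into longer merges and end up as the rarely emitted residues studied in the main paper.

```python
# Minimal sketch of the classic BPE merge-learning loop (illustrative only):
# repeatedly merge the most frequent adjacent symbol pair across the corpus.
from collections import Counter


def learn_bpe(word_freqs, num_merges):
    """word_freqs maps a word (as a tuple of symbols) to its corpus frequency."""
    vocab = dict(word_freqs)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the chosen merge to every word in the corpus.
        merged_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            key = tuple(out)
            merged_vocab[key] = merged_vocab.get(key, 0) + freq
        vocab = merged_vocab
    return merges


if __name__ == "__main__":
    toy_corpus = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6}
    print(learn_bpe(toy_corpus, 6))
```

The BatchBPE entry above presumably batches and optimizes this counting-and-merging step; the sketch is only meant to show the structure from which the main paper's residue tokens emerge.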
This list is automatically generated from the titles and abstracts of the papers in this site.