From Characters to Tokens: Dynamic Grouping with Hierarchical BPE
- URL: http://arxiv.org/abs/2510.15517v1
- Date: Fri, 17 Oct 2025 10:42:10 GMT
- Title: From Characters to Tokens: Dynamic Grouping with Hierarchical BPE
- Authors: Rares Dolga, Lucas Maystre, Tudor Berariu, David Barber,
- Abstract summary: Subword tokenization methods are widely used in large language models. They suffer from inefficiencies in representing rare words and require large embedding matrices. We propose a dynamic character grouping method that leverages the structure of existing BPE tokenization.
- Score: 7.301118515210817
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Subword tokenization methods like Byte Pair Encoding (BPE) are widely used in large language models due to their balance of vocabulary compactness and representational power. However, they suffer from inefficiencies in representing rare words and require large embedding matrices. Character-level models address these issues but introduce performance bottlenecks, particularly in Transformer-based architectures. Recent hierarchical models attempt to merge the benefits of both paradigms by grouping characters into patches, but existing patching strategies either rely on whitespace, limiting applicability to certain languages, or require auxiliary models that introduce new dependencies. In this paper, we propose a dynamic character grouping method that leverages the structure of existing BPE tokenization without requiring additional models. By appending explicit end-of-patch markers to BPE tokens and introducing a second-level BPE compression stage to control patch granularity, our method offers efficient, flexible, and language-agnostic representations. Empirical results demonstrate that our approach matches or exceeds the performance of dynamic entropy- and whitespace-based patching strategies, while maintaining a compact vocabulary.
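The grouping step the abstract describes can be sketched in a few lines: given an existing BPE segmentation, each token's characters form one patch, terminated by an explicit end-of-patch marker. This is an illustrative sketch, not the authors' code; the marker symbol and function name below are placeholders.

```python
# Illustrative sketch (not the paper's implementation): derive character
# patches from an existing BPE segmentation by marking each token's end.
EOP = "\u241f"  # hypothetical end-of-patch marker symbol

def bpe_to_patches(bpe_tokens):
    """Each BPE token becomes one character patch, terminated by EOP."""
    return [list(tok) + [EOP] for tok in bpe_tokens]

patches = bpe_to_patches(["un", "believ", "able"])
# three patches: u n <EOP> | b e l i e v <EOP> | a b l e <EOP>
```

The paper additionally applies a second-level BPE compression stage to control patch granularity, which this sketch omits.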
Related papers
- LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers [76.59130257385826]
Intermediate merge residues in BPE vocabularies occur frequently during merge learning and are therefore retained in the final vocabulary, but they are mostly merged further and rarely emitted when the tokenizer is applied to a corpus. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Experiments show that LiteToken reduces token fragmentation, reduces parameters, and improves robustness to noisy or misspelled inputs, while preserving overall performance.
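The residue-identification idea can be sketched simply: count how often each vocabulary token is actually emitted when tokenizing a corpus, and flag tokens that fall below a threshold. A minimal sketch under that assumption; the function name and threshold are hypothetical, not LiteToken's actual criterion.

```python
from collections import Counter

def residue_tokens(vocab, corpus_tokens, min_count=1):
    """Sketch: tokens retained from BPE merge learning but (almost) never
    emitted when tokenizing the corpus are candidates for removal."""
    emitted = Counter(corpus_tokens)
    return [t for t in vocab if emitted[t] < min_count]

# "th" and "he" are learned as intermediate merges on the way to "the",
# but only "the" is ever emitted for this toy corpus.
print(residue_tokens(["th", "he", "the"], ["the", "the"]))
```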
arXiv Detail & Related papers (2026-02-04T16:19:05Z) - AdaptBPE: From General Purpose to Specialized Tokenizers [18.70903226766322]
We propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks.
arXiv Detail & Related papers (2026-01-29T12:59:40Z) - Positional Preservation Embedding for Multimodal Large Language Models [20.307929204794917]
Multimodal large language models (MLLMs) have achieved strong performance on vision-language tasks, yet often suffer from inefficiencies due to redundant visual tokens. In this work, we propose Positional Preservation Embedding (PPE), a novel encoding that preserves spatial structure during token compression. We show that PPE can effectively support clustering -- a progressive token compression strategy that leads to better performance retention.
arXiv Detail & Related papers (2025-10-27T02:40:02Z) - Dynamic Chunking for End-to-End Hierarchical Sequence Modeling [17.277753030570263]
We introduce techniques that enable a dynamic chunking mechanism which automatically learns content- and context-dependent segmentation strategies. Incorporating this into an explicit hierarchical network (H-Net) allows replacing the (implicitly hierarchical) tokenization-LM-detokenization pipeline with a single model learned fully end-to-end. Iterating the hierarchy to multiple stages further increases its performance by modeling multiple levels of abstraction. H-Nets pretrained on English show significantly increased character-level robustness, and qualitatively learn meaningful data-dependent chunking strategies without any heuristics or explicit supervision.
arXiv Detail & Related papers (2025-07-10T17:39:37Z) - Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP). We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z) - MorphTok: Morphologically Grounded Tokenization for Indian Languages [23.58043476541051]
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs). We propose morphology-aware segmentation as a pre-tokenization step prior to applying subword tokenization. We also introduce Constrained BPE, an extension to the traditional BPE algorithm that incorporates script-specific constraints.
arXiv Detail & Related papers (2025-04-14T15:44:45Z) - Hierarchical Autoregressive Transformers: Combining Byte- and Word-Level Processing for Robust, Adaptable Language Models [3.382910438968506]
Tokenization is a fundamental step in natural language processing, breaking text into units that computational models can process. We investigate a hierarchical architecture for autoregressive language modelling that combines character-level and word-level processing. We demonstrate, at scales up to 7 billion parameters, that hierarchical transformers match the downstream task performance of subword-tokenizer-based models.
arXiv Detail & Related papers (2025-01-17T17:51:53Z) - Rethinking Addressing in Language Models via Contextualized Equivariant Positional Encoding [89.52931576290976]
We present conTextualized equivariAnt Positional Encoding (TAPE), a novel framework that enhances positional embeddings by incorporating sequence content across layers. Our method can be easily integrated into pre-trained transformers, offering parameter-efficient fine-tuning with minimal overhead.
arXiv Detail & Related papers (2025-01-01T03:23:00Z) - When Every Token Counts: Optimal Segmentation for Low-Resource Language Models [0.0]
We show that an optimal Byte-Pair Encoding (BPE) configuration significantly reduces token count compared to greedy segmentation. Our findings suggest that compression-optimized tokenization strategies could provide substantial advantages for multilingual and low-resource language applications.
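Finding a segmentation with the fewest tokens, as opposed to greedy left-to-right merging, can be cast as a standard dynamic-programming problem over cut points. A minimal sketch, assuming every single character appears in the vocabulary (the function name is illustrative, not from the paper):

```python
def min_tokens(text, vocab):
    """DP sketch: segment `text` into the fewest tokens drawn from `vocab`.
    best[i] = minimal token count for the prefix text[:i]; cut[i] records
    where the last token of that optimal prefix segmentation starts."""
    n = len(text)
    best = [0] + [float("inf")] * n
    cut = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(i):
            if text[j:i] in vocab and best[j] + 1 < best[i]:
                best[i], cut[i] = best[j] + 1, j
    out, i = [], n
    while i > 0:  # backtrack through the recorded cut points
        out.append(text[cut[i]:i])
        i = cut[i]
    return out[::-1]

# Greedy longest-match from the left would pick "ab" first and get
# ["ab", "b", "a"] (3 tokens); the DP finds the 2-token segmentation.
print(min_tokens("abba", {"a", "b", "ab", "abb"}))
```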
arXiv Detail & Related papers (2024-12-09T19:11:54Z) - OFA: A Framework of Initializing Unseen Subword Embeddings for Efficient Large-scale Multilingual Continued Pretraining [49.213120730582354]
Instead of pretraining multilingual language models from scratch, a more efficient method is to adapt existing pretrained language models (PLMs) to new languages via vocabulary extension and continued pretraining.
We propose a novel framework, One For All (OFA), which wisely initializes the embeddings of unseen subwords and thus can adapt a PLM to multiple languages efficiently and effectively.
arXiv Detail & Related papers (2023-11-15T10:40:45Z) - Pushdown Layers: Encoding Recursive Structure in Transformer Language Models [86.75729087623259]
Recursion is a prominent feature of human language, and fundamentally challenging for self-attention.
This work introduces Pushdown Layers, a new self-attention layer.
Transformers equipped with Pushdown Layers achieve dramatically better syntactic generalization with 3-5x greater sample efficiency.
arXiv Detail & Related papers (2023-10-29T17:27:18Z) - Improve Variational Autoencoder for Text Generation with Discrete Latent Bottleneck [52.08901549360262]
Variational autoencoders (VAEs) are essential tools in end-to-end representation learning.
VAEs tend to ignore latent variables with a strong auto-regressive decoder.
We propose a principled approach to enforce an implicit latent feature matching in a more compact latent space.
arXiv Detail & Related papers (2020-04-22T14:41:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.