ByteSpan: Information-Driven Subword Tokenisation
- URL: http://arxiv.org/abs/2506.18639v1
- Date: Mon, 23 Jun 2025 13:42:00 GMT
- Title: ByteSpan: Information-Driven Subword Tokenisation
- Authors: Zébulon Goriely, Suchir Salhan, Pietro Lesci, Julius Cheng, Paula Buttery
- Abstract summary: We propose a new information-driven subword tokeniser, ByteSpan, that uses an external byte-level LM during training to identify contiguous predictable byte sequences. Experiments show that ByteSpan yields efficient vocabularies with higher morphological alignment scores than BPE for English.
- Score: 2.4723044036055306
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent dynamic tokenisation methods operate directly on bytes and pool their latent representations into patches. This bears similarities to computational models of word segmentation that determine lexical boundaries using spikes in an autoregressive model's prediction error. Inspired by this connection, we explore whether grouping predictable bytes - rather than pooling their representations - can yield a useful fixed subword vocabulary. We propose a new information-driven subword tokeniser, ByteSpan, that uses an external byte-level LM during training to identify contiguous predictable byte sequences and group them into subwords. Experiments show that ByteSpan yields efficient vocabularies with higher morphological alignment scores than BPE for English. Multilingual experiments show similar compression and Rényi efficiency for 25 languages.
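To make the grouping idea concrete, here is a minimal Python sketch of how per-byte surprisal from an external byte-level LM could be turned into subword spans and then a fixed vocabulary. The `lm_surprisal` callable and the single-threshold boundary rule are illustrative assumptions, not the paper's actual constraints or vocabulary-construction procedure.

```python
import collections

def segment(byte_seq, surprisals, threshold=3.0):
    """Split a byte sequence into spans of predictable bytes: start a new
    span whenever the LM's surprisal spikes above a threshold (a hypothetical
    boundary rule; the paper explores several constraints).
    Assumes a non-empty sequence with one surprisal value per byte."""
    spans, current = [], [byte_seq[0]]
    for b, s in zip(byte_seq[1:], surprisals[1:]):
        if s > threshold:          # unpredictable byte => lexical boundary
            spans.append(bytes(current))
            current = [b]
        else:                      # predictable byte => extend the current span
            current.append(b)
    spans.append(bytes(current))
    return spans

def build_vocab(corpus, lm_surprisal, vocab_size=8000, threshold=3.0):
    """Count the spans produced over a training corpus and keep the most
    frequent ones as a fixed subword vocabulary.
    `lm_surprisal` is a hypothetical callable returning per-byte surprisal."""
    counts = collections.Counter()
    for text in corpus:
        data = text.encode("utf-8")
        counts.update(segment(data, lm_surprisal(data), threshold))
    return [span for span, _ in counts.most_common(vocab_size)]
```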
Related papers
- Sampling from Your Language Model One Byte at a Time [82.71473348639489]
Tokenization can introduce distortion into the model's generations, known as the Prompt Boundary Problem (PBP). We present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers.
arXiv Detail & Related papers (2025-06-17T02:37:04Z)
- Comparative analysis of subword tokenization approaches for Indian languages [5.012314384895538]
Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. Subword tokenization enhances this process by breaking down words into smaller subword units. It is useful in capturing the intricate structure of words in Indian languages (ILs), such as prefixes, suffixes, and other morphological variations. This paper examines how different subword tokenization techniques, such as SentencePiece, Byte Pair, and WordPiece Tokenization, affect ILs.
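As a hedged illustration of how such comparisons can be set up, the sketch below trains a BPE and a WordPiece tokenizer on the same toy corpus with the Hugging Face `tokenizers` library and prints how each segments a morphologically complex word. The corpus, vocabulary size, and example word are placeholders, not the paper's setup.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE, WordPiece
from tokenizers.trainers import BpeTrainer, WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Toy corpus standing in for real training text (placeholder data).
corpus = ["unhappiness unkindness unfairness happiness kindness fairness"] * 100

def train(model, trainer):
    tok = Tokenizer(model)
    tok.pre_tokenizer = Whitespace()
    tok.train_from_iterator(corpus, trainer)
    return tok

bpe = train(BPE(unk_token="[UNK]"),
            BpeTrainer(vocab_size=60, special_tokens=["[UNK]"]))
wp = train(WordPiece(unk_token="[UNK]"),
           WordPieceTrainer(vocab_size=60, special_tokens=["[UNK]"]))

# Compare how each scheme segments the same morphologically complex word.
print("BPE:      ", bpe.encode("unhappiness").tokens)
print("WordPiece:", wp.encode("unhappiness").tokens)
```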
arXiv Detail & Related papers (2025-05-22T16:24:37Z)
- MorphTok: Morphologically Grounded Tokenization for Indian Languages [23.58043476541051]
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs). We propose morphology-aware segmentation as a pre-tokenization step prior to applying subword tokenization. We also introduce Constrained BPE, an extension to the traditional BPE algorithm that incorporates script-specific constraints.
arXiv Detail & Related papers (2025-04-14T15:44:45Z)
- Retrofitting Large Language Models with Dynamic Tokenization [3.608780819053423]
Current language models (LMs) use a fixed, static subword tokenizer. This default choice typically results in degraded efficiency and language capabilities, especially in languages other than English. We propose retrofitting LMs with dynamic tokenization: a way to dynamically decide on token boundaries based on the input text.
arXiv Detail & Related papers (2024-11-27T17:51:58Z)
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source, pure-Python implementation of the Byte Pair Encoding (BPE) algorithm.
It can be used to train a high-quality tokenizer on a basic laptop.
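For reference, a minimal pure-Python version of the underlying BPE merge loop (not the BatchBPE implementation itself, just the classic algorithm it accelerates by batching merges) might look like this:

```python
import collections

def train_bpe(words, num_merges):
    """Classic unbatched BPE training: repeatedly merge the most frequent
    adjacent symbol pair across the word-frequency table."""
    # word frequencies, with each word stored as a tuple of single-character symbols
    vocab = collections.Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = collections.Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():   # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges

# Toy usage on a tiny word list (illustrative only).
print(train_bpe(["lower", "lowest", "newer", "wider"] * 10, num_merges=5))
```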
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [81.83460411131931]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose MAGNET, an adaptive gradient-based subword tokenization method, to reduce over-segmentation in multilingual settings.
arXiv Detail & Related papers (2024-07-11T18:59:21Z)
- An Analysis of BPE Vocabulary Trimming in Neural Machine Translation [56.383793805299234]
Vocabulary trimming is a post-processing step that replaces rare subwords with their component subwords.
We show that vocabulary trimming fails to improve performance and can even cause heavy degradation.
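A toy sketch of what such trimming does, assuming we have token frequency counts and a BPE merge table that records which two pieces formed each merged token (both hypothetical here):

```python
def trim_vocab(vocab_counts, merges, min_freq):
    """Toy vocabulary trimming: drop subwords seen fewer than `min_freq`
    times and expand them back into the two pieces that BPE merged to
    create them (recursively, if a piece was itself trimmed).
    `merges` maps a merged token to its (left, right) components."""
    trimmed = {tok for tok, c in vocab_counts.items()
               if c < min_freq and tok in merges}

    def expand(token):
        if token not in trimmed:
            return [token]
        left, right = merges[token]
        return expand(left) + expand(right)

    def retokenize(tokens):
        out = []
        for tok in tokens:
            out.extend(expand(tok))
        return out

    return trimmed, retokenize

# Hypothetical counts and merge table, for illustration only.
counts = {"un": 500, "happi": 3, "ness": 200, "happ": 150, "i": 900}
merges = {"happi": ("happ", "i")}
trimmed, retok = trim_vocab(counts, merges, min_freq=10)
print(retok(["un", "happi", "ness"]))   # ['un', 'happ', 'i', 'ness']
```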
arXiv Detail & Related papers (2024-03-30T15:29:49Z)
- Local Byte Fusion for Neural Machine Translation [19.16966721276286]
Subword tokenization schemes are the dominant technique used in current NLP models.
Byte-based methods, i.e. tokenization into byte sequences, are an alternative.
Experiments on multilingual translation, zero-shot cross-lingual transfer, and domain adaptation reveal a consistent improvement over traditional models.
arXiv Detail & Related papers (2022-05-23T17:49:02Z)
- A Vocabulary-Free Multilingual Neural Tokenizer for End-to-End Task Learning [8.052271364177988]
Subword tokenization is a commonly used input pre-processing step in most recent NLP models.
We propose a vocabulary-free neural tokenizer by distilling segmentation information from subword tokenization.
Our tokenizer consistently improves performance on multilingual (NLI) and code-switching (sentiment analysis) tasks.
arXiv Detail & Related papers (2022-04-22T16:50:49Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Charformer: Fast Character Transformers via Gradient-based Subword Tokenization [50.16128796194463]
We propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model.
We introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters.
We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level.
arXiv Detail & Related papers (2021-06-23T22:24:14Z)
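A rough numpy sketch of the GBST idea: mean-pool candidate character blocks of several sizes, score them, and mix them with a per-position softmax before downsampling. It simplifies the published module; the scoring vector and block sizes stand in for the learned components described in the paper.

```python
import numpy as np

def gbst(char_emb, w, block_sizes=(1, 2, 3, 4), downsample=2):
    """Toy gradient-based subword tokenization (GBST) in the spirit of
    Charformer: form candidate blocks of several sizes, score them, and take
    a softmax-weighted mixture at every position, then downsample.

    char_emb: (seq_len, dim) character/byte embeddings
    w:        (dim,) scoring vector (stands in for a learned projection)
    """
    seq_len, dim = char_emb.shape
    candidates = []                        # one (seq_len, dim) candidate per block size
    for b in block_sizes:
        pad = (-seq_len) % b               # pad so the length is a multiple of b
        x = np.pad(char_emb, ((0, pad), (0, 0)))
        blocks = x.reshape(-1, b, dim).mean(axis=1)         # mean-pool blocks of size b
        upsampled = np.repeat(blocks, b, axis=0)[:seq_len]   # broadcast back to positions
        candidates.append(upsampled)
    cand = np.stack(candidates, axis=1)                      # (seq_len, n_sizes, dim)

    # score each candidate block and softmax over block sizes per position
    scores = cand @ w                                        # (seq_len, n_sizes)
    probs = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    latent = (probs[..., None] * cand).sum(axis=1)           # (seq_len, dim)

    # downsample by mean-pooling with a fixed stride
    pad = (-seq_len) % downsample
    latent = np.pad(latent, ((0, pad), (0, 0)))
    return latent.reshape(-1, downsample, dim).mean(axis=1)

rng = np.random.default_rng(0)
emb = rng.normal(size=(12, 8))        # 12 "characters", 8-dim embeddings
out = gbst(emb, rng.normal(size=8))
print(out.shape)                      # (6, 8): half as many positions after downsampling
```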