MorphTok: Morphologically Grounded Tokenization for Indian Languages
- URL: http://arxiv.org/abs/2504.10335v2
- Date: Fri, 07 Nov 2025 15:53:49 GMT
- Title: MorphTok: Morphologically Grounded Tokenization for Indian Languages
- Authors: Maharaj Brahma, N J Karthika, Atul Singh, Devaraj Adiga, Smruti Bhate, Ganesh Ramakrishnan, Rohit Saluja, Maunendra Sankar Desarkar
- Abstract summary: Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs).
We propose morphology-aware segmentation as a pre-tokenization step before applying the classical Byte-Pair Encoding (BPE) algorithm.
To handle the dependent vowels common in syllable-based writing systems, we propose Constrained BPE (CBPE).
CBPE constrains dependent vowels to form a cohesive unit with neighboring characters instead of occurring as standalone tokens.
- Score: 18.594241501479747
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs), impacting downstream performance, computational cost, and efficiency. Existing LLMs rely on the classical Byte-Pair Encoding (BPE) algorithm for subword tokenization, which greedily merges frequent character bigrams and often yields segmentations that do not align with linguistically meaningful units. To address this, we propose morphology-aware segmentation as a pre-tokenization step before applying BPE. To facilitate morphology-aware segmentation, we create a novel dataset for Hindi and Marathi, incorporating sandhi splitting to enhance subword tokenization. Experiments on downstream tasks show that morphologically grounded tokenization improves machine translation and language modeling performance. Additionally, to handle the dependent vowels common in the syllable-based writing systems used by Indic languages, we propose Constrained BPE (CBPE), an extension of the standard BPE algorithm that incorporates script-specific constraints. In particular, CBPE constrains dependent vowels to combine with neighboring characters into cohesive units rather than occurring as standalone tokens. Our results show that CBPE achieves a 1.68% reduction in fertility (average tokens per word) while maintaining comparable or improved downstream performance in machine translation and language modeling, offering a computationally efficient alternative to standard BPE. Moreover, to evaluate segmentation across different tokenization algorithms, we introduce a new human evaluation metric, EvalTok, enabling more human-grounded assessment.
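The dependent-vowel constraint is easy to illustrate. Below is a minimal sketch of one plausible reading of CBPE, not the authors' released implementation: Devanagari dependent vowel signs (matras) are fused with the preceding symbol before any frequency-based merging, so a matra can never surface as a standalone token, and ordinary BPE merge learning then proceeds on top. The DEPENDENT_SIGNS range and helper names are assumptions for illustration.

```python
from collections import Counter

# Devanagari dependent vowel signs (matras), U+093E..U+094C; kept minimal
# here for illustration. A full implementation would cover more signs.
DEPENDENT_SIGNS = {chr(cp) for cp in range(0x093E, 0x094D)}

def initial_symbols(word: str) -> list[str]:
    """Attach each dependent sign to the preceding symbol so it can
    never occur as a standalone token (the CBPE-style constraint)."""
    symbols: list[str] = []
    for ch in word:
        if ch in DEPENDENT_SIGNS and symbols:
            symbols[-1] += ch            # fuse matra with its consonant
        else:
            symbols.append(ch)
    return symbols

def learn_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Plain BPE merge learning over the constrained initial symbols."""
    words = Counter(tuple(initial_symbols(w)) for w in corpus)
    merges: list[tuple[str, str]] = []
    for _ in range(num_merges):
        pairs: Counter = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_words: Counter = Counter()
        for word, freq in words.items():
            out: list[str] = []
            i = 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges
```

Since fertility is the average number of tokens produced per word, eliminating standalone matras removes a whole class of single-character tokens, which is consistent with the reported 1.68% fertility reduction.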
Related papers
- AdaptBPE: From General Purpose to Specialized Tokenizers [18.70903226766322]
We propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus.
Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size.
This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks.
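A hedged sketch of the swap described above; the whole-word candidate heuristic and all names here are assumptions rather than AdaptBPE's exact procedure:

```python
from collections import Counter

def adapt_vocab(vocab: set[str],
                tokenize,                    # callable: str -> list[str]
                adaptation_corpus: list[str],
                num_swaps: int) -> set[str]:
    # How often is each existing token actually used on the new domain?
    usage: Counter = Counter()
    for text in adaptation_corpus:
        usage.update(tokenize(text))
    # Low-utility tokens: those used rarely (or never) on the new domain.
    low_utility = sorted(vocab, key=lambda t: usage[t])[:num_swaps]
    # Replacement candidates: frequent domain words that the current
    # vocabulary does not already contain as single tokens.
    words = Counter(w for text in adaptation_corpus for w in text.split())
    swaps = [w for w, _ in words.most_common() if w not in vocab][:num_swaps]
    return (vocab - set(low_utility)) | set(swaps)
```

Keeping the vocabulary size fixed (one token out, one token in) is what makes this feel like vocabulary fine-tuning rather than retraining.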
arXiv Detail & Related papers (2026-01-29T12:59:40Z)
- Tokens with Meaning: A Hybrid Tokenization Approach for NLP [0.2826977330147589]
Tokenization plays a pivotal role in natural language processing (NLP).
We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation.
The method uses phonological normalization, root-affix analysis, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
arXiv Detail & Related papers (2025-08-19T22:17:42Z)
- Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [62.35048154917945]
Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines.
Standard algorithms for learning tokenizers rely on frequency-based objectives.
We introduce Parity-aware Byte-Pair Encoding (BPE).
We find empirically that Parity-aware BPE leads to more equitable token counts across languages.
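One way to make the parity objective concrete, as a sketch of the stated goal rather than the paper's exact selection rule: at each step, learn the next merge from whichever language currently compresses worst, then apply it everywhere.

```python
from collections import Counter

def worst_parity_language(tokenized: dict[str, list[list[str]]]) -> str:
    """tokenized: language -> list of words, each a list of symbols.
    Returns the language with the most symbols per character."""
    def symbols_per_char(words: list[list[str]]) -> float:
        n_sym = sum(len(w) for w in words)
        n_chr = sum(len("".join(w)) for w in words)
        return n_sym / max(n_chr, 1)
    return max(tokenized, key=lambda lang: symbols_per_char(tokenized[lang]))

def next_merge(words: list[list[str]]) -> tuple[str, str]:
    """Most frequent adjacent pair in the chosen language's data; the
    resulting merge would then be applied to all languages' data."""
    pairs: Counter = Counter()
    for w in words:
        pairs.update(zip(w, w[1:]))
    return max(pairs, key=pairs.get)
```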
arXiv Detail & Related papers (2025-08-06T18:14:43Z)
- Entropy-Driven Pre-Tokenization for Byte-Pair Encoding [4.145560327709288]
Two entropy-informed pre-tokenization strategies guide BPE segmentation using unsupervised information-theoretic cues.
We evaluate both methods on a subset of the PKU dataset and demonstrate substantial improvements in segmentation precision, recall, and F1 score compared to standard BPE.
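A sketch of one standard entropy cue (right branching entropy), shown only to make the information-theoretic idea concrete; the paper's two strategies may differ in detail. A pre-token boundary is favored after positions where the next character is hard to predict from the preceding n-gram.

```python
import math
from collections import Counter, defaultdict

def boundary_scores(text: str, n: int = 2) -> list[float]:
    """Entropy of the next-character distribution after each n-gram
    context; high values suggest a pre-token boundary at that point."""
    successors: dict[str, Counter] = defaultdict(Counter)
    for i in range(len(text) - n):
        successors[text[i:i + n]][text[i + n]] += 1
    scores: list[float] = []
    for i in range(n, len(text)):
        dist = successors[text[i - n:i]]
        total = sum(dist.values())
        h = -sum((c / total) * math.log2(c / total) for c in dist.values())
        scores.append(h)   # score for a boundary between text[:i] and text[i:]
    return scores
```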
arXiv Detail & Related papers (2025-06-18T21:25:55Z)
- HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization [50.27950279695363]
Many pre-trained language models (PLMs) exhibit suboptimal performance on mid- and low-resource languages.
A common strategy to address this is to introduce new tokens specific to the target languages, initialize their embeddings, and apply continual pre-training on target-language data.
We propose HYPEROFA, a hypernetwork-based approach for more adaptive token embedding initialization.
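The pipeline can be made concrete with a deliberately simplified stand-in: learn a mapping from external multilingual word vectors into the PLM's embedding space on tokens present in both, then apply it to the new tokens. HYPEROFA uses a hypernetwork for this mapping; the least-squares linear map below is only a placeholder to show the data flow.

```python
import numpy as np

def init_new_embeddings(ext_vecs: dict[str, np.ndarray],
                        plm_emb: dict[str, np.ndarray],
                        new_tokens: list[str]) -> dict[str, np.ndarray]:
    # Fit the mapping on tokens that have both an external vector and
    # an existing PLM embedding.
    shared = [t for t in plm_emb if t in ext_vecs]
    X = np.stack([ext_vecs[t] for t in shared])      # (n, d_ext)
    Y = np.stack([plm_emb[t] for t in shared])       # (n, d_plm)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)        # (d_ext, d_plm) map
    # Initialize new-token embeddings by mapping their external vectors.
    return {t: ext_vecs[t] @ W for t in new_tokens if t in ext_vecs}
```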
arXiv Detail & Related papers (2025-04-21T19:40:32Z)
- SuperBPE: Space Travel for Language Models [112.64910939119056]
We introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm.
SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average.
Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks.
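A self-contained sketch of the two-stage curriculum (the real SuperBPE implementation has more machinery, such as byte-level processing and a tuned transition point): stage 1 learns merges inside whitespace pretokens; stage 2 re-segments whole lines with spaces kept as symbols, replays the stage-1 merges, and keeps learning so later merges can span word boundaries.

```python
from collections import Counter

def learn_merges(seqs: list[list[str]], budget: int) -> list[tuple[str, str]]:
    """Plain in-place BPE merge learning over symbol sequences."""
    merges: list[tuple[str, str]] = []
    for _ in range(budget):
        pairs: Counter = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        a, b = max(pairs, key=pairs.get)
        merges.append((a, b))
        for s in seqs:
            i = 0
            while i < len(s) - 1:
                if (s[i], s[i + 1]) == (a, b):
                    s[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges

def train_superbpe(text: str, stage1_merges: int, stage2_merges: int):
    # Stage 1: whitespace pretokenization; merges cannot cross spaces.
    word_seqs = [list(w) for w in text.split()]
    merges = learn_merges(word_seqs, stage1_merges)
    # Stage 2: lift the barrier. Keep spaces as symbols, replay the
    # stage-1 merges, then continue learning so new merges may span
    # word boundaries ("superwords").
    line = [list(" ".join(text.split()))]
    for a, b in merges:
        for s in line:
            i = 0
            while i < len(s) - 1:
                if (s[i], s[i + 1]) == (a, b):
                    s[i:i + 2] = [a + b]
                else:
                    i += 1
    merges += learn_merges(line, stage2_merges)
    return merges
```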
arXiv Detail & Related papers (2025-03-17T17:53:23Z)
- When Every Token Counts: Optimal Segmentation for Low-Resource Language Models [0.0]
We show that an optimal Byte-Pair Encoding (BPE) configuration significantly reduces token count compared to greedy segmentation.
Our findings suggest that compression-optimized tokenization strategies could provide substantial advantages for multilingual and low-resource language applications.
arXiv Detail & Related papers (2024-12-09T19:11:54Z)
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source pure-Python implementation of the Byte Pair Encoding algorithm.
It can train a high-quality tokenizer on a basic laptop.
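The batching idea can be sketched as follows, with the conflict rule and batch size as illustrative assumptions rather than BatchBPE's exact design: take the top-k frequent pairs that share no symbol and merge them all in one pass, cutting the number of corpus sweeps roughly by a factor of k.

```python
from collections import Counter

def select_batch(pairs: Counter, k: int) -> list[tuple[str, str]]:
    """Top-k frequent pairs sharing no symbol, safe to merge in one pass."""
    batch: list[tuple[str, str]] = []
    used: set[str] = set()
    for (a, b), _ in pairs.most_common():
        if len(batch) == k:
            break
        if a in used or b in used:   # would interact with a batched merge
            continue
        batch.append((a, b))
        used.update((a, b))
    return batch
```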
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
- MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization [81.83460411131931]
In multilingual settings, non-Latin scripts and low-resource languages are usually disadvantaged in terms of language models' utility, efficiency, and cost.
We propose adaptive gradient-based subword tokenization to reduce over-segmentation in multilingual settings.
arXiv Detail & Related papers (2024-07-11T18:59:21Z)
- Tokenization Is More Than Compression [14.939912120571728]
Existing tokenization approaches like Byte-Pair Encoding (BPE) originate from the field of data compression.
We introduce PathPiece, a new tokenizer that segments a document's text into the minimum number of tokens for a given vocabulary.
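Minimum-token segmentation for a fixed vocabulary is a textbook dynamic program; the sketch below shows the objective PathPiece optimizes, omitting the paper's vocabulary-construction details. The max_len cap is an assumption, and vocab must contain every single character of the text for a segmentation to exist.

```python
def min_token_segmentation(text: str, vocab: set[str],
                           max_len: int = 16) -> list[str]:
    INF = float("inf")
    n = len(text)
    best = [0] + [INF] * n        # best[i]: fewest tokens covering text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            if best[j] + 1 < best[i] and text[j:i] in vocab:
                best[i], back[i] = best[j] + 1, j
    if best[n] == INF:
        raise ValueError("text cannot be covered by this vocabulary")
    tokens: list[str] = []
    i = n
    while i > 0:                  # recover the segmentation right-to-left
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]
```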
arXiv Detail & Related papers (2024-02-28T14:52:15Z)
- Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition [0.0]
Byte pair encoding (BPE) emerges as an effective tokenization method for tackling the out-of-vocabulary (OOV) challenge.
Recent research highlights that the efficacy of BPE subword tokenization depends on the morphological nature of the language.
Our study empirically identifies the optimal number of BPE tokens for Bengali, a language known for its morphological complexity.
arXiv Detail & Related papers (2024-01-28T00:41:21Z)
- SelfSeg: A Self-supervised Sub-word Segmentation Method for Neural Machine Translation [51.881877192924414]
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT).
This paper introduces SelfSeg, a self-supervised neural sub-word segmentation method.
SelfSeg is much faster to train/decode and requires only monolingual dictionaries instead of parallel corpora.
arXiv Detail & Related papers (2023-07-31T04:38:47Z)
- Neural Token Segmentation for High Token-Internal Complexity [7.569526565230962]
Tokenizing raw texts into word units is an essential pre-processing step for NLP pipelines.
We propose a novel neural segmentation model which combines contextualised token representation and char-level decoding.
Our model shows substantial improvements in segmentation accuracy on Hebrew and Arabic compared to the state-of-the-art.
arXiv Detail & Related papers (2022-03-21T10:07:17Z)
- Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation [80.38621085548013]
This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units.
A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations.
arXiv Detail & Related papers (2020-05-03T05:00:50Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
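For contrast with BPE's greedy merges, unigram-LM tokenization scores whole segmentations: structurally the same dynamic program as the PathPiece sketch above, but maximizing total token log-probability instead of minimizing token count (the EM-based vocabulary pruning used in practice, e.g. by SentencePiece, is omitted here).

```python
import math

def viterbi_segment(text: str, logp: dict[str, float]) -> list[str]:
    """logp: token -> log-probability under the unigram LM."""
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(max(0, i - 16), i):    # token length capped at 16
            piece = text[j:i]
            if piece in logp and best[j] + logp[piece] > best[i]:
                best[i], back[i] = best[j] + logp[piece], j
    out: list[str] = []
    i = n
    while i > 0:
        out.append(text[back[i]:i])
        i = back[i]
    return out[::-1]
```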
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.