Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE
- URL: http://arxiv.org/abs/2511.05324v1
- Date: Fri, 07 Nov 2025 15:23:32 GMT
- Title: Evaluating Subword Tokenization Techniques for Bengali: A Benchmark Study with BengaliBPE
- Authors: Firoj Ahmmed Patwary, Abdullah Al Noman
- Abstract summary: BengaliBPE is a language-aware subword tokenizer for the Bengali script. It applies Unicode normalization and morphology-aware merge rules to maintain linguistic consistency and preserve subword integrity. It provides the most detailed segmentation and the best morphological interpretability, albeit with slightly higher computational cost.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tokenization is an important first step in Natural Language Processing (NLP) pipelines because it decides how models learn and represent linguistic information. However, current subword tokenizers like SentencePiece or HuggingFace BPE are mostly designed for Latin or multilingual corpora and do not perform well on languages with rich morphology such as Bengali. To address this limitation, we present BengaliBPE, a Byte Pair Encoding (BPE) tokenizer specifically developed for the Bengali script. BengaliBPE applies Unicode normalization, grapheme-level initialization, and morphology-aware merge rules to maintain linguistic consistency and preserve subword integrity. We use a large-scale Bengali news classification dataset to compare BengaliBPE with three baselines: Whitespace, SentencePiece BPE, and HuggingFace BPE. The evaluation considers tokenization granularity, encoding speed, and downstream classification accuracy. While all methods perform reasonably well, BengaliBPE provides the most detailed segmentation and the best morphological interpretability, albeit with slightly higher computational cost. These findings highlight the importance of language-aware tokenization for morphologically rich scripts and establish BengaliBPE as a strong foundation for future Bengali NLP systems, including large-scale pretraining of contextual language models.
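The abstract names Unicode normalization and grapheme-level initialization as the first stages of BengaliBPE but does not say how they are implemented. The following is a minimal sketch of what such a preprocessing step might look like, not the authors' code. The function names `normalize` and `grapheme_units` are hypothetical, NFC is an assumed choice of normalization form, and the clustering rule (attach combining marks, and join consonants across a hasanta) is a simplification of full Unicode grapheme segmentation.

```python
import unicodedata

HASANTA = "\u09CD"           # Bengali virama; links consonants into conjuncts
ZWJ, ZWNJ = "\u200D", "\u200C"  # zero-width joiner / non-joiner


def normalize(text: str) -> str:
    """Apply NFC (an assumed choice) so equivalent sequences share one form."""
    return unicodedata.normalize("NFC", text)


def grapheme_units(word: str) -> list[str]:
    """Split one word into approximate Bengali grapheme clusters.

    These clusters, rather than raw code points or bytes, would serve as
    the initial symbols before any BPE merges are learned.
    """
    units: list[str] = []
    for ch in normalize(word):
        # Vowel signs, anusvara, nukta, hasanta, etc. are combining marks.
        is_mark = unicodedata.category(ch) in ("Mn", "Mc") or ch in (ZWJ, ZWNJ)
        # A character right after a hasanta continues the same conjunct.
        joins_conjunct = bool(units) and units[-1].endswith(HASANTA)
        if units and (is_mark or joins_conjunct):
            units[-1] += ch
        else:
            units.append(ch)
    return units


if __name__ == "__main__":
    # U+09AC U+09BE U+0982 U+09B2 U+09BE spells the word "Bangla".
    print(grapheme_units("\u09AC\u09BE\u0982\u09B2\u09BE"))
```

Starting from clusters like these means a later merge can never split a vowel sign or conjunct away from its base consonant, which is one plausible reading of "preserve subword integrity".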
Related papers
- Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization [53.22544362024936]
Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines.
Standard algorithms for learning tokenizers rely on frequency-based objectives.
We introduce Parity-aware Byte Pair Encoding (BPE).
We find empirically that Parity-aware BPE leads to more equitable token counts across languages.
arXiv Detail & Related papers (2025-08-06T18:14:43Z)
- Tokenization Matters: Improving Zero-Shot NER for Indic Languages [2.964265227875254]
Tokenization is a critical component of Natural Language Processing (NLP).
This work systematically compares BPE, SentencePiece, and Character Level tokenization strategies using Indic languages.
Results show that SentencePiece is a consistently better performing approach than BPE for NER in low resource Indic languages.
arXiv Detail & Related papers (2025-04-23T17:28:38Z)
- MorphTok: Morphologically Grounded Tokenization for Indian Languages [23.58043476541051]
Tokenization is a crucial step in NLP, especially with the rise of large language models (LLMs).
We propose morphology-aware segmentation as a pre-tokenization step prior to applying subword tokenization.
We also introduce Constrained BPE, an extension to the traditional BPE algorithm that incorporates script-specific constraints.
arXiv Detail & Related papers (2025-04-14T15:44:45Z)
- SuperBPE: Space Travel for Language Models [103.09169510391972]
We introduce a "superword" tokenizer, SuperBPE, which incorporates a simple pretokenization curriculum into the byte-pair encoding (BPE) algorithm.
SuperBPE encodes a fixed piece of text with up to 33% fewer tokens than BPE on average.
Our model trained with SuperBPE achieves an average +4.0% absolute improvement over the BPE baseline across 30 downstream tasks.
arXiv Detail & Related papers (2025-03-17T17:53:23Z)
- Batching BPE Tokenization Merges [55.2480439325792]
BatchBPE is an open-source pure Python implementation of the Byte Pair algorithm.
It is used to train a high quality tokenizer on a basic laptop.
arXiv Detail & Related papers (2024-08-05T09:37:21Z)
- CoSTA: Code-Switched Speech Translation using Aligned Speech-Text Interleaving [61.73180469072787]
We focus on the problem of spoken translation (ST) of code-switched speech in Indian languages to English text.
We present a new end-to-end model architecture, CoSTA, that scaffolds on pretrained automatic speech recognition (ASR) and machine translation (MT) modules.
CoSTA significantly outperforms many competitive cascaded and end-to-end multimodal baselines by up to 3.5 BLEU points.
arXiv Detail & Related papers (2024-06-16T16:10:51Z)
- Byte Pair Encoding Is All You Need For Automatic Bengali Speech Recognition [0.0]
Byte pair encoding (BPE) emerges as an effective tokenization method for tackling the out-of-vocabulary (OOV) challenge.
Recent research highlights the dependency of BPE subword tokenization's efficacy on the morphological nature of the language.
Our study empirically identifies the optimal number of BPE tokens for Bengali, a language known for its morphological complexity.
arXiv Detail & Related papers (2024-01-28T00:41:21Z)
- Continual Mixed-Language Pre-Training for Extremely Low-Resource Neural Machine Translation [53.22775597051498]
We present a continual pre-training framework on mBART to effectively adapt it to unseen languages.
Results show that our method can consistently improve the fine-tuning performance upon the mBART baseline.
Our approach also boosts the performance on translation pairs where both languages are seen in the original mBART's pre-training.
arXiv Detail & Related papers (2021-05-09T14:49:07Z)
- BNLP: Natural language processing toolkit for Bengali language [0.0]
BNLP is an open source language processing toolkit for the Bengali language.
It provides tokenization, word embedding, POS tagging, and NER tagging facilities.
BNLP is widely used in the Bengali research community, with 16K downloads, 119 stars and 31 forks.
arXiv Detail & Related papers (2021-01-31T07:56:08Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)