Related papers: Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha

Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha

URL: http://arxiv.org/abs/2509.15255v1
Date: Thu, 18 Sep 2025 07:02:55 GMT
Title: Comparative Analysis of Tokenization Algorithms for Low-Resource Language Dzongkha
Authors: Tandin Wangchuk, Tad Gonsalves,
Abstract summary: Most pre-trained tokenizers are suitable for high-resource languages like English but perform poorly for low-resource languages.<n>This study evaluates the training and performance of three common tokenization algorithms in comparison to other popular methods.<n>The results show that while all three algorithms demonstrate potential, SentencePiece is the most effective for Dzongkha tokenization.
Score: 0.1019561860229868
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are gaining popularity and improving rapidly. Tokenizers are crucial components of natural language processing, especially for LLMs. Tokenizers break down input text into tokens that models can easily process while ensuring the text is accurately represented, capturing its meaning and structure. Effective tokenizers enhance the capabilities of LLMs by improving a model's understanding of context and semantics, ultimately leading to better performance in various downstream tasks, such as translation, classification, sentiment analysis, and text generation. Most pre-trained tokenizers are suitable for high-resource languages like English but perform poorly for low-resource languages. Dzongkha, Bhutan's national language spoken by around seven hundred thousand people, is a low-resource language, and its linguistic complexity poses unique NLP challenges. Despite some progress, significant research in Dzongkha NLP is lacking, particularly in tokenization. This study evaluates the training and performance of three common tokenization algorithms in comparison to other popular methods. Specifically, Byte-Pair Encoding (BPE), WordPiece, and SentencePiece (Unigram) were evaluated for their suitability for Dzongkha. Performance was assessed using metrics like Subword Fertility, Proportion of Continued Words, Normalized Sequence Length, and execution time. The results show that while all three algorithms demonstrate potential, SentencePiece is the most effective for Dzongkha tokenization, paving the way for further NLP advancements. This underscores the need for tailored approaches for low-resource languages and ongoing research. In this study, we presented three tokenization algorithms for Dzongkha, paving the way for building Dzongkha Large Language Models.

Related papers

What Language is This? Ask Your Tokenizer [32.28976119949841]
Language Identification (LID) is an important component of many multilingual natural language processing pipelines.<n>We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm.<n>Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models.
arXiv Detail & Related papers (2026-02-19T18:58:39Z)
IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs [5.068673710249497]
IndicSuperTokenizer is a tokenizer for Indic multilingual LLMs.<n>It combines subword and multi-word tokenization, along with language-specific tokens pre-tokenization.<n>It improves the average fertility score by 39.5% over LLaMA4 and by 18% over Sutra.
arXiv Detail & Related papers (2025-11-05T06:57:42Z)
Tokenization Strategies for Low-Resource Agglutinative Languages in Word2Vec: Case Study on Turkish and Finnish [0.0]
Tokenization plays a critical role in processing agglutinative languages.<n>This study evaluates the impact of various tokenization strategies on the quality of static word embeddings.
arXiv Detail & Related papers (2025-08-27T22:01:11Z)
Tokens with Meaning: A Hybrid Tokenization Approach for NLP [0.2826977330147589]
Tokenization plays a pivotal role in natural language processing (NLP)<n>We introduce a hybrid tokenization framework that combines rule-based morphological analysis with statistical subword segmentation.<n>The method uses phono normalization, root-affix, and a novel algorithm that balances morpheme preservation with vocabulary efficiency.
arXiv Detail & Related papers (2025-08-19T22:17:42Z)
Evaluating LLMs' Multilingual Capabilities for Bengali: Benchmark Creation and Performance Analysis [0.0]
Bengali is an underrepresented language in NLP research.<n>We systematically investigate the challenges that hinder Bengali NLP performance.<n>Our findings reveal consistent performance gaps for Bengali compared to English.
arXiv Detail & Related papers (2025-07-31T05:16:43Z)
Tokenization Matters: Improving Zero-Shot NER for Indic Languages [2.964265227875254]
Tokenization is a critical component of Natural Language Processing (NLP)<n>This work systematically compares BPE, SentencePiece, and Character Level tokenization strategies using Indic languages.<n>Results show that SentencePiece is a consistently better performing approach than BPE for NER in low resource Indic languages.
arXiv Detail & Related papers (2025-04-23T17:28:38Z)
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chainof-thought (CoT) data.<n>We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z)
Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs.<n>We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
arXiv Detail & Related papers (2024-11-26T18:28:09Z)
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
Better Than Whitespace: Information Retrieval for Languages without Custom Tokenizers [48.036317742487796]
We propose a new approach to tokenization for lexical matching retrieval algorithms. We use the WordPiece tokenizer, which can be built automatically from unsupervised data. Results show that the mBERT tokenizer provides strong relevance signals for retrieval "out of the box", outperforming whitespace tokenization on most languages.
arXiv Detail & Related papers (2022-10-11T14:32:46Z)
SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models. We propose three kinds of tokenizers: SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers; 2) JIEZI (meaning Solve Character), the glyph-based tokenizers. We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing these concerns. We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using non-speaker'' (NS) annotations. We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.