Related papers: Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights

Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights

URL: http://arxiv.org/abs/2506.17789v2
Date: Tue, 24 Jun 2025 09:35:36 GMT
Title: Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights
Authors: N J Karthika, Maharaj Brahma, Rohit Saluja, Ganesh Ramakrishnan, Maunendra Sankar Desarkar,
Abstract summary: This paper presents an intrinsic evaluation of tokenization strategies across 17 Indian languages.<n>We quantify the trade-offs between bottom-up and top-down tokenizer algorithms.<n>We show that extremely low-resource languages can benefit from tokenizers trained on related high-resource languages.
Score: 27.369278566345074
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tokenization plays a pivotal role in multilingual NLP. However, existing tokenizers are often skewed towards high-resource languages, limiting their effectiveness for linguistically diverse and morphologically rich languages such as those in the Indian subcontinent. This paper presents a comprehensive intrinsic evaluation of tokenization strategies across 17 Indian languages. We quantify the trade-offs between bottom-up and top-down tokenizer algorithms (BPE and Unigram LM), effects of vocabulary sizes, and compare strategies of multilingual vocabulary construction such as joint and cluster-based training. We also show that extremely low-resource languages can benefit from tokenizers trained on related high-resource languages. Our study provides practical insights for building more fair, efficient, and linguistically informed tokenizers for multilingual NLP.

Related papers

Tokenization Matters: Improving Zero-Shot NER for Indic Languages [2.964265227875254]
Tokenization is a critical component of Natural Language Processing (NLP)<n>This work systematically compares BPE, SentencePiece, and Character Level tokenization strategies using Indic languages.<n>Results show that SentencePiece is a consistently better performing approach than BPE for NER in low resource Indic languages.
arXiv Detail & Related papers (2025-04-23T17:28:38Z)
CoCo-CoLa: Evaluating and Improving Language Adherence in Multilingual LLMs [1.2057938662974816]
Large Language Models (LLMs) develop cross-lingual abilities despite being trained on limited parallel data.<n>We introduce CoCo-CoLa, a novel metric to evaluate language adherence in multilingual LLMs.
arXiv Detail & Related papers (2025-02-18T03:03:53Z)
How does a Multilingual LM Handle Multiple Languages? [0.0]
This study critically examines capabilities in multilingual understanding, semantic representation, and cross-lingual knowledge transfer.<n>It assesses semantic similarity by analyzing multilingual word embeddings for consistency using cosine similarity.<n>It examines BLOOM-1.7B and Qwen2 through Named Entity Recognition and sentence similarity tasks to understand their linguistic structures.
arXiv Detail & Related papers (2025-02-06T18:08:14Z)
Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
We propose Lens, a novel approach to enhance multilingual capabilities in large language models (LLMs)<n>Lens operates on two subspaces: the language-agnostic subspace, where it aligns target languages with the central language to inherit strong semantic representations, and the language-specific subspace, where it separates target and central languages to preserve linguistic specificity.<n>Lens significantly improves multilingual performance while maintaining the model's English proficiency, achieving better results with less computational cost compared to existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z)
Parrot: Multilingual Visual Instruction Tuning [66.65963606552839]
Existing methods typically align vision encoders with Multimodal Large Language Models (MLLMs) via supervised fine-tuning (SFT)<n>We propose PARROT, a novel approach that leverages textual guidance for visual token alignment at the language level.<n>We introduce the Massive Multilingual Multimodal Benchmark (MMMB), a new benchmark comprising 6 languages, 15 categories, and 12,000 questions.
arXiv Detail & Related papers (2024-06-04T17:56:28Z)
Multilingual Evaluation of Semantic Textual Relatedness [0.0]
Semantic Textual Relatedness (STR) goes beyond superficial word overlap, considering linguistic elements and non-linguistic factors like topic, sentiment, and perspective. Prior NLP research has predominantly focused on English, limiting its applicability across languages. We explore STR in Marathi, Hindi, Spanish, and English, unlocking the potential for information retrieval, machine translation, and more.
arXiv Detail & Related papers (2024-04-13T17:16:03Z)
How do Large Language Models Handle Multilingualism? [81.15060972112563]
This study explores how large language models (LLMs) handle multilingualism. LLMs initially understand the query, converting multilingual inputs into English for task-solving. In the intermediate layers, they employ English for thinking and incorporate multilingual knowledge with self-attention and feed-forward structures.
arXiv Detail & Related papers (2024-02-29T02:55:26Z)
Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages [3.716965622352967]
We propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers. Our findings show that the overlap of vocabulary across languages can be actually detrimental to certain downstream tasks.
arXiv Detail & Related papers (2023-05-26T18:06:49Z)
Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP. We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages. Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis. We cluster all the target languages into multiple groups and name each group as a representation sprachbund. Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z)
Looking for Clues of Language in Multilingual BERT to Improve Cross-lingual Generalization [56.87201892585477]
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information. We control the output languages of multilingual BERT by manipulating the token embeddings.
arXiv Detail & Related papers (2020-10-20T05:41:35Z)
Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source. We observe that our representations embed typology and strengthen correlations with language relationships. We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
Cross-lingual, Character-Level Neural Morphological Tagging [57.0020906265213]
We train character-level recurrent neural taggers to predict morphological taggings for high-resource languages and low-resource languages together.<n>Learning joint character representations among multiple related languages successfully enables knowledge transfer from the high-resource languages to the low-resource ones, improving accuracy by up to 30% over a monolingual model.
arXiv Detail & Related papers (2017-08-30T08:14:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.