Related papers: Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

URL: http://arxiv.org/abs/2510.12389v1
Date: Tue, 14 Oct 2025 11:14:38 GMT
Title: Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency
Authors: Hailay Kidu Teklehaymanot, Wolfgang Nejdl,
Abstract summary: Tokenization disparities pose a significant barrier to achieving equitable access to artificial intelligence.<n>This study conducts a large-scale cross-linguistic evaluation of tokenization efficiency in over 200 languages.
Score: 6.943451388015595
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Tokenization disparities pose a significant barrier to achieving equitable access to artificial intelligence across linguistically diverse populations. This study conducts a large-scale cross-linguistic evaluation of tokenization efficiency in over 200 languages to systematically quantify computational inequities in large language models (LLMs). Using a standardized experimental framework, we applied consistent preprocessing and normalization protocols, followed by uniform tokenization through the tiktoken library across all language samples. Comprehensive tokenization statistics were collected using established evaluation metrics, including Tokens Per Sentence (TPS) and Relative Tokenization Cost (RTC), benchmarked against English baselines. Our cross-linguistic analysis reveals substantial and systematic disparities: Latin-script languages consistently exhibit higher tokenization efficiency, while non-Latin and morphologically complex languages incur significantly greater token inflation, often 3-5 times higher RTC ratios. These inefficiencies translate into increased computational costs and reduced effective context utilization for underrepresented languages. Overall, the findings highlight structural inequities in current AI systems, where speakers of low-resource and non-Latin languages face disproportionate computational disadvantages. Future research should prioritize the development of linguistically informed tokenization strategies and adaptive vocabulary construction methods that incorporate typological diversity, ensuring more inclusive and computationally equitable multilingual AI systems.

Related papers

What Language is This? Ask Your Tokenizer [32.28976119949841]
Language Identification (LID) is an important component of many multilingual natural language processing pipelines.<n>We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm.<n>Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models.
arXiv Detail & Related papers (2026-02-19T18:58:39Z)
Tokenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation [9.23725598061561]
This study systematically compares three subword paradigms -- Byte Pair.<n>(BPE), Overlap BPE (OBPE), and Unigram Language Model -- across six Uralic languages.<n>We show OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods.
arXiv Detail & Related papers (2026-02-04T05:59:25Z)
Parallel Universes, Parallel Languages: A Comprehensive Study on LLM-based Multilingual Counterfactual Example Generation [49.2073409243885]
Large language models (LLMs) excel at generating English counterfactuals and demonstrate multilingual proficiency.<n>We conduct automatic evaluations on both directly generated counterfactuals in the target languages and those derived via English translation across six languages.<n>We identify and categorize four main types of errors that consistently appear in the generated counterfactuals across languages.
arXiv Detail & Related papers (2026-01-01T08:53:49Z)
Code-Switching In-Context Learning for Cross-Lingual Transfer of Large Language Models [64.54005959758733]
We introduce code-switching in-context learning (CSICL) as a principled and robust approach for overcoming the translation barrier during inference.<n>We conduct extensive experiments across 4 LLMs, 6 datasets, and 10 languages, spanning both knowledge-intensive and reasoning-oriented domains.<n>Our results demonstrate CSICL consistently outperforms X-ICL baselines, achieving gains of 3.1%p and 1.9%p in both target and unseen languages.
arXiv Detail & Related papers (2025-10-07T08:35:42Z)
Parallel Scaling Law: Unveiling Reasoning Generalization through A Cross-Linguistic Perspective [52.452449102961225]
This study proposes a novel cross-linguistic perspective to investigate reasoning generalization.<n>Our findings reveal that cross-lingual transferability varies significantly across initial model, target language, and training paradigm.<n>Our study challenges the assumption that LRM reasoning mirrors human cognition, providing critical insights for the development of more language-agnostic LRMs.
arXiv Detail & Related papers (2025-10-02T17:49:49Z)
Doğal Dil İşlemede Tokenizasyon Standartları ve Ölçümü: Türkçe Üzerinden Büyük Dil Modellerinin Karşılaştırmalı Analizi [0.29687381456163997]
This study introduces a novel evaluation framework addressing tokenization challenges specific to morphologically-rich and low-resource languages such as Turkish.<n>We assessed tokenizers based on vocabulary size, token count, processing time, language-specific token percentages (%TR), and token purity (%Pure)<n>Our analysis reveals that language-specific token percentages exhibit a stronger correlation with downstream performance (e.g., MMLU scores) than token purity.
arXiv Detail & Related papers (2025-08-18T16:26:42Z)
The Art of Breaking Words: Rethinking Multilingual Tokenizer Design [21.9940001977516]
Existing tokenizers exhibit high token-to-word ratios, inefficient use of context length, and slower inference.<n>We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality.<n>Our tokenizer achieves more than 40% improvement on average token-to-word ratio against stateof-the-art multilingual Indic models.
arXiv Detail & Related papers (2025-08-03T15:31:10Z)
Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models [56.61984030508691]
We present the first mechanistic interpretability study of language confusion.<n>We show that confusion points (CPs) are central to this phenomenon.<n>We show that editing a small set of critical neurons, identified via comparative analysis with a multilingual-tuned counterpart, substantially mitigates confusion.
arXiv Detail & Related papers (2025-05-22T11:29:17Z)
Tokenization Standards for Linguistic Integrity: Turkish as a Benchmark [0.29687381456163997]
Tokenization is a fundamental preprocessing step in NLP, directly impacting large language models' ability to capture syntactic, morphosyntactic, and semantic structures.<n>This paper introduces a novel framework for evaluating tokenization strategies, addressing challenges in morphologically rich and low-resource languages.
arXiv Detail & Related papers (2025-02-10T21:47:49Z)
Towards Linguistically-Aware and Language-Independent Tokenization for Large Language Models (LLMs) [0.09374652839580183]
This paper presents a study on the tokenization techniques employed by state-of-the-art large language models (LLMs) The study evaluates the tokenization variability observed across these models and investigates the challenges of linguistic representation in subword tokenization. This research aims to promote generalizable Internationalization (I18N) practices in the development of AI services in this domain and beyond.
arXiv Detail & Related papers (2024-10-04T16:18:29Z)
Improving Classifier Training Efficiency for Automatic Cyberbullying Detection with Feature Density [58.64907136562178]
We study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods. We hypothesise that estimating dataset complexity allows for the reduction of the number of required experiments. The difference in linguistic complexity of datasets allows us to additionally discuss the efficacy of linguistically-backed word preprocessing.
arXiv Detail & Related papers (2021-11-02T15:48:28Z)
Evaluating Multilingual Text Encoders for Unsupervised Cross-Lingual Retrieval [51.60862829942932]
We present a systematic empirical study focused on the suitability of the state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks. For sentence-level CLIR, we demonstrate that state-of-the-art performance can be achieved. However, the peak performance is not met using the general-purpose multilingual text encoders off-the-shelf', but rather relying on their variants that have been further specialized for sentence understanding tasks.
arXiv Detail & Related papers (2021-01-21T00:15:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.