Invisible Languages of the LLM Universe
- URL: http://arxiv.org/abs/2510.11557v1
- Date: Mon, 13 Oct 2025 16:00:15 GMT
- Title: Invisible Languages of the LLM Universe
- Authors: Saurabh Khanna, Xinxu Li
- Abstract summary: 2,000 languages with millions of speakers remain effectively invisible in digital ecosystems. Our analysis reveals that English dominance in AI is not a technical necessity but an artifact of power structures that systematically exclude marginalized linguistic knowledge.
- Score: 1.2891210250935148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models are trained on massive multilingual corpora, yet this abundance masks a profound crisis: of the world's 7,613 living languages, approximately 2,000 languages with millions of speakers remain effectively invisible in digital ecosystems. We propose a critical framework connecting empirical measurements of language vitality (real world demographic strength) and digitality (online presence) with postcolonial theory and epistemic injustice to explain why linguistic inequality in AI systems is not incidental but structural. Analyzing data across all documented human languages, we identify four categories: Strongholds (33%, high vitality and digitality), Digital Echoes (6%, high digitality despite declining vitality), Fading Voices (36%, low on both dimensions), and critically, Invisible Giants (27%, high vitality but near-zero digitality) - languages spoken by millions yet absent from the LLM universe. We demonstrate that these patterns reflect continuities from colonial-era linguistic hierarchies to contemporary AI development, constituting what we term digital epistemic injustice. Our analysis reveals that English dominance in AI is not a technical necessity but an artifact of power structures that systematically exclude marginalized linguistic knowledge. We conclude with implications for decolonizing language technology and democratizing access to AI benefits.
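To make the four-quadrant scheme concrete, here is a minimal sketch of how languages could be binned by vitality and digitality. The threshold, score normalization, and example values are illustrative assumptions; the paper's actual measures and cutoffs are not reproduced here.

```python
# Hypothetical sketch: classify languages into the four categories named in the
# abstract using two normalized scores in [0, 1]. The threshold and the example
# scores are illustrative assumptions, not the paper's methodology.

def categorize(vitality: float, digitality: float, threshold: float = 0.5) -> str:
    """Map a (vitality, digitality) pair to one of the four quadrants."""
    if vitality >= threshold and digitality >= threshold:
        return "Stronghold"       # high vitality, high digitality
    if vitality < threshold and digitality >= threshold:
        return "Digital Echo"     # strong online presence despite declining vitality
    if vitality >= threshold and digitality < threshold:
        return "Invisible Giant"  # millions of speakers, near-zero digital footprint
    return "Fading Voice"         # low on both dimensions

# Toy usage with made-up scores:
for name, vit, dig in [("English", 0.9, 0.99), ("Hausa", 0.8, 0.05), ("Latin", 0.1, 0.6)]:
    print(name, "->", categorize(vit, dig))
```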
Related papers
- Artificial intelligence is creating a new global linguistic hierarchy [34.80252741178931]
We present a global longitudinal analysis of social, economic and infrastructural conditions across languages. We find that despite efforts to broaden the reach of language technologies, the dominance of a handful of languages is exacerbating disparities. We introduce the Language AI Readiness Index (EQUATE), which maps the state of technological, socio-economic, and infrastructural prerequisites for AI deployment across languages.
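One way to picture a composite readiness index like EQUATE is as a weighted average of normalized per-language indicators. The weights, indicator names, and scores below are assumptions for illustration only, not the published formula.

```python
# Illustrative composite-index sketch (not EQUATE's actual formula).
# Each indicator is assumed to be pre-normalized to [0, 1].
WEIGHTS = {"technological": 0.4, "socio_economic": 0.3, "infrastructural": 0.3}

def readiness_score(indicators: dict[str, float]) -> float:
    """Weighted average of normalized prerequisite indicators."""
    return sum(WEIGHTS[k] * indicators[k] for k in WEIGHTS)

print(readiness_score({"technological": 0.2, "socio_economic": 0.5, "infrastructural": 0.3}))
```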
arXiv Detail & Related papers (2026-02-12T14:50:44Z) - Losing our Tail -- Again: On (Un)Natural Selection And Multilingual Large Language Models [0.8702432681310399]
I argue that the tails of our linguistic distributions are vanishing, and with them, the narratives and identities they carry. This is a call to resist linguistic flattening and to reimagine NLP as a field that encourages, values and protects expressive multilingual lexical and linguistic diversity and creativity.
arXiv Detail & Related papers (2025-07-05T07:36:49Z) - Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems [0.4218593777811082]
Language is a cornerstone of cultural identity, yet globalization and the dominance of major languages have placed nearly 3,000 languages at risk of extinction. Existing AI-driven translation models prioritize efficiency but often fail to capture cultural nuances, idiomatic expressions, and historical significance. We propose a multi-agent AI framework designed for culturally adaptive translation in underserved language communities.
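A purely structural sketch of the kind of multi-agent pipeline the summary describes: a translator agent whose draft is then revised by a cultural-adaptation reviewer. The agent roles, prompts, and the `call_model` placeholder are assumptions, not the paper's actual design.

```python
# Structural sketch of a two-agent translation pipeline (roles and prompts are
# illustrative assumptions). `call_model` stands in for any text-generation backend.
from typing import Callable

def translate_with_review(text: str, target_lang: str,
                          call_model: Callable[[str], str]) -> str:
    draft = call_model(
        f"Translate into {target_lang}, preserving idioms where possible:\n{text}")
    reviewed = call_model(
        "Review this translation for cultural nuance, idiomatic expressions, "
        f"and historical context in {target_lang}. Return a corrected version:\n{draft}")
    return reviewed
```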
arXiv Detail & Related papers (2025-03-05T06:43:59Z) - Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
We propose Lens, a novel approach to enhance multilingual capabilities in large language models (LLMs). Lens operates on two subspaces: the language-agnostic subspace, where it aligns target languages with the central language to inherit strong semantic representations, and the language-specific subspace, where it separates target and central languages to preserve linguistic specificity. Lens significantly improves multilingual performance while maintaining the model's English proficiency, achieving better results with less computational cost compared to existing post-training approaches.
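A rough numpy sketch of the two-subspace idea as summarized: project representations onto a language-agnostic subspace, where a target language is pulled toward the central language, while the complementary (language-specific) component is left intact. The projectors, ranks, and update rule are illustrative assumptions, not the Lens algorithm itself.

```python
import numpy as np

# Illustrative two-subspace operation on hidden representations
# (dimensions, projectors, and the update rule are assumptions).
d, k = 768, 64                                # hidden size, rank of the "agnostic" subspace
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.normal(size=(d, k)))[0]  # orthonormal basis of the agnostic subspace
P_agnostic = U @ U.T                          # projector onto that subspace
P_specific = np.eye(d) - P_agnostic           # projector onto its complement

central = rng.normal(size=d)                  # e.g. a central-language representation
target = rng.normal(size=d)                   # a target-language representation

# Pull the target toward the central language inside the agnostic subspace,
# while keeping its language-specific component unchanged.
alpha = 0.5
aligned = P_agnostic @ target + alpha * (P_agnostic @ (central - target))
updated_target = aligned + P_specific @ target
```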
arXiv Detail & Related papers (2024-10-06T08:51:30Z) - Large Models of What? Mistaking Engineering Achievements for Human Linguistic Agency [0.11510009152620666]
We argue that claims regarding linguistic capabilities of Large Language Models (LLMs) are based on at least two unfounded assumptions.
Language completeness assumes that a distinct and complete thing such as 'a natural language' exists.
The assumption of data completeness relies on the belief that a language can be quantified and wholly captured by data.
arXiv Detail & Related papers (2024-07-11T18:06:01Z) - Language Model Alignment in Multilingual Trolley Problems [138.5684081822807]
Building on the Moral Machine experiment, we develop a cross-lingual corpus of moral dilemma vignettes in over 100 languages called MultiTP. Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions. We discover significant variance in alignment across languages, challenging the assumption of uniform moral reasoning in AI systems.
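A toy sketch of measuring cross-language alignment variance: compute a per-language agreement rate between model and human choices, then look at its spread. The data, languages, and agreement metric are placeholders, not the MultiTP setup.

```python
# Toy per-language agreement between model and human choices on dilemma items,
# plus the variance of that agreement across languages (all data is made up).
from statistics import mean, pvariance

model_choices = {"en": [1, 0, 1, 1], "sw": [0, 0, 1, 0], "hi": [1, 1, 1, 0]}
human_choices = {"en": [1, 0, 1, 0], "sw": [1, 0, 0, 0], "hi": [1, 1, 0, 0]}

agreement = {
    lang: mean(int(m == h) for m, h in zip(model_choices[lang], human_choices[lang]))
    for lang in model_choices
}
print(agreement)                      # per-language alignment rates
print(pvariance(agreement.values()))  # spread across languages
```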
arXiv Detail & Related papers (2024-07-02T14:02:53Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
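Lexical diversity can be approximated with a simple type-token ratio; whether this matches the paper's exact metric is an assumption, but it illustrates how corpora built by different construction methods might be compared.

```python
# Sketch: compare lexical diversity of two toy corpora via type-token ratio (TTR).
# TTR is only one common diversity measure; the paper's metric may differ.

def type_token_ratio(texts: list[str]) -> float:
    tokens = [tok for text in texts for tok in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Toy Indonesian-language examples (content is illustrative only).
scraped = ["berita hari ini sama seperti berita kemarin"]
written = ["cerita rakyat dituturkan turun-temurun di pesisir utara"]
print("scraped:", type_token_ratio(scraped))
print("written:", type_token_ratio(written))
```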
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Towards Bridging the Digital Language Divide [4.234367850767171]
Multilingual language processing systems often exhibit a hardwired, yet usually involuntary and hidden representational preference towards certain languages.
We show that biased technology is often the result of research and development methodologies that do not do justice to the complexity of the languages being represented.
We present a new initiative that aims at reducing linguistic bias through both technological design and methodology.
arXiv Detail & Related papers (2023-07-25T10:53:20Z) - Understanding Natural Language Understanding Systems. A Critical Analysis [91.81211519327161]
The development of machines that «talk like us», also known as Natural Language Understanding (NLU) systems, is the Holy Grail of Artificial Intelligence (AI).
But never has the trust that we can build «talking machines» been stronger than the one engendered by the last generation of NLU systems.
Are we at the dawn of a new era, in which the Grail is finally closer to us?
arXiv Detail & Related papers (2023-03-01T08:32:55Z) - Crossing the Conversational Chasm: A Primer on Multilingual Task-Oriented Dialogue Systems [51.328224222640614]
Current state-of-the-art ToD models based on large pretrained neural language models are data-hungry.
Data acquisition for ToD use cases is expensive and tedious.
arXiv Detail & Related papers (2021-04-17T15:19:56Z) - SIGMORPHON 2020 Shared Task 0: Typologically Diverse Morphological Inflection [81.85463892070085]
The SIGMORPHON 2020 task on morphological reinflection aims to investigate systems' ability to generalize across typologically distinct languages.
Systems were developed using data from 45 languages and just 5 language families, fine-tuned with data from an additional 45 languages and 10 language families (13 in total), and evaluated on all 90 languages.
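For readers unfamiliar with the task format, a (re)inflection example maps a lemma plus a morphological tag string to an inflected form. The UniMorph-style tags and the trivial lookup baseline below are illustrative only; real shared-task systems must generalize to unseen lemmas and languages.

```python
# Illustrative shape of a morphological (re)inflection example:
# (lemma, UniMorph-style tag string) -> inflected form.
train = {
    ("walk", "V;PST"): "walked",  # English
    ("niño", "N;PL"): "niños",    # Spanish
}

def inflect(lemma: str, tags: str) -> str:
    # Trivial lookup baseline: fall back to the bare lemma if the pair is unseen.
    return train.get((lemma, tags), lemma)

print(inflect("walk", "V;PST"))
```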
arXiv Detail & Related papers (2020-06-20T13:24:14Z)