Losing our Tail -- Again: On (Un)Natural Selection And Multilingual Large Language Models
- URL: http://arxiv.org/abs/2507.03933v2
- Date: Wed, 09 Jul 2025 13:14:29 GMT
- Title: Losing our Tail -- Again: On (Un)Natural Selection And Multilingual Large Language Models
- Authors: Eva Vanmassenhove
- Abstract summary: I argue that the tails of our linguistic distributions are vanishing, and with them, the narratives and identities they carry. This is a call to resist linguistic flattening and to reimagine NLP as a field that encourages, values and protects expressive multilingual lexical and linguistic diversity and creativity.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multilingual Large Language Models (LLMs) have considerably changed how technologies can influence language. While previous technologies could mediate or assist humans, there is now a tendency to offload the task of writing itself to these technologies, enabling them to change our linguistic ecosystem more directly. While they provide quick access to information and impressively fluent output, beneath their apparent sophistication lies a subtle, more insidious threat: the gradual decline and loss of linguistic diversity. With this opinion piece, I explore how model collapse, with a particular focus on translation technology, can lead to the loss of linguistic forms, grammatical features, and cultural nuance. Model collapse refers to the eventual consequence of self-consuming training loops, where models reinforce their own biases and lose linguistic diversity. Drawing on recent work in Computer Vision, Natural Language Processing (NLP) and Machine Translation (MT), I argue that the tails of our linguistic distributions are vanishing, and with them, the narratives and identities they carry. This is a call to resist linguistic flattening and to reimagine NLP as a field that encourages, values and protects expressive multilingual lexical and linguistic diversity and creativity.
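To make the self-consuming training loop concrete, here is a minimal toy simulation (an illustration of the general phenomenon, not an experiment from the paper; the Gaussian data, sample size, and generation count are assumptions). Each generation fits a model to a finite sample drawn from the previous generation's model, and the tails of the distribution, its rare forms, progressively vanish:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0   # generation 0: the original "human" distribution
n = 100                # finite training sample available to each generation

for gen in range(1, 51):
    # Each generation trains only on the previous generation's output:
    # draw a finite sample from the current model, then refit by maximum likelihood.
    data = rng.normal(mu, sigma, n)
    mu, sigma = data.mean(), data.std()
    if gen % 10 == 0:
        # Probability mass the refit model still places beyond |x| > 3:
        # the "tail" events of the original distribution.
        tail = norm.sf(3, loc=mu, scale=sigma) + norm.cdf(-3, loc=mu, scale=sigma)
        print(f"gen {gen:2d}: sigma = {sigma:.3f}, tail mass = {tail:.2e}")
```

Because each maximum-likelihood refit slightly underestimates the spread and the estimate drifts with sampling noise, tail mass shrinks over generations; with real models the analogous dynamic erases rare lexical and grammatical forms first.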
Related papers
- Sparse Auto-Encoder Interprets Linguistic Features in Large Language Models [40.12943080113246]
We present a systematic and comprehensive causal investigation using sparse auto-encoders (SAEs). We extract a wide range of linguistic features from six dimensions. We introduce two indices, Feature Representation Confidence (FRC) and Feature Intervention Confidence (FIC), to measure the ability of linguistic features to capture and control linguistic phenomena.
arXiv Detail & Related papers (2025-02-27T18:16:47Z)
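As a rough sketch of the sparse auto-encoder technique used in the entry above (a generic SAE over hidden states, not the paper's implementation; the dimensions and penalty weight are assumptions):

```python
import torch
import torch.nn as nn

class SparseAutoEncoder(nn.Module):
    """Overcomplete dictionary over LLM hidden states with an L1 sparsity penalty."""
    def __init__(self, d_model=768, d_dict=8192):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = torch.relu(self.encoder(h))   # sparse feature activations
        return self.decoder(z), z

sae = SparseAutoEncoder()
h = torch.randn(32, 768)                  # stand-in for a batch of hidden states
recon, z = sae(h)
loss = ((recon - h) ** 2).mean() + 1e-3 * z.abs().mean()  # reconstruction + L1
loss.backward()
```

Individual coordinates of z then serve as candidate interpretable features; indices like FRC and FIC would be computed over such features.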
- The Shrinking Landscape of Linguistic Diversity in the Age of Large Language Models [7.811355338367627]
We show that the widespread adoption of large language models (LLMs) as writing assistants is linked to notable declines in linguistic diversity. We show that while the core content of texts is retained when LLMs polish and rewrite texts, not only do they homogenize writing styles, but they also alter stylistic elements in a way that selectively amplifies certain dominant characteristics or biases while suppressing others.
arXiv Detail & Related papers (2025-02-16T20:51:07Z)
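A simple way to operationalize such diversity declines (my own illustrative measure, not the paper's methodology) is a distinct-n ratio, the share of n-grams in a corpus that are unique; comparing it on matched human-written and LLM-polished corpora quantifies homogenization:

```python
from collections import Counter

def distinct_n(texts, n=2):
    """Share of n-gram tokens that are unique types: higher means more lexical diversity."""
    ngrams = Counter()
    for text in texts:
        tokens = text.lower().split()
        ngrams.update(zip(*(tokens[i:] for i in range(n))))
    return len(ngrams) / max(sum(ngrams.values()), 1)

human = ["the moss-choked brook gurgled past the stile",
         "rain needled the tin roof all night"]
polished = ["the view was very beautiful and very calm",
            "the food was very delicious and very good"]
print(distinct_n(human), distinct_n(polished))  # 1.0 vs ~0.86 on this toy pair
```

Real comparisons need large, topic-matched corpora; the toy strings above only show the mechanics.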
- Generative AI and Large Language Models in Language Preservation: Opportunities and Challenges [0.0]
Generative AI (GenAI) and Large Language Models (LLMs) unlock new frontiers in automating corpus creation, transcription, translation, and tutoring. This paper introduces a novel analytical framework that systematically evaluates GenAI applications against language-specific needs. We demonstrate its efficacy through the Te Reo Māori revitalization, where it illuminates successes, such as community-led Automatic Speech Recognition achieving 92% accuracy. Our findings underscore that GenAI can indeed revolutionize language preservation, but only when interventions are rigorously anchored in community-centric data stewardship, continuous evaluation, and transparent risk management.
arXiv Detail & Related papers (2025-01-20T14:03:40Z)
- LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages.
By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z)
- Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
We propose Lens, a novel approach to enhancing multilingual capabilities in large language models (LLMs). Lens operates on two subspaces: the language-agnostic subspace, where it aligns target languages with the central language to inherit strong semantic representations, and the language-specific subspace, where it separates target and central languages to preserve linguistic specificity. Lens significantly improves multilingual performance while maintaining the model's English proficiency, achieving better results at lower computational cost than existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z)
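The two-subspace view above can be pictured with plain linear algebra (a schematic sketch under my own assumptions, not the Lens implementation): treat the top principal directions of a model's hidden states as a shared, language-agnostic subspace and the orthogonal residual as language-specific:

```python
import numpy as np

rng = np.random.default_rng(0)
H = rng.standard_normal((1000, 512))    # stand-in for hidden states over many tokens

k = 64                                  # assumed size of the shared subspace
_, _, Vt = np.linalg.svd(H - H.mean(0), full_matrices=False)
shared = Vt[:k]                         # (k, 512): basis of the shared subspace

def split(h):
    """Decompose one representation into shared + language-specific parts."""
    agnostic = (h @ shared.T) @ shared  # projection onto the shared subspace
    specific = h - agnostic             # orthogonal residual
    return agnostic, specific

agnostic, specific = split(H[0])
```

Alignment across languages would then operate on the agnostic part while leaving the specific part intact.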
- Breaking Boundaries: Investigating the Effects of Model Editing on Cross-linguistic Performance [6.907734681124986]
This paper strategically identifies the need for linguistic equity by examining several knowledge editing techniques in multilingual contexts. We evaluate the performance of models such as Mistral, TowerInstruct, OpenHathi, Tamil-Llama, and Kan-Llama across languages including English, German, French, Italian, Spanish, Hindi, Tamil, and Kannada.
arXiv Detail & Related papers (2024-06-17T01:54:27Z)
- Multilingual Text Representation [3.4447129363520337]
Modern NLP breakthroughs include large multilingual models capable of performing tasks across more than 100 languages.
State-of-the-art language models have come a long way, starting from the simple one-hot representation of words.
We discuss how the full potential of language democratization could be obtained, reaching beyond the known limits.
arXiv Detail & Related papers (2023-09-02T14:21:22Z)
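For readers unfamiliar with the one-hot starting point mentioned above, here is the basic contrast with learned dense embeddings (a generic illustration; the vocabulary and dimensions are invented):

```python
import numpy as np

vocab = ["language", "taal", "langue", "lingua"]   # toy multilingual vocabulary
V, d = len(vocab), 3

one_hot = np.eye(V)                     # every word equidistant from every other
rng = np.random.default_rng(0)
E = rng.standard_normal((V, d))         # dense vectors (random here; learned in practice)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(one_hot[0], one_hot[1]))   # always 0.0: one-hot encodes no similarity
print(cosine(E[0], E[1]))               # nonzero: training can shape this structure
```

Dense multilingual embeddings are what let related words across 100+ languages share representational structure.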
- Diffusion Language Models Can Perform Many Tasks with Scaling and Instruction-Finetuning [52.22611035186903]
We show that scaling diffusion language models can effectively make them strong language learners. We build competent diffusion language models at scale by first acquiring knowledge from massive data. Experiments show that scaling diffusion language models consistently improves performance across downstream language tasks.
arXiv Detail & Related papers (2023-08-23T16:01:12Z)
- Towards Bridging the Digital Language Divide [4.234367850767171]
Multilingual language processing systems often exhibit a hardwired, yet usually involuntary and hidden, representational preference towards certain languages.
We show that biased technology is often the result of research and development methodologies that do not do justice to the complexity of the languages being represented.
We present a new initiative that aims at reducing linguistic bias through both technological design and methodology.
arXiv Detail & Related papers (2023-07-25T10:53:20Z)
- Transparency Helps Reveal When Language Models Learn Meaning [71.96920839263457]
Our systematic experiments with synthetic data reveal that, with languages where all expressions have context-independent denotations, both autoregressive and masked language models learn to emulate semantic relations between expressions.
Turning to natural language, our experiments with a specific phenomenon, referential opacity, add to the growing body of evidence that current language models do not represent natural language semantics well.
arXiv Detail & Related papers (2022-10-14T02:35:19Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this bias from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Bridging Linguistic Typology and Multilingual Machine Translation with Multi-View Language Representations [83.27475281544868]
We use singular vector canonical correlation analysis to study what kind of information is induced from each source.
We observe that our representations embed typology and strengthen correlations with language relationships.
We then take advantage of our multi-view language vector space for multilingual machine translation, where we achieve competitive overall translation accuracy.
arXiv Detail & Related papers (2020-04-30T16:25:39Z)
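For reference, singular vector canonical correlation analysis (SVCCA), as used above, first reduces each representation "view" with an SVD and then computes canonical correlations between the reduced views. A minimal numpy sketch (the data and dimensions are placeholders, not the paper's setup):

```python
import numpy as np

def svcca(X, Y, k=20):
    """SVD-reduce each view to k directions, then return canonical correlations."""
    def reduce(Z):
        Z = Z - Z.mean(0)                       # center the view
        U, S, _ = np.linalg.svd(Z, full_matrices=False)
        return U[:, :k] * S[:k]                 # data in the top-k singular directions

    def orthobasis(Z):
        U, _, _ = np.linalg.svd(Z, full_matrices=False)
        return U                                # orthonormal basis of Z's column space

    A, B = reduce(X), reduce(Y)
    # Canonical correlations are the singular values of Q_A^T Q_B.
    return np.linalg.svd(orthobasis(A).T @ orthobasis(B), compute_uv=False)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))             # view 1, e.g. typology-based vectors
Y = X @ rng.standard_normal((100, 80))          # view 2, linearly related to view 1
print(svcca(X, Y)[:5])                          # high values: shared linear structure
```

High correlations indicate that two sources (say, typological databases and learned MT embeddings) carry overlapping information.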
This list is automatically generated from the titles and abstracts of the papers listed on this site.