From Data Scarcity to Data Care: Reimagining Language Technologies for Serbian and other Low-Resource Languages
- URL: http://arxiv.org/abs/2512.10630v1
- Date: Thu, 11 Dec 2025 13:29:25 GMT
- Title: From Data Scarcity to Data Care: Reimagining Language Technologies for Serbian and other Low-Resource Languages
- Authors: Smiljana Antonijevic Ubois
- Abstract summary: This study examines the structural, historical, and sociotechnical factors shaping language technology development for low-resource languages in the AI age. It traces challenges rooted in the historical destruction of Serbian textual heritage, intensified by contemporary issues. To address these challenges, the study proposes Data Care, a framework grounded in CARE principles.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large language models are commonly trained on dominant languages like English, and their representation of low-resource languages typically reflects cultural and linguistic biases present in the source language materials. Using the Serbian language as a case, this study examines the structural, historical, and sociotechnical factors shaping language technology development for low-resource languages in the AI age. Drawing on semi-structured interviews with ten scholars and practitioners, including linguists, digital humanists, and AI developers, it traces challenges rooted in the historical destruction of Serbian textual heritage, intensified by contemporary issues that drive reductive, engineering-first approaches prioritizing functionality over linguistic nuance. These include superficial transliteration, reliance on English-trained models, data bias, and dataset curation lacking cultural specificity. To address these challenges, the study proposes Data Care, a framework grounded in CARE principles (Collective Benefit, Authority to Control, Responsibility, and Ethics), that reframes bias mitigation from a post hoc technical fix to an integral component of corpus design, annotation, and governance, and positions Data Care as a replicable model for building inclusive, sustainable, and culturally grounded language technologies in contexts where traditional LLM development reproduces existing power imbalances and cultural blind spots.
Related papers
- PLLuM: A Family of Polish Large Language Models [91.61661675434216]
We present PLLuM (Polish Large Language Model), the largest open-source family of foundation models tailored specifically for the Polish language. We describe the development process, including the construction of a new 140-billion-token Polish text corpus for pre-training. We detail the models' architecture, training procedures, and alignment techniques for both base and instruction-tuned variants.
arXiv Detail & Related papers (2025-11-05T19:41:49Z) - Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review [0.7366405857677227]
This paper focuses on strategies to address data scarcity in generative language modelling for low-resource languages (LRLs). We identify, categorise and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering. We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems.
arXiv Detail & Related papers (2025-05-07T16:04:45Z) - Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems [0.4218593777811082]
Language is a cornerstone of cultural identity, yet globalization and the dominance of major languages have placed nearly 3,000 languages at risk of extinction. Existing AI-driven translation models prioritize efficiency but often fail to capture cultural nuances, idiomatic expressions, and historical significance. We propose a multi-agent AI framework designed for culturally adaptive translation in underserved language communities.
arXiv Detail & Related papers (2025-03-05T06:43:59Z) - Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research [32.14802247608518]
Low-resource languages serve as invaluable repositories of human history, embodying cultural evolution and intellectual diversity. Despite their significance, these languages face critical challenges, including data scarcity and technological limitations. Recent advancements in large language models (LLMs) offer transformative opportunities for addressing these challenges.
arXiv Detail & Related papers (2024-11-30T00:10:56Z) - LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages.
By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z) - Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning [84.94709351266557]
We focus on the trustworthiness of language models with respect to retrieval augmentation.
We deem that retrieval-augmented language models have the inherent capabilities of supplying response according to both contextual and parametric knowledge.
Inspired by aligning language models with human preference, we take the first step towards aligning retrieval-augmented language models to a state where they respond relying solely on the external evidence.
arXiv Detail & Related papers (2024-10-22T09:25:21Z) - Recent Advancements and Challenges of Turkic Central Asian Language Processing [4.189204855014775]
Research in NLP for Central Asian Turkic languages faces typical low-resource language challenges.
Recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks.
arXiv Detail & Related papers (2024-07-06T08:58:26Z) - Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z) - History, Development, and Principles of Large Language Models-An Introductory Survey [15.875687167037206]
Language models serve as a cornerstone in natural language processing (NLP).
Over decades of extensive research, language modeling has progressed from initial statistical language models (SLMs) to the contemporary landscape of large language models (LLMs).
arXiv Detail & Related papers (2024-02-10T01:18:15Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Reinforced Iterative Knowledge Distillation for Cross-Lingual Named Entity Recognition [54.92161571089808]
Cross-lingual NER transfers knowledge from rich-resource languages to low-resource languages.
Existing cross-lingual NER methods do not make good use of rich unlabeled data in target languages.
We develop a novel approach based on the ideas of semi-supervised learning and reinforcement learning.
arXiv Detail & Related papers (2021-06-01T05:46:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.