LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models
- URL: http://arxiv.org/abs/2411.13453v1
- Date: Wed, 20 Nov 2024 16:59:41 GMT
- Title: LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models
- Authors: Salvatore Mario Carta, Stefano Chessa, Giulia Contu, Andrea Corriga, Andrea Deidda, Gianni Fenu, Luca Frigau, Alessandro Giuliani, Luca Grassi, Marco Manolo Manca, Mirko Marras, Francesco Mola, Bastianino Mossa, Piergiorgio Mura, Marco Ortu, Leonardo Piano, Simone Pisano, Alessia Pisu, Alessandro Sebastian Podda, Livio Pompianu, Simone Seu, Sandro Gabriele Tiddia
- Abstract summary: This white paper proposes a framework to generate linguistic tools for low-resource languages.
By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
- Score: 62.47865866398233
- Abstract: Minority languages are vital to preserving cultural heritage, yet they face growing risks of extinction due to limited digital resources and the dominance of artificial intelligence models trained on high-resource languages. This white paper proposes a framework to generate linguistic tools for low-resource languages, focusing on data creation to support the development of language models that can aid in preservation efforts. Sardinian, an endangered language, serves as the case study to demonstrate the framework's effectiveness. By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity and support ongoing efforts in language standardization and revitalization through modern technologies.
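To make the framework's data-creation focus concrete, here is a minimal sketch of the kind of corpus-cleaning step such a pipeline might include (Unicode normalization, short-fragment filtering, and exact deduplication). The function names and thresholds are illustrative assumptions, not LIMBA's actual implementation.

```python
import re
import unicodedata

def normalize(line: str) -> str:
    """Unicode-normalize and collapse whitespace in a raw text line."""
    line = unicodedata.normalize("NFC", line)
    return re.sub(r"\s+", " ", line).strip()

def build_corpus(raw_lines, min_words=3):
    """Deduplicate and lightly filter raw sentences for a training corpus."""
    seen, corpus = set(), []
    for line in raw_lines:
        sent = normalize(line)
        if len(sent.split()) < min_words:   # drop fragments
            continue
        if sent in seen:                    # exact-duplicate removal
            continue
        seen.add(sent)
        corpus.append(sent)
    return corpus

# Example: building a tiny Sardinian corpus from scraped lines.
raw = ["Sa limba sarda est  bella.", "Sa limba sarda est bella.", "ok"]
print(build_corpus(raw))  # one sentence survives normalization + dedup
```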
Related papers
- Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
Lens is a novel approach to enhancing the multilingual capabilities of large language models (LLMs).
It operates by manipulating the hidden representations within the language-agnostic and language-specific subspaces of the top layers of LLMs.
It achieves superior results with far fewer computational resources than existing post-training approaches.
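To make the subspace idea concrete, here is a minimal sketch (an interpretation, not the Lens implementation): estimate a language-specific subspace from sample activations via SVD, then rescale the component of each hidden state that falls inside it. All names and the SVD-based estimate are assumptions.

```python
import torch

def subspace_edit(hidden, lang_acts, k=8, alpha=1.5):
    """Scale the component of `hidden` lying in a language-specific
    subspace estimated from sample activations `lang_acts`.

    hidden:    (seq, d) hidden states from a top decoder layer
    lang_acts: (n, d) activations collected on target-language text
    """
    centered = lang_acts - lang_acts.mean(dim=0, keepdim=True)
    # Top-k right singular vectors span an estimated language subspace.
    _, _, vh = torch.linalg.svd(centered, full_matrices=False)
    basis = vh[:k]                          # (k, d)
    coords = hidden @ basis.T               # project onto the subspace
    in_sub = coords @ basis                 # component inside the subspace
    return hidden + (alpha - 1.0) * in_sub  # amplify (alpha>1) or damp (<1)

h = torch.randn(10, 64)
acts = torch.randn(200, 64)
print(subspace_edit(h, acts).shape)  # torch.Size([10, 64])
```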
arXiv Detail & Related papers (2024-10-06T08:51:30Z)
- SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages [77.75535024869224]
We present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages.
SeaLLMs 3 aims to close the coverage gap for this region by supporting a comprehensive range of its languages, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese.
Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models.
arXiv Detail & Related papers (2024-07-29T03:26:22Z)
- Harnessing the Power of Artificial Intelligence to Vitalize Endangered Indigenous Languages: Technologies and Experiences [31.62071644137294]
We discuss the decreasing diversity of languages in the world and how working with Indigenous languages poses unique ethical challenges for AI and NLP.
We report encouraging results in the development of high-quality machine learning translators for Indigenous languages.
We present prototypes, built in projects with Indigenous communities in Brazil during 2023 and 2024, aimed at facilitating writing.
arXiv Detail & Related papers (2024-07-17T14:46:37Z)
- Recent Advancements and Challenges of Turkic Central Asian Language Processing [4.189204855014775]
Research in NLP for Central Asian Turkic languages faces typical low-resource language challenges.
Recent advancements have included the collection of language-specific datasets and the development of models for downstream tasks.
arXiv Detail & Related papers (2024-07-06T08:58:26Z)
- MoE-CT: A Novel Approach For Large Language Models Training With Resistance To Catastrophic Forgetting [53.77590764277568]
We introduce a novel MoE-CT architecture that separates the base model's learning from the multilingual expansion process.
Our design freezes the original LLM parameters, thus safeguarding its performance in high-resource languages, while an appended MoE module, trained on diverse language datasets, augments low-resource language proficiency.
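A minimal sketch of the frozen-base-plus-MoE pattern described above; the adapter design, sizes, and routing below are illustrative assumptions, not the paper's exact MoE-CT architecture.

```python
import torch
import torch.nn as nn

class MoEAdapter(nn.Module):
    """A small mixture-of-experts block appended to a frozen base model."""
    def __init__(self, d_model: int, n_experts: int = 4, d_ff: int = 256):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, h):
        weights = torch.softmax(self.gate(h), dim=-1)              # (..., E)
        outs = torch.stack([e(h) for e in self.experts], dim=-1)   # (..., d, E)
        mix = (outs * weights.unsqueeze(-2)).sum(dim=-1)
        return h + mix  # residual: base behaviour is preserved when mix is small

# Freeze every base-model parameter; train only the appended adapter.
base = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
for p in base.parameters():
    p.requires_grad_(False)
adapter = MoEAdapter(d_model=64)
x = torch.randn(2, 10, 64)
y = adapter(base(x))  # gradients flow only through the MoE adapter
```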
arXiv Detail & Related papers (2024-06-25T11:03:45Z)
- From Bytes to Borsch: Fine-Tuning Gemma and Mistral for the Ukrainian Language Representation [0.0]
Generative large language models (LLMs) stand at the forefront of innovation, showcasing unparalleled abilities in text understanding and generation.
However, the limited representation of low-resource languages like Ukrainian poses a notable challenge, restricting the reach and relevance of this technology.
Our paper addresses this by fine-tuning the open-source Gemma and Mistral LLMs with Ukrainian datasets, aiming to improve their linguistic proficiency.
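As a hedged illustration of this kind of fine-tuning, the sketch below uses Hugging Face `transformers` with a LoRA adapter via `peft`; the base-model id, dataset file, and hyperparameters are placeholders, not the paper's actual setup.

```python
# Minimal LoRA fine-tuning sketch; model id and data file are placeholders.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

model_id = "mistralai/Mistral-7B-v0.1"          # placeholder base model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32,
                                         task_type="CAUSAL_LM"))

ds = load_dataset("text", data_files="ukrainian_corpus.txt")["train"]
def tokenize(batch):
    out = tok(batch["text"], truncation=True, max_length=512)
    out["labels"] = out["input_ids"].copy()      # causal-LM labels
    return out
ds = ds.map(tokenize, batched=True, remove_columns=["text"])

Trainer(model=model,
        args=TrainingArguments(output_dir="out",
                               per_device_train_batch_size=1,
                               num_train_epochs=1),
        train_dataset=ds).train()
```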
arXiv Detail & Related papers (2024-04-14T04:25:41Z)
- Enhancing Multilingual Capabilities of Large Language Models through Self-Distillation from Resource-Rich Languages [60.162717568496355]
Large language models (LLMs) are typically pre-trained on multilingual corpora, yet in most languages their performance still lags behind that in a few resource-rich languages.
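One plausible reading of self-distillation from a resource-rich language is a pivot loop: the model answers via a strong pivot language, and its own translated answer becomes the supervision target. The helpers `translate` and `generate` below are hypothetical, and this is a guess at the general recipe, not the paper's exact method.

```python
def self_distill_pairs(model, prompts_lr, translate, generate):
    """Build (low-resource prompt, distilled response) training pairs.

    prompts_lr: instructions written in the low-resource language.
    translate(text, src, tgt) and generate(model, prompt) are
    hypothetical helpers supplied by the caller.
    """
    pairs = []
    for prompt in prompts_lr:
        pivot_prompt = translate(prompt, src="lr", tgt="en")
        pivot_answer = generate(model, pivot_prompt)   # strong English answer
        answer = translate(pivot_answer, src="en", tgt="lr")
        pairs.append((prompt, answer))                 # fine-tune on these
    return pairs
```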
arXiv Detail & Related papers (2024-02-19T15:07:32Z)
- NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
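A simple proxy for such a lexical-diversity comparison is the type-token ratio; the sketch below is illustrative and not necessarily the metric used in the paper.

```python
def type_token_ratio(sentences):
    """Unique-word count divided by total word count over a corpus."""
    tokens = [w.lower() for s in sentences for w in s.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Toy comparison: repetitive scraped text vs. native-written paragraphs.
scraped = ["the cat sat", "the cat sat", "the cat sat"]
written = ["ina makan nasi", "adik bermain bola", "ibu pergi ke pasar"]
print(type_token_ratio(scraped), type_token_ratio(written))  # written scores higher
```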
arXiv Detail & Related papers (2023-09-19T14:42:33Z)
- Overcoming Language Disparity in Online Content Classification with Multimodal Learning [22.73281502531998]
Large language models are now the standard for developing state-of-the-art solutions to text detection and classification tasks.
The development of advanced computational techniques and resources is disproportionately focused on the English language.
We explore the promise of incorporating the information contained in images via multimodal machine learning.
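One common way to bring image information into a text classifier is late fusion of precomputed embeddings; the sketch below illustrates that general pattern, not the paper's specific model.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Concatenate text and image embeddings, then classify jointly."""
    def __init__(self, d_text: int, d_image: int, n_classes: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_text + d_image, 256), nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, text_emb, image_emb):
        return self.head(torch.cat([text_emb, image_emb], dim=-1))

# Embeddings would come from e.g. a multilingual text encoder and an
# image encoder; random tensors stand in for them here.
clf = LateFusionClassifier(d_text=384, d_image=512, n_classes=2)
logits = clf(torch.randn(4, 384), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 2])
```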
arXiv Detail & Related papers (2022-05-19T17:56:02Z)
- Not always about you: Prioritizing community needs when developing endangered language technology [5.670857685983896]
We discuss the unique technological, cultural, practical, and ethical challenges that researchers and indigenous speech community members face.
We report the perspectives of language teachers, Master Speakers and elders from indigenous communities, as well as the point of view of academics.
arXiv Detail & Related papers (2022-04-12T05:59:39Z)