AI-Driven Generation of Old English: A Framework for Low-Resource Languages
- URL: http://arxiv.org/abs/2507.20111v1
- Date: Sun, 27 Jul 2025 03:29:19 GMT
- Title: AI-Driven Generation of Old English: A Framework for Low-Resource Languages
- Authors: Rodrigo Gabriel Salazar Alva, Matías Nuñez, Cristian López, Javier Martín Arista
- Abstract summary: Preserving ancient languages is essential for understanding humanity's cultural and linguistic heritage. Old English remains critically under-resourced, limiting its accessibility to modern natural language processing (NLP) techniques. We present a scalable framework that uses advanced large language models (LLMs) to generate high-quality Old English texts.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preserving ancient languages is essential for understanding humanity's cultural and linguistic heritage, yet Old English remains critically under-resourced, limiting its accessibility to modern natural language processing (NLP) techniques. We present a scalable framework that uses advanced large language models (LLMs) to generate high-quality Old English texts, addressing this gap. Our approach combines parameter-efficient fine-tuning (Low-Rank Adaptation, LoRA), data augmentation via backtranslation, and a dual-agent pipeline that separates the tasks of content generation (in English) and translation (into Old English). Evaluation with automated metrics (BLEU, METEOR, and CHRF) shows significant improvements over baseline models, with BLEU scores increasing from 26 to over 65 for English-to-Old English translation. Expert human assessment also confirms high grammatical accuracy and stylistic fidelity in the generated texts. Beyond expanding the Old English corpus, our method offers a practical blueprint for revitalizing other endangered languages, effectively uniting AI innovation with the goals of cultural preservation.
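To make the pipeline concrete, here is a minimal sketch of how the dual-agent setup, LoRA fine-tuning, and automatic evaluation could be wired together with Hugging Face transformers, peft, and sacrebleu. The base model name, prompts, and LoRA hyperparameters are illustrative assumptions, not the authors' reported configuration.

```python
# A sketch of the dual-agent pipeline, assuming a generic open-weights causal
# LM; model name, prompts, and LoRA hyperparameters are illustrative guesses.
import sacrebleu
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # hypothetical base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
content_agent = AutoModelForCausalLM.from_pretrained(BASE_MODEL)    # Agent 1
translator_base = AutoModelForCausalLM.from_pretrained(BASE_MODEL)  # Agent 2

# Parameter-efficient fine-tuning: only low-rank adapter matrices are trained
# on the scarce Old English parallel data (augmented via backtranslation).
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
translator = get_peft_model(translator_base, lora)
translator.print_trainable_parameters()  # typically well under 1% of weights

def generate(model, prompt, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

# Agent 1 drafts content in Modern English; Agent 2 translates it.
draft = generate(content_agent,
                 "Write a short homily in the style of Aelfric, in Modern English.\n")
old_english = generate(translator,
                       f"Translate into Old English:\n{draft}\nOld English:\n")

# Automatic evaluation against held-out references (BLEU and chrF).
refs = [["<reference Old English translation>"]]
print(sacrebleu.corpus_bleu([old_english], refs).score)
print(sacrebleu.corpus_chrf([old_english], refs).score)
```

Separating drafting from translation lets the content agent exploit the model's full English fluency, so only the translation step depends on scarce Old English data.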
Related papers
- From Data Scarcity to Data Care: Reimagining Language Technologies for Serbian and other Low-Resource Languages [0.0]
This study examines the structural, historical, and sociotechnical factors shaping language technology development for low-resource languages in the AI age. It traces challenges rooted in the historical destruction of Serbian textual heritage, intensified by contemporary issues. To address these challenges, the study proposes Data Care, a framework grounded in CARE principles.
arXiv Detail & Related papers (2025-12-11T13:29:25Z)
- Large Language Model Data Generation for Enhanced Intent Recognition in German Speech [14.788624194380825]
Intent recognition (IR) for speech commands is essential for artificial intelligence (AI) assistant systems. We propose a novel approach that combines LLM-based data generation with an adapted Whisper ASR model fine-tuned on elderly German speech. We generate synthetic speech with a text-to-speech model and conduct extensive cross-dataset testing (a sketch of the synthetic-speech step follows this entry).
arXiv Detail & Related papers (2025-08-08T12:54:09Z)
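The sketch referenced above: generating labelled synthetic speech for intent-recognition training, assuming the Coqui TTS toolkit. The intents and command texts are invented for illustration.

```python
# A hedged sketch of synthetic-speech data generation for intent recognition,
# assuming the Coqui TTS toolkit; intents and command texts are invented.
import os
from TTS.api import TTS

# Command texts like these could themselves come from an LLM, per the paper.
commands = {
    "light_on": ["Mach das Licht an.", "Schalte bitte das Licht ein."],
    "call_help": ["Ruf den Notdienst an.", "Ich brauche Hilfe."],
}

os.makedirs("synthetic", exist_ok=True)
tts = TTS(model_name="tts_models/de/thorsten/vits")  # a German VITS voice
for intent, texts in commands.items():
    for i, text in enumerate(texts):
        # Each utterance becomes a labelled training example for the intent
        # classifier, paired downstream with its Whisper transcript.
        tts.tts_to_file(text=text, file_path=f"synthetic/{intent}_{i}.wav")
```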
- NeoBabel: A Multilingual Open Tower for Visual Generation [32.79724699684266]
We introduce NeoBabel, a novel multilingual image generation framework. It supports six languages: English, Chinese, Dutch, French, Hindi, and Persian. It achieves state-of-the-art multilingual performance while retaining strong English capability.
arXiv Detail & Related papers (2025-07-08T16:19:45Z)
- Overcoming Data Scarcity in Generative Language Modelling for Low-Resource Languages: A Systematic Review [0.7366405857677227]
This paper focuses on strategies to address data scarcity in generative language modelling for low-resource languages (LRLs). We identify, categorise, and evaluate technical approaches, including monolingual data augmentation, back-translation, multilingual training, and prompt engineering (a back-translation sketch follows this entry). We conclude with recommendations for extending these methods to a wider range of LRLs and outline open challenges in building equitable generative language systems.
arXiv Detail & Related papers (2025-05-07T16:04:45Z)
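The back-translation sketch referenced above: monolingual target-language text is machine-translated into English so that synthetic (English source, authentic target) pairs can augment scarce parallel data. The model and language codes are placeholders; NLLB with a Quechua code is shown purely as an example.

```python
# A minimal sketch of back-translation for a low-resource language (LRL).
# Monolingual target-language sentences are translated into English, and the
# resulting synthetic pairs augment the scarce parallel corpus.
from transformers import pipeline

# Any existing target->English system works as the back-translator; a
# multilingual model such as NLLB covers many low-resource languages.
back_translate = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="quy_Latn", tgt_lang="eng_Latn",  # e.g. Quechua -> English
)

monolingual_target = [
    "..."  # authentic sentences in the low-resource language
]

augmented_pairs = []
for sentence in monolingual_target:
    english = back_translate(sentence)[0]["translation_text"]
    # Only the source side is synthetic; the target side is human text,
    # which is why back-translation tends to help translation *into* the LRL.
    augmented_pairs.append({"src": english, "tgt": sentence})
```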
- Preserving Cultural Identity with Context-Aware Translation Through Multi-Agent AI Systems [0.4218593777811082]
Language is a cornerstone of cultural identity, yet globalization and the dominance of major languages have placed nearly 3,000 languages at risk of extinction. Existing AI-driven translation models prioritize efficiency but often fail to capture cultural nuances, idiomatic expressions, and historical significance. We propose a multi-agent AI framework designed for culturally adaptive translation in underserved language communities.
arXiv Detail & Related papers (2025-03-05T06:43:59Z)
- QueEn: A Large Language Model for Quechua-English Translation [20.377876059048692]
We propose QueEn, a novel approach for Quechua-English translation that combines Retrieval-Augmented Generation (RAG) with parameter-efficient fine-tuning techniques (a toy RAG sketch follows this entry). Our approach substantially outperforms baseline models, with a BLEU score of 17.6 compared to 1.5 for standard GPT models.
arXiv Detail & Related papers (2024-12-06T17:04:21Z)
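The toy RAG sketch referenced above: retrieve relevant dictionary entries and prepend them to the translation prompt. The corpus, embedding model, and prompt format are illustrative assumptions, not QueEn's actual components.

```python
# A toy sketch of retrieval-augmented translation: look up related dictionary
# entries and prepend them as evidence. Corpus and model are illustrative.
from sentence_transformers import SentenceTransformer, util

# A tiny stand-in for a retrieval corpus of Quechua lexical resources.
corpus = ["allqu = dog", "wasi = house", "mikhuy = to eat"]
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
corpus_emb = embedder.encode(corpus, convert_to_tensor=True)

def build_prompt(quechua_sentence: str, k: int = 2) -> str:
    query_emb = embedder.encode(quechua_sentence, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k)[0]
    evidence = "\n".join(corpus[h["corpus_id"]] for h in hits)
    return (f"Dictionary entries:\n{evidence}\n\n"
            f"Translate from Quechua to English: {quechua_sentence}\nEnglish:")

# The prompt is then passed to a (LoRA-fine-tuned) LLM, as in the main paper.
print(build_prompt("Allquqa wasipi mikhun."))
```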
- LIMBA: An Open-Source Framework for the Preservation and Valorization of Low-Resource Languages using Generative Models [62.47865866398233]
This white paper proposes a framework to generate linguistic tools for low-resource languages.
By addressing the data scarcity that hinders intelligent applications for such languages, we contribute to promoting linguistic diversity.
arXiv Detail & Related papers (2024-11-20T16:59:41Z)
- Trustworthy Alignment of Retrieval-Augmented Large Language Models via Reinforcement Learning [84.94709351266557]
We focus on the trustworthiness of language models with respect to retrieval augmentation.
We argue that retrieval-augmented language models have the inherent capability to supply responses according to both contextual and parametric knowledge.
Inspired by aligning language models with human preferences, we take a first step towards aligning retrieval-augmented language models to a state where they respond relying solely on external evidence.
arXiv Detail & Related papers (2024-10-22T09:25:21Z)
- Lens: Rethinking Multilingual Enhancement for Large Language Models [70.85065197789639]
We propose Lens, a novel approach to enhancing multilingual capabilities in large language models (LLMs). Lens operates on two subspaces: a language-agnostic subspace, where it aligns target languages with the central language to inherit strong semantic representations, and a language-specific subspace, where it separates target and central languages to preserve linguistic specificity (a toy illustration follows this entry). Lens significantly improves multilingual performance while maintaining the model's English proficiency, achieving better results at lower computational cost than existing post-training approaches.
arXiv Detail & Related papers (2024-10-06T08:51:30Z)
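The toy illustration referenced above shows only the general intuition of splitting hidden states into a shared component and a language-specific remainder; this is emphatically not Lens's actual algorithm, and the mean-difference axis is a crude stand-in.

```python
# A toy numpy illustration of the two-subspace intuition: decompose hidden
# states into a shared component and a language-specific remainder.
import numpy as np

rng = np.random.default_rng(0)
h_en = rng.normal(size=(100, 64))            # central-language hidden states
h_tgt = rng.normal(loc=0.5, size=(100, 64))  # target-language hidden states

# Take the mean-difference direction as a crude "language-specific" axis.
d = h_en.mean(axis=0) - h_tgt.mean(axis=0)
d /= np.linalg.norm(d)

def split(h):
    specific = np.outer(h @ d, d)  # projection onto the language axis
    agnostic = h - specific        # the complement, shared across languages
    return agnostic, specific

agn_tgt, spec_tgt = split(h_tgt)
agn_en, _ = split(h_en)
# "Alignment" in the shared subspace: pull target means toward the central
# language there, while leaving the language-specific component untouched.
aligned_tgt = agn_tgt + (agn_en.mean(axis=0) - agn_tgt.mean(axis=0)) + spec_tgt
```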
- Harnessing the Power of Artificial Intelligence to Vitalize Endangered Indigenous Languages: Technologies and Experiences [31.62071644137294]
We discuss the decreasing diversity of languages in the world and how working with Indigenous languages poses unique ethical challenges for AI and NLP.
We report encouraging results in the development of high-quality machine learning translators for Indigenous languages.
We present prototypes built in 2023 and 2024 in projects with Indigenous communities in Brazil, aimed at facilitating writing.
arXiv Detail & Related papers (2024-07-17T14:46:37Z)
- Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models [60.09618700199927]
We propose adaptation methods that integrate LoRA into existing SSL models to extend them to new languages.
We also develop preservation strategies, including data combination and re-clustering, to retain abilities on existing languages.
arXiv Detail & Related papers (2024-06-20T08:13:30Z)
- No Language Left Behind: Scaling Human-Centered Machine Translation [69.28110770760506]
We create datasets and models aimed at narrowing the performance gap between low and high-resource languages.
We propose multiple architectural and training improvements to counteract overfitting while training on thousands of tasks.
Our model achieves an improvement of 44% BLEU relative to the previous state-of-the-art.
arXiv Detail & Related papers (2022-07-11T07:33:36Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units such as syllables and phonemes (a minimal sketch follows this entry).
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
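The sketch referenced above: a minimal LSTM language model over phoneme IDs with the standard next-token objective. The vocabulary and dimensions are toy values, not the paper's configuration.

```python
# A minimal sketch of a generative LM over sub-word linguistic units (here,
# phonemes); vocabulary and dimensions are toy values.
import torch
import torch.nn as nn

PHONEMES = ["<pad>", "<s>", "</s>", "a", "b", "d", "e", "g", "i", "k",
            "m", "n", "o", "p", "s", "t", "u"]
V = len(PHONEMES)

class PhonemeLM(nn.Module):
    def __init__(self, d_emb=64, d_hid=256):
        super().__init__()
        self.emb = nn.Embedding(V, d_emb, padding_idx=0)
        self.lstm = nn.LSTM(d_emb, d_hid, num_layers=2, batch_first=True)
        self.out = nn.Linear(d_hid, V)

    def forward(self, x):
        h, _ = self.lstm(self.emb(x))
        return self.out(h)  # next-phoneme logits at every step

model = PhonemeLM()
# Standard next-token objective: targets are the inputs shifted by one step.
seq = torch.randint(3, V, (8, 20))  # a toy batch of phoneme ID sequences
logits = model(seq[:, :-1])
loss = nn.functional.cross_entropy(logits.reshape(-1, V),
                                   seq[:, 1:].reshape(-1))
loss.backward()
```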