LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish
- URL: http://arxiv.org/abs/2510.07074v1
- Date: Wed, 08 Oct 2025 14:35:59 GMT
- Title: LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish
- Authors: Fred Philippy, Laura Bernardy, Siwen Guo, Jacques Klein, Tegawendé F. Bissyandé,
- Abstract summary: Traditional reliance on machine translation often introduces semantic misalignment and cultural inaccuracies.<n>We create a cross-lingual instruction tuning dataset for Luxembourgish without resorting to machine-generated translations.<n>By leveraging aligned data from English, French, and German, we build a high-quality dataset that preserves linguistic and cultural nuances.
- Score: 11.26630017746721
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instruction tuning has become a key technique for enhancing the performance of large language models, enabling them to better follow human prompts. However, low-resource languages such as Luxembourgish face severe limitations due to the lack of high-quality instruction datasets. Traditional reliance on machine translation often introduces semantic misalignment and cultural inaccuracies. In this work, we address these challenges by creating a cross-lingual instruction tuning dataset for Luxembourgish, without resorting to machine-generated translations into it. Instead, by leveraging aligned data from English, French, and German, we build a high-quality dataset that preserves linguistic and cultural nuances. We provide evidence that cross-lingual instruction tuning not only improves representational alignment across languages but also the model's generative capabilities in Luxembourgish. This highlights how cross-lingual data curation can avoid the common pitfalls of machine-translated data and directly benefit low-resource language development.
Related papers
- What Drives Cross-lingual Ranking? Retrieval Approaches with Multilingual Language Models [0.19116784879310025]
Cross-lingual information retrieval is challenging due to disparities in resources, scripts, and weak cross-lingual semantic alignment in embedding models.<n>Existing pipelines often rely on translation and monolingual retrievals, which add computational overhead and noise, performance.<n>This work systematically evaluates four intervention types, namely document translation, multilingual dense retrieval with pretrained encoders, contrastive learning at word, phrase, and query-document levels, and cross-encoder re-ranking, across three benchmark datasets.
arXiv Detail & Related papers (2025-11-24T17:17:40Z) - LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data [2.383798778903081]
LuxIT is a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge.<n>We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish.
arXiv Detail & Related papers (2025-10-28T14:02:55Z) - Text Generation Models for Luxembourgish with Limited Data: A Balanced Multilingual Strategy [7.59001382786429]
This paper addresses the challenges in developing language models for less-represented languages, with a focus on Luxembourgish.<n>We propose a novel text generation model based on the T5 architecture, combining limited Luxembourgish data with equal amounts of German and French data.<n>For the evaluation, we introduce LuxGen, a text generation benchmark that is the first of its kind for Luxembourgish.
arXiv Detail & Related papers (2024-12-12T16:23:12Z) - LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings [8.839362558895594]
Sentence embedding models rely heavily on parallel data, which can be scarce for many low-resource languages, including Luxembourgish.<n>This scarcity results in suboptimal performance of monolingual and cross-lingual sentence embedding models for these languages.<n>We present evidence suggesting that including low-resource languages in parallel training datasets can be more advantageous for other low-resource languages.
arXiv Detail & Related papers (2024-12-04T14:02:12Z) - MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions [54.08017526771947]
Multilingual Reverse Instructions (MURI) generates high-quality instruction tuning datasets for low-resource languages.
MURI produces instruction-output pairs from existing human-written texts in low-resource languages.
Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages.
arXiv Detail & Related papers (2024-09-19T17:59:20Z) - X-Instruction: Aligning Language Model in Low-resource Languages with Self-curated Cross-lingual Instructions [43.90353059292894]
Large language models respond well in high-resource languages like English but struggle in low-resource languages.
We propose a novel method to construct cross-lingual instruction following samples with instruction in English and response in low-resource languages.
arXiv Detail & Related papers (2024-05-30T06:45:23Z) - Advancing Translation Preference Modeling with RLHF: A Step Towards
Cost-Effective Solution [57.42593422091653]
We explore leveraging reinforcement learning with human feedback to improve translation quality.
A reward model with strong language capabilities can more sensitively learn the subtle differences in translation quality.
arXiv Detail & Related papers (2024-02-18T09:51:49Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - InstructAlign: High-and-Low Resource Language Alignment via Continual
Crosslingual Instruction Tuning [66.31509106146605]
Large language models (LLMs) that are tuned with instructions have demonstrated remarkable capabilities in various tasks and languages.
However, their ability to generalize to underrepresented languages is limited due to the scarcity of available data.
We propose InstructAlign which uses continual crosslingual instruction tuning to enable LLMs to align new unseen languages with previously learned high-resource languages.
arXiv Detail & Related papers (2023-05-23T02:51:34Z) - Romanization-based Large-scale Adaptation of Multilingual Language
Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z) - XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning [68.57658225995966]
Cross-lingual Choice of Plausible Alternatives (XCOPA) is a typologically diverse multilingual dataset for causal commonsense reasoning in 11 languages.
We evaluate a range of state-of-the-art models on this novel dataset, revealing that the performance of current methods falls short compared to translation-based transfer.
arXiv Detail & Related papers (2020-05-01T12:22:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.