LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
- URL: http://arxiv.org/abs/2510.24434v1
- Date: Tue, 28 Oct 2025 14:02:55 GMT
- Title: LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data
- Authors: Julian Valline, Cedric Lothritz, Jordi Cabot
- Abstract summary: LuxIT is a novel, monolingual instruction tuning dataset for Luxembourgish, developed to mitigate the lack of high-quality training data in low-resource linguistic settings. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish.
- Score: 2.383798778903081
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach. To investigate the practical utility of the dataset, we fine-tune several smaller-scale LLMs on LuxIT. Subsequent benchmarking against their base models on Luxembourgish language proficiency examinations, however, yields mixed results, with performance varying significantly across different models. LuxIT represents a critical contribution to Luxembourgish natural language processing and offers a replicable monolingual methodology, though our findings highlight the need for further research to optimize its application.
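The abstract describes a two-stage pipeline: generate instruction-response pairs from native seed texts, then filter them with an LLM-as-a-judge. A minimal sketch of that control flow follows. It is not the authors' implementation: the names `InstructionPair`, `build_dataset`, `toy_generate`, and `toy_judge`, the 1-to-5 score scale, and the `min_score` threshold are all assumptions made for illustration, and the two toy functions stand in for real model calls that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class InstructionPair:
    """One synthetic training example derived from a seed text (hypothetical schema)."""
    instruction: str
    response: str
    source_text: str


def build_dataset(
    seed_texts: Iterable[str],
    generate: Callable[[str], Optional[InstructionPair]],
    judge: Callable[[InstructionPair], float],
    min_score: float = 4.0,  # assumed threshold, not taken from the paper
) -> List[InstructionPair]:
    """Generate one pair per seed text and keep those the judge scores >= min_score."""
    kept: List[InstructionPair] = []
    for text in seed_texts:
        pair = generate(text)  # in practice: a call to the generator LLM
        if pair is None:       # generator declined or produced nothing usable
            continue
        if judge(pair) >= min_score:  # in practice: a call to the judge LLM
            kept.append(pair)
    return kept


def toy_generate(text: str) -> Optional[InstructionPair]:
    """Stand-in generator: returns None for empty seeds, otherwise a trivial pair."""
    if not text.strip():
        return None
    return InstructionPair(
        instruction=f"Summarize: {text}",
        response=text,
        source_text=text,
    )


def toy_judge(pair: InstructionPair) -> float:
    """Stand-in judge: scores by seed length only, purely to exercise the filter."""
    return 5.0 if len(pair.source_text) > 3 else 1.0
```

Calling `build_dataset(seeds, toy_generate, toy_judge)` on a list of strings returns only the pairs that pass the filter. Passing the generator and judge as callables keeps the filtering logic independent of any particular model or API client.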
Related papers
- Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework [38.98519875112922]
Vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. We reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We observe a +9.5% improvement over LLaVA-1.6-una-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations.
arXiv Detail & Related papers (2026-02-15T09:54:40Z)
- LuxInstruct: A Cross-Lingual Instruction Tuning Dataset For Luxembourgish [11.26630017746721]
Traditional reliance on machine translation often introduces semantic misalignment and cultural inaccuracies. We create a cross-lingual instruction tuning dataset for Luxembourgish without resorting to machine-generated translations. By leveraging aligned data from English, French, and German, we build a high-quality dataset that preserves linguistic and cultural nuances.
arXiv Detail & Related papers (2025-10-08T14:35:59Z)
- Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
- LLMic: Romanian Foundation Language Model [76.09455151754062]
We present LLMic, a foundation language model designed specifically for the Romanian Language. We show that fine-tuning LLMic for language translation after the initial pretraining phase outperforms existing solutions in English-to-Romanian translation tasks.
arXiv Detail & Related papers (2025-01-13T22:14:45Z)
- Text Generation Models for Luxembourgish with Limited Data: A Balanced Multilingual Strategy [7.59001382786429]
This paper addresses the challenges in developing language models for less-represented languages, with a focus on Luxembourgish. We propose a novel text generation model based on the T5 architecture, combining limited Luxembourgish data with equal amounts of German and French data. For the evaluation, we introduce LuxGen, a text generation benchmark that is the first of its kind for Luxembourgish.
arXiv Detail & Related papers (2024-12-12T16:23:12Z)
- LuxEmbedder: A Cross-Lingual Approach to Enhanced Luxembourgish Sentence Embeddings [8.839362558895594]
Sentence embedding models rely heavily on parallel data, which can be scarce for many low-resource languages, including Luxembourgish. This scarcity results in suboptimal performance of monolingual and cross-lingual sentence embedding models for these languages. We present evidence suggesting that including low-resource languages in parallel training datasets can be more advantageous for other low-resource languages.
arXiv Detail & Related papers (2024-12-04T14:02:12Z)
- Assessing Dialect Fairness and Robustness of Large Language Models in Reasoning Tasks [68.33068005789116]
We introduce ReDial, a benchmark containing 1.2K+ parallel query pairs in Standardized English and AAVE. We evaluate widely used models, including GPT, Claude, Llama, Mistral, and the Phi model families. Our work establishes a systematic and objective framework for analyzing LLM bias in dialectal queries.
arXiv Detail & Related papers (2024-10-14T18:44:23Z)
- Generating bilingual example sentences with large language models as lexicography assistants [2.6550899846546527]
We present a study of LLMs' performance in generating and rating example sentences for bilingual dictionaries across languages with varying resource levels.
We evaluate the quality of LLM-generated examples against the GDEX (Good Dictionary EXample) criteria: typicality, informativeness, and intelligibility.
arXiv Detail & Related papers (2024-10-04T06:45:48Z)
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering.
The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored.
We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation [94.59630161324013]
We propose CoAnnotating, a novel paradigm for Human-LLM co-annotation of unstructured texts at scale.
Our empirical study shows CoAnnotating to be an effective means to allocate work from results on different datasets, with up to 21% performance improvement over random baseline.
arXiv Detail & Related papers (2023-10-24T08:56:49Z)
- CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z)
- Extrapolating Large Language Models to Non-English by Aligning Languages [109.09051737966178]
Existing large language models show disparate capability across different languages.
In this paper, we empower pre-trained LLMs on non-English languages by building semantic alignment across languages.
arXiv Detail & Related papers (2023-08-09T13:32:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.