InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages
- URL: http://arxiv.org/abs/2512.02213v1
- Date: Mon, 01 Dec 2025 21:25:33 GMT
- Title: InstructLR: A Scalable Approach to Create Instruction Dataset for Under-Resourced Languages
- Authors: Mamadou K. Keita, Sebastien Diarra, Christopher Homan, Seydou Diallo,
- Abstract summary: In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for low-resource languages (LRLs). Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism. InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.
- Score: 5.046479786355341
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs) to support. This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented-generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.
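The abstract describes a generate-then-filter pipeline: LLM-generated candidates pass through an automated RAG-based n-shot filter before reaching human validators. A minimal sketch of that dual-layer structure is below; the function names (`retrieve_n_shot`, `score_fn`, `human_ok`), the token-overlap retriever, and the 0.5 threshold are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Example:
    instruction: str
    response: str

def retrieve_n_shot(candidate: Example, validated: list[Example], n: int = 3) -> list[Example]:
    """Pick the n validated examples most similar to the candidate.
    Token overlap stands in for a real RAG retriever here."""
    cand_tokens = set(candidate.instruction.lower().split())
    scored = sorted(
        validated,
        key=lambda ex: len(cand_tokens & set(ex.instruction.lower().split())),
        reverse=True,
    )
    return scored[:n]

def automated_filter(candidate: Example, validated: list[Example],
                     score_fn, threshold: float = 0.5) -> bool:
    """Layer 1: build an n-shot prompt from retrieved exemplars and let an
    LLM judge (score_fn, supplied by the caller) rate the candidate."""
    shots = retrieve_n_shot(candidate, validated)
    prompt = "\n\n".join(f"Q: {ex.instruction}\nA: {ex.response}" for ex in shots)
    prompt += f"\n\nQ: {candidate.instruction}\nA: {candidate.response}"
    return score_fn(prompt) >= threshold

def build_dataset(candidates, validated, score_fn, human_ok) -> list[Example]:
    """Layer 2: only candidates passing the automated filter reach the
    human-in-the-loop validator (human_ok)."""
    kept = []
    for cand in candidates:
        if automated_filter(cand, validated, score_fn) and human_ok(cand):
            kept.append(cand)
    return kept
```

In practice `score_fn` would wrap an LLM call and `human_ok` a review interface; both are left as callables so the two layers stay decoupled.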
Related papers
- AgenticTagger: Structured Item Representation for Recommendation with LLM Agents [58.12004213978182]
AgenticTagger is a framework that queries LLMs for representing items with sequences of text descriptors. To effectively and efficiently ground vocabulary in the item corpus of interest, we design a multi-agent reflection mechanism. Experiments on public and private data show AgenticTagger brings consistent improvements across diverse recommendation scenarios.
arXiv Detail & Related papers (2026-02-05T18:01:37Z) - "Don't Teach Minerva": Guiding LLMs Through Complex Syntax for Faithful Latin Translation with RAG [0.5076419064097734]
This paper introduces a reproducible draft-based refinement pipeline that elevates open-source Large Language Models to a performance level statistically comparable to top-tier proprietary systems. We demonstrate the robustness of this approach on two distinct benchmarks: a standard in-domain test set (Rosenthal, 2023) and a new, challenging out-of-domain (OOD) set of 12th-century Latin letters (2025).
arXiv Detail & Related papers (2025-11-03T11:11:27Z) - CorefInst: Leveraging LLMs for Multilingual Coreference Resolution [0.0]
Coreference Resolution (CR) is a crucial yet challenging task in natural language understanding. This study introduces the first multilingual CR methodology which leverages decoder-only LLMs to handle both overt and zero mentions.
arXiv Detail & Related papers (2025-09-22T08:35:21Z) - IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs. The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z) - Large Language Models are Good Relational Learners [55.40941576497973]
We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)-based encoder to generate structured relational prompts for large language models (LLMs). Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to process and reason over complex entity relationships.
arXiv Detail & Related papers (2025-06-06T04:07:55Z) - UrduLLaMA 1.0: Dataset Curation, Preprocessing, and Evaluation in Low-Resource Settings [0.7874708385247353]
This paper introduces UrduLLaMA 1.0, a model derived from the open-source Llama-3.1-8B-Instruct architecture. We leverage Low-Rank Adaptation (LoRA) to fine-tune the model on 41,000 Urdu instructions and approximately 50,000 English-Urdu translation pairs.
arXiv Detail & Related papers (2025-02-24T08:38:21Z) - HiVeGen -- Hierarchical LLM-based Verilog Generation for Scalable Chip Design [24.46771930751068]
HiVeGen is a hierarchical Verilog generation framework that decomposes generation tasks into hierarchical submodules. It integrates automatic Design Space Exploration (DSE) into hierarchy-aware prompt generation, introducing weight-based retrieval to enhance code reuse, and supports real-time human-computer interaction to lower error-correction cost, significantly improving the quality of generated designs.
arXiv Detail & Related papers (2024-12-06T19:37:53Z) - MURI: High-Quality Instruction Tuning Datasets for Low-Resource Languages via Reverse Instructions [54.08017526771947]
Multilingual Reverse Instructions (MURI) generates high-quality instruction tuning datasets for low-resource languages.
MURI produces instruction-output pairs from existing human-written texts in low-resource languages.
Our dataset, MURI-IT, includes more than 2 million instruction-output pairs across 200 languages.
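The reverse-instructions idea summarized above inverts the usual direction: an existing human-written passage becomes the *output*, and an LLM writes the instruction that would elicit it. A minimal sketch, with `instruct_fn` as a hypothetical stand-in for the actual LLM call used by MURI:

```python
def reverse_instruction(text: str, instruct_fn) -> dict:
    """Build one instruction-output pair by reverse instruction:
    the human-written text is kept verbatim as the output, and
    instruct_fn (an LLM call, stubbed by the caller) proposes the
    instruction that would have produced it."""
    instruction = instruct_fn(text)
    return {"instruction": instruction, "output": text}

def build_pairs(corpus: list[str], instruct_fn) -> list[dict]:
    """Map a corpus of existing texts to instruction-output pairs."""
    return [reverse_instruction(text, instruct_fn) for text in corpus]
```

The appeal for low-resource languages is that the output side is guaranteed-fluent native text; only the shorter instruction side is machine-generated.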
arXiv Detail & Related papers (2024-09-19T17:59:20Z) - Guiding Large Language Models to Generate Computer-Parsable Content [0.6798775532273751]
We propose a method to guide Large Language Models (LLMs) in generating structured content adhering to specific conventions without fine-tuning.
This enhances stability and consistency in generating target data structures, types, or instructions, reducing application development complexities.
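One common way to guide an LLM toward computer-parsable output without fine-tuning, in the spirit of the summary above, is a prompt-parse-retry loop: state the target structure in the prompt, validate the response, and feed the parse error back on failure. This is a generic sketch, not the paper's specific method; `generate_fn` stands in for any LLM call.

```python
import json

def constrained_generate(prompt: str, generate_fn, required_keys: set, max_retries: int = 2) -> dict:
    """Ask for JSON with the given keys; on a parse or schema failure,
    retry with the error appended to the prompt so the model can self-correct."""
    schema_hint = f"Respond with a JSON object containing keys: {sorted(required_keys)}."
    attempt_prompt = f"{prompt}\n{schema_hint}"
    for _ in range(max_retries + 1):
        raw = generate_fn(attempt_prompt)
        try:
            obj = json.loads(raw)
            if required_keys <= obj.keys():
                return obj  # valid JSON with all required keys
            missing = required_keys - obj.keys()
            attempt_prompt = f"{prompt}\n{schema_hint}\nMissing keys: {sorted(missing)}."
        except json.JSONDecodeError as err:
            attempt_prompt = f"{prompt}\n{schema_hint}\nPrevious output was invalid JSON: {err}."
    raise ValueError("could not obtain valid structured output")
```

Validating at the application boundary like this keeps the base model frozen, which is the appeal of guidance-based approaches over fine-tuning.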
arXiv Detail & Related papers (2024-04-08T13:22:24Z) - A Simple but Effective Approach to Improve Structured Language Model Output for Information Extraction [11.165093163378152]
Large language models (LLMs) have demonstrated impressive abilities in generating unstructured natural language according to instructions.
This paper introduces an efficient method, G&O, to enhance their structured text generation capabilities.
arXiv Detail & Related papers (2024-02-20T20:42:02Z) - CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages [86.90220551111096]
Training datasets for large language models (LLMs) are often not fully disclosed.
We present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages.
arXiv Detail & Related papers (2023-09-17T23:49:10Z) - Extrapolating Multilingual Understanding Models as Multilingual Generators [82.1355802012414]
This paper explores methods to endow multilingual understanding models with generation abilities, yielding a unified model.
We propose a Semantic-Guided Alignment-then-Denoising (SGA) approach to adapt an encoder into a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.