The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages
- URL: http://arxiv.org/abs/2509.21294v1
- Date: Thu, 25 Sep 2025 15:13:00 GMT
- Title: The role of synthetic data in Multilingual, Multi-cultural AI systems: Lessons from Indic Languages
- Authors: Pranjal A. Chitale, Varun Gumma, Sanchit Ahuja, Prashant Kodali, Manan Uppadhyay, Deepthi Sudharsan, Sunayana Sitaram,
- Abstract summary: We introduce Updesh, a large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that the generated data is of high quality. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks.
- Score: 18.087937520281965
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Developing AI systems that operate effectively across languages while remaining culturally grounded is a long-standing challenge, particularly in low-resource settings. Synthetic data provides a promising avenue, yet its effectiveness in multilingual and multicultural contexts remains underexplored. We investigate the creation and impact of synthetic, culturally contextualized datasets for Indian languages through a bottom-up generation strategy that prompts large open-source LLMs (>= 235B parameters) to ground data generation in language-specific Wikipedia content. This approach complements the dominant top-down paradigm of translating synthetic datasets from high-resource languages such as English. We introduce Updesh, a high-quality large-scale synthetic instruction-following dataset comprising 9.5M data points across 13 Indian languages, encompassing diverse reasoning and generative tasks with an emphasis on long-context, multi-turn capabilities, and alignment with Indian cultural contexts. A comprehensive evaluation incorporating both automated metrics and human annotation across 10k assessments indicates that the generated data is of high quality, though human evaluation highlights areas for further improvement. Additionally, we perform downstream evaluations by fine-tuning models on our dataset and assessing performance across 15 diverse multilingual datasets. Models trained on Updesh consistently achieve significant gains on generative tasks and remain competitive on multiple-choice style NLU tasks. Notably, relative improvements are most pronounced in low- and medium-resource languages, narrowing their gap with high-resource languages. These findings provide empirical evidence that effective multilingual AI requires multi-faceted data curation and generation strategies that incorporate context-aware, culturally grounded methodologies.
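The bottom-up strategy in the abstract can be sketched as a prompt-construction step: instead of translating English instructions (top-down), each generation request is grounded in a language-specific Wikipedia passage. The function and prompt wording below are illustrative assumptions, not the paper's actual templates or models.

```python
# Minimal sketch of Wikipedia-grounded, bottom-up data generation.
# All prompt wording here is hypothetical; the paper's real templates differ.

def build_grounded_prompt(passage: str, language: str, task: str) -> str:
    """Assemble a generation prompt that grounds the LLM in local content
    rather than in translated English source data."""
    return (
        f"You are writing {language} instruction-following data.\n"
        "Ground every question and answer in the passage below; "
        "preserve cultural context rather than translating from English.\n\n"
        f"Passage ({language} Wikipedia):\n{passage}\n\n"
        f"Task type: {task}\n"
        f"Produce one multi-turn instruction/response pair in {language}."
    )

if __name__ == "__main__":
    passage = "पोंगल दक्षिण भारत का एक प्रमुख फ़सल उत्सव है।"
    print(build_grounded_prompt(passage, "Hindi", "long-context QA"))
```

The resulting string would then be sent to a large open-source LLM; the grounding passage is what keeps the generated instruction culturally anchored.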
Related papers
- Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework [38.98519875112922]
Vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. We reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations.
arXiv Detail & Related papers (2026-02-15T09:54:40Z) - BhashaKritika: Building Synthetic Pretraining Data at Scale for Indic Languages [4.279942349440352]
We present a systematic study on the generation and evaluation of synthetic multilingual pretraining data for Indic languages. We construct a large-scale synthetic dataset, BhashaKritika, comprising 540B tokens using 5 different techniques for 10 languages. We analyze how language choice, both in the prompt instructions and document grounding, affects data quality.
arXiv Detail & Related papers (2025-11-13T14:12:44Z) - Pragyaan: Designing and Curating High-Quality Cultural Post-Training Datasets for Indian Languages [2.403023083920947]
Existing open-source datasets often lack multilingual coverage, cultural grounding, and task diversity. We introduce a human-in-the-loop pipeline that combines translations with synthetic expansion to produce reliable and diverse Indic post-training data. Our dataset protocol incorporates several often-overlooked dimensions and emphasizes task diversity, multi-turn dialogue, instruction fidelity, safety alignment, and preservation of cultural nuance.
arXiv Detail & Related papers (2025-10-08T13:23:45Z) - A method for improving multilingual quality and diversity of instruction fine-tuning datasets [29.07537849245622]
We introduce Multilingual Data Quality and Diversity (M-DaQ) to improve Multilingual Instruction Fine-Tuning (IFT). M-DaQ is a novel method for improving LLMs' multilinguality by selecting high-quality and semantically diverse multilingual IFT samples. Empirical results across 18 languages demonstrate that models fine-tuned with M-DaQ achieve significant performance gains over vanilla baselines, with over a 60% win rate.
arXiv Detail & Related papers (2025-09-19T03:07:59Z) - No Language Data Left Behind: A Comparative Study of CJK Language Datasets in the Hugging Face Ecosystem [2.1384640984303216]
We examine how cultural norms, research environments, and institutional practices shape dataset availability and quality. Our findings highlight the large-scale and often institution-driven nature of Chinese datasets, grassroots community-led development in Korean NLP, and an entertainment- and subculture-focused emphasis in Japanese collections. We conclude by discussing best practices for future dataset curation and collaboration, aiming to strengthen resource development across all three languages.
arXiv Detail & Related papers (2025-07-06T10:32:32Z) - SenWiCh: Sense-Annotation of Low-Resource Languages for WiC using Hybrid Methods [1.2091341579150698]
We release datasets of sentences containing polysemous words across ten low-resource languages. To facilitate dataset creation, the paper presents a demonstrably beneficial semi-automatic annotation method. Results highlight the importance of targeted dataset creation and evaluation for effective polysemy disambiguation.
arXiv Detail & Related papers (2025-05-29T17:48:08Z) - Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models [52.22235443948351]
High-quality multilingual training data is essential for effectively pretraining large language models (LLMs). Here, we introduce JQL, a systematic approach that efficiently curates diverse and high-quality multilingual data at scale. JQL distills LLMs' annotation capabilities into lightweight annotators based on pretrained multilingual embeddings.
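The distillation idea summarized above — a cheap annotator trained to mimic an LLM's quality judgments over frozen embeddings — can be sketched as fitting a small regression head. The code below is a hypothetical illustration of that pattern, not the JQL authors' implementation.

```python
# Distillation sketch: fit a tiny linear head on frozen multilingual
# embeddings so it mimics LLM-assigned quality scores, then use the cheap
# head to filter candidate documents at scale.

def fit_linear_head(embeddings, llm_scores, lr=0.1, epochs=200):
    """Fit w.x + b to the LLM's scores with per-sample gradient descent."""
    dim = len(embeddings[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, llm_scores):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def score(w, b, x):
    """Predict a quality score for one embedding with the trained head."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

if __name__ == "__main__":
    # Toy data: the first embedding dimension correlates with quality.
    X = [[1.0, 0.0], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]]
    y = [0.9, 0.8, 0.2, 0.1]
    w, b = fit_linear_head(X, y)
    print(score(w, b, [0.9, 0.1]) > score(w, b, [0.1, 0.9]))
```

Once fitted, the head scores millions of documents at embedding-lookup cost, which is what makes the approach viable at pretraining scale.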
arXiv Detail & Related papers (2025-05-28T11:06:54Z) - Culturally-Nuanced Story Generation for Reasoning in Low-Resource Languages: The Case of Javanese and Sundanese [12.208154616426052]
We test whether large language models (LLMs) can generate culturally nuanced narratives in Javanese and Sundanese. We compare three data creation strategies: (1) LLM-assisted stories prompted with cultural cues, (2) machine translation from Indonesian benchmarks, and (3) native-written stories. We fine-tune models on each dataset and evaluate on a human-authored test set for classification and generation.
arXiv Detail & Related papers (2025-02-18T15:14:58Z) - P-MMEval: A Parallel Multilingual Multitask Benchmark for Consistent Evaluation of LLMs [84.24644520272835]
We introduce P-MMEval, a large-scale benchmark covering effective fundamental and capability-specialized datasets. P-MMEval delivers consistent language coverage across various datasets and provides parallel samples. We conduct extensive experiments on representative multilingual model series to compare performances across models and tasks.
arXiv Detail & Related papers (2024-11-14T01:29:36Z) - Aya Dataset: An Open-Access Collection for Multilingual Instruction Tuning [49.79783940841352]
Existing datasets are almost all in English.
We work with fluent speakers of languages from around the world to collect natural instances of instructions and completions.
We create the most extensive multilingual collection to date, comprising 513 million instances through templating and translating existing datasets across 114 languages.
arXiv Detail & Related papers (2024-02-09T18:51:49Z) - NusaWrites: Constructing High-Quality Corpora for Underrepresented and
Extremely Low-Resource Languages [54.808217147579036]
We conduct a case study on Indonesian local languages.
We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets.
Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content.
arXiv Detail & Related papers (2023-09-19T14:42:33Z) - Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for
Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems [64.40789703661987]
Multi3WOZ is a novel multilingual, multi-domain, multi-parallel ToD dataset.
It is large-scale and offers culturally adapted dialogs in 4 languages.
We describe a complex bottom-up data collection process that yielded the final dataset.
arXiv Detail & Related papers (2023-07-26T08:29:42Z) - WANLI: Worker and AI Collaboration for Natural Language Inference
Dataset Creation [101.00109827301235]
We introduce a novel paradigm for dataset creation based on human and machine collaboration.
We use dataset cartography to automatically identify examples that demonstrate challenging reasoning patterns, and instruct GPT-3 to compose new examples with similar patterns.
The resulting dataset, WANLI, consists of 108,357 natural language inference (NLI) examples that present unique empirical strengths.
arXiv Detail & Related papers (2022-01-16T03:13:49Z)
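The dataset-cartography step in the WANLI entry above — automatically identifying examples with challenging reasoning patterns as seeds for GPT-3 — is typically operationalized by tracking per-example confidence across training epochs and flagging high-variability ("ambiguous") examples. The field names and thresholds below are illustrative assumptions, not the paper's exact pipeline.

```python
# Cartography-style seed selection sketch: examples whose gold-label
# confidence swings across epochs are treated as the "challenging" seeds
# that the generator is then asked to imitate.
import statistics

def cartography_select(confidence_by_epoch, top_k=2):
    """confidence_by_epoch: {example_id: [p_gold at epoch 1..N]}.
    Returns the top_k example ids with the highest confidence variability."""
    variability = {
        ex_id: statistics.pstdev(confs)
        for ex_id, confs in confidence_by_epoch.items()
    }
    return sorted(variability, key=variability.get, reverse=True)[:top_k]

if __name__ == "__main__":
    history = {
        "easy":      [0.95, 0.97, 0.98],  # consistently confident
        "ambiguous": [0.30, 0.80, 0.45],  # confidence swings across epochs
        "hard":      [0.05, 0.08, 0.06],  # consistently wrong
    }
    print(cartography_select(history, top_k=1))
```

Consistently-easy and consistently-wrong examples both show low variability, so this criterion isolates exactly the ambiguous middle band that makes useful generation seeds.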
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.