INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects
- URL: http://arxiv.org/abs/2601.10388v1
- Date: Thu, 15 Jan 2026 13:40:27 GMT
- Title: INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects
- Authors: Tarun Sharma, Manikandan Ravikiran, Sourava Kumar Behera, Pramit Bhattacharya, Arnab Bhattacharya, Rohit Saluja,
- Abstract summary: In India, despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented.<n>We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia.
- Score: 10.663878830823043
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served especially in Indian scenarios. In India, the issue is particularly important: despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets exist which contain standard Hindi and Odia languages, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs like GPT-4o and Gemini 2.5 perform poorly on the classification task. While fine-tuned transformer based models pretrained on Indian languages substantially improve performance e.g., improving F1 from 19.6\% to 89.8\% on dialect classification. For dialect to language translation, we find that hybrid AI model achieves highest BLEU score of 61.32 compared to the baseline score of 23.36. Interestingly, due to complexity in generating dialect sentences, we observe that for language to dialect translation the ``rule-based followed by AI" approach achieves best BLEU score of 48.44 compared to the baseline score of 27.59. INDIC-DIALECT thus is a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.
Related papers
- DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation [111.94720088481614]
Can multimodal generative models effectively produce content given dialectal textual input?<n>We construct a new large-scale benchmark spanning six common English dialects.<n>We design a general encoder-based mitigation strategy for multimodal generative models.
arXiv Detail & Related papers (2025-10-16T17:56:55Z) - L3Cube-IndicHeadline-ID: A Dataset for Headline Identification and Semantic Evaluation in Low-Resource Indian Languages [2.584263027095689]
L3Cube-IndicHeadline-ID is a curated dataset spanning ten low-resource Indic languages.<n>Each language includes 20,000 news articles paired with four headline variants.<n>The task requires selecting the correct headline from the options using article-headline similarity.<n>We benchmark several sentence transformers, including multilingual and language-specific models, using cosine similarity.
arXiv Detail & Related papers (2025-09-02T16:54:30Z) - Bridging Dialects: Translating Standard Bangla to Regional Variants Using Neural Models [1.472830326343432]
The work is motivated by the need to preserve linguistic diversity and improve communication among dialect speakers.<n>The models were fine-tuned using the "Vashantor" dataset, containing 32,500 sentences across various dialects.<n> BanglaT5 demonstrated superior performance with a CER of 12.3% and WER of 15.7%, highlighting its effectiveness in capturing dialectal nuances.
arXiv Detail & Related papers (2025-01-10T06:50:51Z) - LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems [16.143694951047024]
We create a benchmark, LAHAJA, which contains read and extempore speech on a diverse set of topics and use cases.
We evaluate existing open-source and commercial models on LAHAJA and find their performance to be poor.
We train models using different datasets and find that our model trained on multilingual data with good speaker diversity outperforms existing models by a significant margin.
arXiv Detail & Related papers (2024-08-21T08:51:00Z) - Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects [72.18753241750964]
Yorub'a is an African language with roughly 47 million speakers.
Recent efforts to develop NLP technologies for African languages have focused on their standard dialects.
We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus.
arXiv Detail & Related papers (2024-06-27T22:38:04Z) - DIALECTBENCH: A NLP Benchmark for Dialects, Varieties, and Closely-Related Languages [49.38663048447942]
We propose DIALECTBENCH, the first-ever large-scale benchmark for NLP on varieties.
This allows for a comprehensive evaluation of NLP system performance on different language varieties.
We provide substantial evidence of performance disparities between standard and non-standard language varieties.
arXiv Detail & Related papers (2024-03-16T20:18:36Z) - What Do Dialect Speakers Want? A Survey of Attitudes Towards Language Technology for German Dialects [60.8361859783634]
We survey speakers of dialects and regional languages related to German.
We find that respondents are especially in favour of potential NLP tools that work with dialectal input.
arXiv Detail & Related papers (2024-02-19T09:15:28Z) - Multi-VALUE: A Framework for Cross-Dialectal English NLP [49.55176102659081]
Multi- Dialect is a controllable rule-based translation system spanning 50 English dialects.
Stress tests reveal significant performance disparities for leading models on non-standard dialects.
We partner with native speakers of Chicano and Indian English to release new gold-standard variants of the popular CoQA task.
arXiv Detail & Related papers (2022-12-15T18:17:01Z) - CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z) - MATra: A Multilingual Attentive Transliteration System for Indian
Scripts [0.0]
This paper shows a model that can perform transliteration between any pair among the following five languages.
The model beats the state-of-the-art (for all pairs among the five mentioned languages) and achieves a top-1 accuracy score of 80.7%.
arXiv Detail & Related papers (2022-08-23T08:14:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.