Related papers: BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains

URL: http://arxiv.org/abs/2510.25409v2
Date: Thu, 30 Oct 2025 10:48:05 GMT
Title: BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
Authors: Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan
Abstract summary: BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda. Evaluation of 29+ LLMs reveals significant domain- and language-specific performance gaps.
Score: 10.342942323713118
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid advancement of large language models (LLMs) has intensified the need for domain- and culture-specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain- and language-specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content than on Hindi content across all domains. Subdomain-level analysis shows that models perform relatively well in areas such as Cyber Law and International Finance, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.

Related papers

KOCO-BENCH: Can Large Language Models Leverage Domain Knowledge in Software Development? [58.85952408038657]
Large language models (LLMs) excel at general programming but struggle with domain-specific software development. Existing domain-specific code benchmarks cannot evaluate the effectiveness of domain specialization methods. We present KOCO-BENCH, a novel benchmark designed for evaluating domain specialization methods in real-world software development.
arXiv Detail & Related papers (2026-01-19T17:20:16Z)
CricBench: A Multilingual Benchmark for Evaluating LLMs in Cricket Analytics [1.3986052226424095]
Large Language Models (LLMs) handle the domain-specific nuances, complex variations, and multilingual schemas inherent to sports analytics. We present CricBench, a comprehensive benchmark suite for evaluating LLMs on specialized cricket data. We evaluate six state-of-the-art models, including GPT-4o, Claude 3.7 Sonnet, and open-source models, using a strict evaluation protocol.
arXiv Detail & Related papers (2025-12-26T05:59:19Z)
OmniEduBench: A Comprehensive Chinese Benchmark for Evaluating Large Language Models in Education [72.40048732210055]
We introduce OmniEduBench, a comprehensive Chinese educational benchmark. The data is meticulously divided into two core dimensions: the knowledge dimension and the cultivation dimension. The dataset features a rich variety of question formats, including 11 common exam question types.
arXiv Detail & Related papers (2025-10-30T12:16:29Z)
CorIL: Towards Enriching Indian Language to Indian Language Parallel Corpora and Machine Translation Systems [18.521673953685575]
India's linguistic landscape is one of the most diverse in the world, comprising over 120 major languages and approximately 1,600 additional languages. Despite recent progress in multilingual neural machine translation (NMT), high-quality parallel corpora for Indian languages remain scarce. In this paper, we introduce a large-scale, high-quality annotated parallel corpus covering 11 languages.
arXiv Detail & Related papers (2025-09-24T09:48:26Z)
Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering [73.73820209993515]
We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. Results show significant performance differences between the two domains.
arXiv Detail & Related papers (2025-05-22T12:27:02Z)
MILU: A Multi-task Indic Language Understanding Benchmark [7.652738829153342]
We introduce MILU, a comprehensive evaluation benchmark designed to assess Large Language Models in Indic languages. With an India-centric design, MILU incorporates material from regional and state-level examinations, covering topics such as local history, arts, festivals, and laws, alongside standard subjects like science and mathematics. Open multilingual models outperform language-specific fine-tuned models, which perform only slightly better than random baselines.
arXiv Detail & Related papers (2024-11-04T19:17:17Z)
LlamaLens: Specialized Multilingual LLM for Analyzing News and Social Media Content [9.539308087147134]
Large Language Models (LLMs) have demonstrated remarkable success as general-purpose task solvers across various fields. This study focuses on developing a specialized LLM, LlamaLens, for analyzing news and social media content in a multilingual context. We demonstrate that LlamaLens outperforms the current state-of-the-art (SOTA) on 23 testing sets, and achieves comparable performance on 8 sets.
arXiv Detail & Related papers (2024-10-20T06:37:37Z)
Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning [55.107329995417786]
Large language models (LLMs) have demonstrated impressive general understanding and generation abilities. We establish a benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets. We propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance.
arXiv Detail & Related papers (2024-10-03T16:15:04Z)
Domain Specialization as the Key to Make Large Language Models Disruptive: A Comprehensive Survey [100.24095818099522]
Large language models (LLMs) have significantly advanced the field of natural language processing (NLP). They provide a highly useful, task-agnostic foundation for a wide range of applications. However, directly applying LLMs to solve sophisticated problems in specific domains meets many hurdles.
arXiv Detail & Related papers (2023-05-30T03:00:30Z)
Domain Mastery Benchmark: An Ever-Updating Benchmark for Evaluating Holistic Domain Knowledge of Large Language Model--A Preliminary Release [13.603414598813938]
DomMa aims to test Large Language Models (LLMs) on their domain knowledge understanding. It features extensive domain coverage, large data volume, and a continually updated data set based on 112 Chinese first-level subject classifications.
arXiv Detail & Related papers (2023-04-23T15:11:49Z)
Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking [66.76141128555099]
We propose a novel cross-lingual biomedical entity linking task (XL-BEL). We first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task. We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones.
arXiv Detail & Related papers (2021-05-30T00:50:00Z)
Open Domain Generalization with Domain-Augmented Meta-Learning [83.59952915761141]
We study a novel and practical problem of Open Domain Generalization (OpenDG). We propose a Domain-Augmented Meta-Learning framework to learn open-domain generalizable representations. Experiment results on various multi-domain datasets demonstrate that the proposed Domain-Augmented Meta-Learning (DAML) outperforms prior methods for unseen domain recognition.
arXiv Detail & Related papers (2021-04-08T09:12:24Z)
DomBERT: Domain-oriented Language Model for Aspect-based Sentiment Analysis [71.40586258509394]
We propose DomBERT, an extension of BERT to learn from both in-domain corpus and relevant domain corpora. Experiments are conducted on an assortment of tasks in aspect-based sentiment analysis, demonstrating promising results.
arXiv Detail & Related papers (2020-04-28T21:07:32Z)
