IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models
- URL: http://arxiv.org/abs/2602.16467v1
- Date: Wed, 18 Feb 2026 13:55:57 GMT
- Title: IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models
- Authors: Saurabh Bharti, Gaurav Azad, Abhinaw Jagtap, Nachiket Tapas
- Abstract summary: This paper introduces IndicEval, a scalable benchmarking platform to assess the performance of large language models (LLMs). IndicEval grounds evaluation in real examination standards, enabling realistic measurement of reasoning, domain knowledge, and bilingual adaptability. Experiments conducted on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B reveal three major findings.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid advancement of large language models (LLMs) necessitates evaluation frameworks that reflect real-world academic rigor and multilingual complexity. This paper introduces IndicEval, a scalable benchmarking platform designed to assess LLM performance using authentic high-stakes examination questions from UPSC, JEE, and NEET across STEM and humanities domains in both English and Hindi. Unlike synthetic benchmarks, IndicEval grounds evaluation in real examination standards, enabling realistic measurement of reasoning, domain knowledge, and bilingual adaptability. The framework automates assessment using Zero-Shot, Few-Shot, and Chain-of-Thought (CoT) prompting strategies and supports modular integration of new models and languages. Experiments conducted on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B reveal three major findings. First, CoT prompting consistently improves reasoning accuracy, with substantial gains across subjects and languages. Second, significant cross-model performance disparities persist, particularly in high-complexity examinations. Third, multilingual degradation remains a critical challenge, with marked accuracy drops in Hindi compared to English, especially under Zero-Shot conditions. These results highlight persistent gaps in bilingual reasoning and domain transfer. Overall, IndicEval provides a practice-oriented, extensible foundation for rigorous, equitable evaluation of LLMs in multilingual educational settings and offers actionable insights for improving reasoning robustness and language adaptability.
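As a concrete picture of the harness the abstract describes, the sketch below scores one model under the three prompting strategies. It is a minimal sketch only: the prompt templates, the `model.generate` client, and the `Question` fields are assumptions for illustration, not IndicEval's released code.

```python
# Illustrative sketch only: templates, the model client, and the Question
# fields are assumptions, not IndicEval's actual implementation.
from dataclasses import dataclass

@dataclass
class Question:
    text: str           # question stem, in English or Hindi
    options: list[str]  # multiple-choice options
    answer: str         # gold option letter, e.g. "B"

TEMPLATES = {
    "zero_shot": "Answer with the option letter only.\n\n{q}",
    "few_shot": "{exemplars}\n\nAnswer with the option letter only.\n\n{q}",
    "cot": ("Think step by step, then end with a line of the form "
            "'Answer: <letter>'.\n\n{q}"),
}

def format_question(q: Question) -> str:
    opts = "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(q.options))
    return f"{q.text}\n{opts}"

def accuracy(model, questions: list[Question], strategy: str,
             exemplars: str = "") -> float:
    """Score one model on one question set under one prompting strategy."""
    correct = 0
    for q in questions:
        prompt = TEMPLATES[strategy].format(q=format_question(q),
                                            exemplars=exemplars)
        reply = model.generate(prompt)  # assumed LLM-client method
        last = (reply.strip().splitlines() or [""])[-1]
        pred = last.removeprefix("Answer:").strip()[:1].upper()
        correct += pred == q.answer
    return correct / len(questions)
```

Running `accuracy` with `strategy` set to "zero_shot", "few_shot", and "cot" over the same English and Hindi question sets yields the kind of per-strategy, per-language comparison the paper reports.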
Related papers
- A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages [48.68444770923683]
We present the first comprehensive study of multilingual Chain-of-Thought (CoT) reasoning. We measure language compliance, answer accuracy, and answer consistency when LRMs are prompt-hacked to think in a target language. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language (a toy compliance check is sketched after this entry).
arXiv Detail & Related papers (2025-10-10T17:06:50Z)
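As a loose illustration of the language-compliance measurement in the study above, the heuristic below treats a reasoning trace as Hindi-compliant when most of its letters fall in the Devanagari block. Both the script heuristic and the 0.9 threshold are assumptions; a trained language identifier is the more realistic tool.

```python
# Toy language-compliance check: what fraction of a reasoning trace is
# written in the target script? The Devanagari heuristic and the 0.9
# threshold are illustrative assumptions.
def devanagari_ratio(text: str) -> float:
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0900" <= c <= "\u097F" for c in letters) / len(letters)

def is_compliant(trace: str, threshold: float = 0.9) -> bool:
    """Call a trace Hindi-compliant if >=90% of its letters are Devanagari."""
    return devanagari_ratio(trace) >= threshold
```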
- Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning [85.7304930030649]
We propose M-Thinker, which is trained with a Language Consistency reward and a Cross-lingual Thinking Alignment reward. M-Thinker achieves nearly 100% language consistency and superior performance on two multilingual benchmarks (a toy reward sketch follows this entry).
arXiv Detail & Related papers (2025-10-08T17:55:02Z)
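The reward shaping named in the M-Thinker entry above can be pictured as a scalar reward that blends answer correctness with a language check; for Hindi, the `devanagari_ratio` heuristic from the previous sketch could serve as the `same_language` callable. The blend weight and overall shape are assumptions; the paper's actual Language Consistency and Cross-lingual Thinking Alignment rewards are more involved.

```python
# Toy stand-in for a language-consistency reward in RL fine-tuning: blend
# task success with a check that the reasoning trace stays in the user's
# language. The 0.5 weighting and the `same_language` callable are
# illustrative assumptions, not M-Thinker's actual reward design.
from typing import Callable

def consistency_reward(trace: str, answer_ok: bool,
                       same_language: Callable[[str], bool],
                       lam: float = 0.5) -> float:
    task = 1.0 if answer_ok else 0.0
    lang = 1.0 if same_language(trace) else 0.0
    return (1.0 - lam) * task + lam * lang
```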
- Does Language Model Understand Language? [1.0450509067356148]
Despite advances in natural language generation and understanding, LMs still struggle with fine-grained linguistic phenomena. In this study, we conduct an evaluation of SOTA language models across challenging contexts in both English and Bengali. Our findings highlight Compound-Beta as the most balanced model, consistently achieving high correlations and low MAEs across diverse language conditions.
arXiv Detail & Related papers (2025-09-15T21:09:09Z)
- mSCoRe: A Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning [74.97363626515236]
We propose mSCoRe, a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning. Our benchmark incorporates three key components designed to systematically evaluate LLMs' reasoning capabilities. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense.
arXiv Detail & Related papers (2025-08-13T18:59:02Z)
- CogBench: A Large Language Model Benchmark for Multilingual Speech-Based Cognitive Impairment Assessment [23.1730341293796]
We propose CogBench, the first benchmark designed to evaluate the cross-lingual and cross-site generalizability of large language models for speech-based cognitive impairment assessment. Our results show that conventional deep learning models degrade substantially when transferred across domains. Our findings offer a critical step toward building clinically useful and linguistically robust speech-based cognitive assessment tools.
arXiv Detail & Related papers (2025-08-05T12:06:16Z)
- Multilingual Self-Taught Faithfulness Evaluators [11.200203292660758]
Self-Taught Evaluators for Multilingual Faithfulness is a framework that learns exclusively from synthetic multilingual summarization data. Our framework shows improvements over existing baselines, including state-of-the-art English evaluators and machine-translation-based approaches.
arXiv Detail & Related papers (2025-07-28T12:01:59Z)
- Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning [39.03934159726098]
M2A is a novel method that combines multi-scale multilingual alignment with language-consistency rewards on machine-translated questions. We introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark, together with reasoning traces in five languages. Our results show that M2A significantly enhances multilingual reasoning fidelity in both mathematical and factual reasoning tasks.
arXiv Detail & Related papers (2025-07-07T19:04:36Z)
- CPG-EVAL: A Multi-Tiered Benchmark for Evaluating the Chinese Pedagogical Grammar Competence of Large Language Models [6.0020878662404975]
This paper introduces the first benchmark specifically designed to evaluate LLMs' knowledge of pedagogical grammar within the context of foreign language instruction. The benchmark comprises five tasks designed to assess grammar recognition, fine-grained grammatical distinction, categorical discrimination, and resistance to linguistic interference.
arXiv Detail & Related papers (2025-04-17T18:01:50Z)
- Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark [12.729687989535359]
Evaluating Large Language Models (LLMs) in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts.
We tackle this challenge by introducing a structured benchmark using the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy.
arXiv Detail & Related papers (2024-06-25T13:20:08Z)
- Analyzing and Adapting Large Language Models for Few-Shot Multilingual NLU: Are We There Yet? [82.02076369811402]
Supervised fine-tuning (SFT), supervised instruction tuning (SIT), and in-context learning (ICL) are three alternative, de facto standard approaches to few-shot learning.
We present an extensive and systematic comparison of the three approaches, testing them on 6 high- and low-resource languages, three different NLU tasks, and a myriad of language and domain setups.
Our observations show that supervised instruction tuning has the best trade-off between performance and resource requirements (a toy ICL prompt is sketched after this entry).
arXiv Detail & Related papers (2024-03-04T10:48:13Z)
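For the entry above, the ICL arm of the comparison can be pictured as nothing more than prompt construction: labeled examples are prepended to the test input and no weights are updated. The intent-classification task and field labels below are illustrative assumptions.

```python
# Minimal in-context learning (ICL) setup: k labeled examples are prepended
# to the test input; no weights are updated. The intent-classification task
# and labels are illustrative, not the paper's datasets.
def icl_prompt(shots: list[tuple[str, str]], query: str) -> str:
    demo = "\n".join(f"Utterance: {x}\nIntent: {y}" for x, y in shots)
    return f"{demo}\nUtterance: {query}\nIntent:"

prompt = icl_prompt(
    [("Book a table for two", "restaurant_booking"),
     ("Play some jazz", "music_play")],
    "Wake me up at 7 am",
)
# The model is expected to continue with something like "alarm_set";
# SFT and SIT would instead update weights on the same few examples.
```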
- LLaMA Beyond English: An Empirical Study on Language Capability Transfer [49.298360366468934]
We focus on how to effectively transfer the capabilities of language generation and instruction following to a non-English language.
We analyze the impact of key factors such as vocabulary extension, further pretraining, and instruction tuning on transfer (a minimal vocabulary-extension sketch follows this entry).
We employ four widely used standardized testing benchmarks: C-Eval, MMLU, AGI-Eval, and GAOKAO-Bench.
arXiv Detail & Related papers (2024-01-02T06:29:02Z)
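Of the factors the entry above analyzes, vocabulary extension is the most mechanical, and a minimal sketch fits in a few lines: new target-language tokens are added to the tokenizer and the embedding matrix is resized before further pretraining. The model name and token list are illustrative assumptions.

```python
# Minimal sketch of vocabulary extension: add target-language tokens to the
# tokenizer and grow the embedding matrix before further pretraining. The
# model name and token list are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"  # any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

added = tokenizer.add_tokens(["你好", "世界"])  # pieces missing from the vocab
if added:
    # new embedding rows are randomly initialized and must be learned
    # during further pretraining on target-language text
    model.resize_token_embeddings(len(tokenizer))
```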
- AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples [51.048234591165155]
We present AM2iCo, Adversarial and Multilingual Meaning in Context.
It aims to faithfully assess the ability of state-of-the-art (SotA) representation models to understand the identity of word meaning in cross-lingual contexts (a rough word-in-context check is sketched after this entry).
Results reveal that current SotA pretrained encoders substantially lag behind human performance.
arXiv Detail & Related papers (2021-04-17T20:23:45Z)
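The word-in-context judgment AM2iCo targets can be roughed out with any multilingual encoder: pool the target word's contextual vectors in each context and threshold their cosine similarity. The model choice, pooling, and threshold below are illustrative assumptions, not AM2iCo's evaluation protocol.

```python
# Rough word-in-context check with a multilingual encoder: mean-pool the
# target word's sub-token vectors in each context and compare by cosine
# similarity. Model, pooling, and threshold are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
enc = AutoModel.from_pretrained("xlm-roberta-base")

def word_vector(context: str, word: str) -> torch.Tensor:
    start = context.index(word)  # first occurrence of the target word
    inputs = tok(context, return_offsets_mapping=True, return_tensors="pt")
    offsets = inputs.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state[0]
    # keep sub-tokens whose character span overlaps the target word's span
    keep = [s < start + len(word) and e > start for s, e in offsets]
    return hidden[torch.tensor(keep)].mean(dim=0)

def same_meaning(ctx_a: str, word_a: str, ctx_b: str, word_b: str,
                 threshold: float = 0.6) -> bool:
    sim = torch.cosine_similarity(word_vector(ctx_a, word_a),
                                  word_vector(ctx_b, word_b), dim=0)
    return sim.item() >= threshold
```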