UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop
- URL: http://arxiv.org/abs/2601.21000v1
- Date: Wed, 28 Jan 2026 19:49:17 GMT
- Title: UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop
- Authors: Muhammad Ali Shafique, Areej Mehboob, Layba Fiaz, Muhammad Usman Qadeer, Hamza Farooq
- Abstract summary: We propose a contextually ensembled translation framework with human-in-the-loop validation to develop Urdu reasoning benchmarks. Using this framework, we translate widely adopted reasoning and question-answering benchmarks, including MGSM, MATH-500, CommonSenseQA, and OpenBookQA, into Urdu. Our analysis reveals performance differences across (1) four datasets, (2) five task difficulty levels, (3) diverse model architectures, (4) multiple model scaling settings, and (5) language consistency tests.
- Score: 0.17126708168238125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models (LLMs) have led to strong reasoning capabilities; however, evaluating such models in low-resource languages remains challenging due to the lack of standardized benchmarks. In particular, Urdu reasoning evaluation has been limited by the sensitivity of machine translation and an emphasis on general language tasks rather than reasoning benchmarks. In this paper, we propose a contextually ensembled translation framework with human-in-the-loop validation that leverages multiple translation systems to develop Urdu reasoning benchmarks while preserving contextual and structural integrity. Using this framework, we translate widely adopted reasoning and question-answering benchmarks, including MGSM, MATH-500, CommonSenseQA, and OpenBookQA, into Urdu, collectively referred to as UrduBench, and conduct a comprehensive evaluation of both reasoning-oriented and instruction-tuned LLMs across multiple prompting strategies. Our analysis reveals performance differences across (1) four datasets, (2) five task difficulty levels, (3) diverse model architectures, (4) multiple model scaling settings, and (5) language consistency tests. We find that multi-step and symbolic reasoning tasks pose significant challenges in Urdu, and that stable language alignment is a critical prerequisite for robust reasoning. Overall, our work establishes a scalable methodology for standardized reasoning evaluation in Urdu and provides empirical insights into multilingual reasoning failures. This experimental setup is also broadly applicable to other low-resource languages. The code and datasets will be publicly released.
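The paper does not include its implementation in this abstract, so the following is a minimal, hypothetical Python sketch of what one step of a contextually ensembled translation pipeline with human-in-the-loop routing might look like. The function names, the surface-similarity measure, and the agreement threshold are illustrative assumptions, not the authors' code.

```python
# Hypothetical sketch only: the paper describes a contextually ensembled
# translation framework with human-in-the-loop validation but does not
# publish this code. Names, the similarity measure, and the threshold
# below are illustrative assumptions.
from difflib import SequenceMatcher
from typing import Callable

def surface_agreement(a: str, b: str) -> float:
    """Crude surface-level similarity between two candidate translations."""
    return SequenceMatcher(None, a, b).ratio()

def ensemble_translate(
    source: str,
    translators: list[Callable[[str], str]],
    review_threshold: float = 0.85,
) -> tuple[str, bool]:
    """Translate with several systems and flag low-agreement items.

    Returns (best_candidate, needs_human_review): the candidate that
    agrees most with the other systems wins; if even that agreement is
    weak, the item is routed to a human validator.
    """
    candidates = [translate(source) for translate in translators]
    scores = []
    for i, cand in enumerate(candidates):
        # Score each candidate by its mean agreement with the others,
        # so the most "central" translation wins.
        others = [
            surface_agreement(cand, candidates[j])
            for j in range(len(candidates))
            if j != i
        ]
        scores.append(sum(others) / max(len(others), 1))
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores[best] < review_threshold
```

Under this sketch, flagged items would enter a human validation queue while high-agreement items pass through automatically, which is one plausible way to preserve contextual and structural integrity while keeping the methodology scalable.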
Related papers
- IndicEval: A Bilingual Indian Educational Evaluation Framework for Large Language Models [0.0]
This paper introduces IndicEval, a scalable benchmarking platform to assess large language model (LLM) performance. IndicEval grounds evaluation in real examination standards, enabling realistic measurement of reasoning, domain knowledge, and bilingual adaptability. Experiments conducted on Gemini 2.0 Flash, GPT-4, Claude, and LLaMA 3-70B reveal three major findings.
arXiv Detail & Related papers (2026-02-18T13:55:57Z)
- mSCoRe: a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning [74.97363626515236]
We propose a Multilingual and Scalable Benchmark for Skill-based Commonsense Reasoning (mSCoRe). Our benchmark incorporates three key components that are designed to systematically evaluate LLMs' reasoning capabilities. Our results reveal the limitations of such reasoning-reinforced models when confronted with nuanced multilingual general and cultural commonsense.
arXiv Detail & Related papers (2025-08-13T18:59:02Z)
- MMATH: A Multilingual Benchmark for Mathematical Reasoning [94.05289799605957]
We introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. We observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models.
arXiv Detail & Related papers (2025-05-25T12:47:39Z)
- Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time. Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z)
- ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts [8.181151553582488]
ScholarBench is a benchmark for evaluating the academic reasoning ability of large language models (LLMs). The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543.
arXiv Detail & Related papers (2025-05-22T11:59:06Z)
- PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [85.78821098963607]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z)
- XIFBench: Evaluating Large Language Models on Multilingual Instruction Following [59.549015333755186]
Large Language Models (LLMs) have demonstrated remarkable instruction-following capabilities across various applications. Existing evaluations lack fine-grained constraint analysis across diverse linguistic contexts. We introduce XIFBench, a comprehensive benchmark for evaluating the multilingual instruction-following abilities of LLMs.
arXiv Detail & Related papers (2025-03-10T17:07:52Z)
- Multilingual European Language Models: Benchmarking Approaches and Challenges [2.413212225810367]
Generative large language models (LLMs) can solve different tasks through chat interaction. This paper analyses the benefits and limitations of current evaluation datasets, focusing on multilingual European benchmarks. We discuss potential solutions to enhance translation quality and reduce cultural biases, including human-in-the-loop verification and iterative translation ranking.
arXiv Detail & Related papers (2025-02-18T14:32:17Z)
- ZNO-Eval: Benchmarking reasoning capabilities of large language models in Ukrainian [0.0]
This paper presents the ZNO-Eval benchmark, based on real exam tasks from Ukraine's standardized educational testing system. It paves the way toward a thorough analysis of reasoning capabilities across different domains and complexities. It also evaluates several well-known language models, such as GPT-3.5-Turbo, GPT-4o, GPT-4-Turbo, Mistral Large, Claude 3 Opus, and Gemini-1.5 Pro.
arXiv Detail & Related papers (2025-01-12T04:49:06Z)
- LinguaLIFT: An Effective Two-stage Instruction Tuning Framework for Low-Resource Language Reasoning [28.288949710191158]
Large language models (LLMs) have exhibited impressive multilingual reasoning capabilities, driven by extensive multilingual pre-training corpora and instruction fine-tuning data. A performance gap exists between high- and low-resource language reasoning tasks due to the language imbalance in the pre-training corpus. We propose LinguaLIFT, a two-stage instruction tuning framework for advancing low-resource language reasoning.
arXiv Detail & Related papers (2024-12-17T03:03:17Z)
- CUGE: A Chinese Language Understanding and Generation Evaluation Benchmark [144.05723617401674]
General-purpose language intelligence evaluation has been a longstanding goal for natural language processing.
We argue that for general-purpose language intelligence evaluation, the benchmark itself needs to be comprehensive and systematic.
We propose CUGE, a Chinese Language Understanding and Generation Evaluation benchmark with the following features.
arXiv Detail & Related papers (2021-12-27T11:08:58Z)