MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs
- URL: http://arxiv.org/abs/2507.17476v1
- Date: Wed, 23 Jul 2025 12:56:31 GMT
- Title: MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs
- Authors: Alexander R. Fabbri, Diego Mares, Jorge Flores, Meher Mankikar, Ernesto Hernandez, Dean Lee, Bing Liu, Chen Xing,
- Abstract summary: We introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark for assessing Large Language Models (LLMs) on native, culturally grounded reasoning questions written by native speakers of French, Spanish, and Chinese. MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions, produced manually by native speakers fluent in English.
- Score: 56.87573414161703
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although recent Large Language Models (LLMs) have shown rapid improvement on English reasoning benchmarks, evaluation of their multilingual reasoning capability across diverse languages and cultural contexts remains limited. Existing multilingual reasoning benchmarks are typically constructed by translating English ones, biasing them towards reasoning problems grounded in English-language contexts and cultures. In this work, we introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark designed to assess LLMs on more than 1,000 native, linguistically and culturally grounded reasoning questions written by native speakers of French, Spanish, and Chinese. MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions, produced manually by native speakers fluent in English. This set of English equivalents enables a direct comparison of LLM reasoning capacity on the same questions in English versus the original languages. We systematically evaluate 14 current leading LLMs, covering most major LLM families, on MultiNRC and its English equivalent set. The results show that (1) current LLMs are still not good at native multilingual reasoning, with none scoring above 50% on MultiNRC; (2) LLMs exhibit distinct strengths and weaknesses in handling linguistic, cultural, and logical reasoning tasks; and (3) most models perform substantially better at math reasoning in English than in the original languages (by roughly 10 points), indicating persistent challenges with culturally grounded knowledge.
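To make the English-vs-native comparison concrete, here is a minimal sketch of the kind of scoring the English-equivalent set enables: per-category, per-language accuracy plus the gap that underlies the roughly 10-point math finding. The record format and field names (`category`, `language`, `correct`) are assumptions for illustration, not the authors' released harness.

```python
from collections import defaultdict

def score_by_category(records):
    """Accuracy per (category, language) pair.

    records: iterable of dicts with keys 'category', 'language',
    and 'correct' (bool) -- a hypothetical grading-output format.
    """
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        key = (r["category"], r["language"])
        totals[key] += 1
        hits[key] += int(r["correct"])
    return {k: hits[k] / totals[k] for k in totals}

def english_gap(scores, category):
    """Gap between the English-equivalent set and the average of the
    native languages for one category (positive = better in English)."""
    native = [v for (cat, lang), v in scores.items()
              if cat == category and lang != "en"]
    english = scores.get((category, "en"))
    if english is None or not native:
        return None
    return english - sum(native) / len(native)
```

Run over per-question grades for both question sets, a positive `english_gap(scores, "math")` would reproduce the direction of finding (3).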
Related papers
- Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning [38.52080213211765]
We introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages. We propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity.
arXiv Detail & Related papers (2025-07-07T19:04:36Z)
- XToM: Exploring the Multilingual Theory of Mind for Large Language Models [57.9821865189077]
Existing evaluations of Theory of Mind (ToM) in LLMs are largely limited to English. We present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages. Our findings expose limitations in LLMs' ability to replicate human-like mentalizing across linguistic contexts.
arXiv Detail & Related papers (2025-06-03T05:23:25Z)
- When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners [111.50503126693444]
We show that language-specific ablation consistently boosts multilingual reasoning performance. Compared to post-training, our training-free ablation achieves comparable or superior results with minimal computational overhead.
arXiv Detail & Related papers (2025-05-21T08:35:05Z)
- Crosslingual Reasoning through Test-Time Scaling [51.55526326294275]
We find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages. While English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. We observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English.
arXiv Detail & Related papers (2025-05-08T16:50:06Z)
- PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts [79.84059473102778]
PolyMath is a multilingual mathematical reasoning benchmark covering 18 languages and 4 easy-to-hard difficulty levels. Our benchmark ensures difficulty comprehensiveness, language diversity, and high-quality translation.
arXiv Detail & Related papers (2025-04-25T15:39:04Z)
- Evaluating Large Language Models with Tests of Spanish as a Foreign Language: Pass or Fail? [2.9630910534509924]
We evaluate the performance of state-of-the-art LLMs on a recently released benchmark whose questions resemble those of Spanish exams for foreign students.
Results show that LLMs perform well at understanding Spanish but are still far from achieving the level of a native speaker in terms of grammatical competence.
arXiv Detail & Related papers (2024-09-08T11:30:03Z)
- Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models [79.46179534911019]
Large language models (LLMs) have demonstrated multilingual capabilities, yet they are mostly English-centric due to imbalanced training corpora. We extend the evaluation to real-world user queries and non-English-centric LLMs, offering a broader examination of multilingual performance.
arXiv Detail & Related papers (2024-03-15T12:47:39Z)
- Breaking the Language Barrier: Improving Cross-Lingual Reasoning with Structured Self-Attention [18.439771003766026]
We study whether multilingual language models (MultiLMs) can transfer logical reasoning abilities to other languages when they are fine-tuned for reasoning in a different language.
We demonstrate that although MultiLMs can transfer reasoning ability across languages in a monolingual setting, they struggle to transfer reasoning abilities in a code-switched setting.
Following this observation, we propose a novel attention mechanism that uses a dedicated set of parameters to encourage cross-lingual attention in code-switched sequences (see the sketch after this list).
arXiv Detail & Related papers (2023-10-23T18:06:38Z)
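The structured self-attention entry above lends itself to a compact illustration. The following PyTorch sketch is a hedged reconstruction of the general idea, not the paper's actual mechanism: standard multi-head attention plus a dedicated learned per-head bias added to the attention logits of query-key pairs whose language tags differ, one simple way to encourage cross-lingual attention in code-switched sequences. The class and argument names (`CrossLingualAttention`, `lang_ids`, `cross_lingual_bias`) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLingualAttention(nn.Module):
    """Sketch: multi-head self-attention with a dedicated learned bias
    applied to cross-language token pairs (hypothetical parameterization)."""

    def __init__(self, dim, num_heads):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)
        # The "dedicated set of parameters": one learnable logit bias per
        # head, applied only where query and key languages differ.
        self.cross_lingual_bias = nn.Parameter(torch.zeros(num_heads))

    def forward(self, x, lang_ids):
        # x: (batch, seq, dim); lang_ids: (batch, seq) integer language tags
        b, s, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        shape = (b, s, self.num_heads, self.head_dim)
        q, k, v = (t.view(shape).transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        # Boolean mask of cross-language query-key pairs: (batch, seq, seq)
        cross = (lang_ids.unsqueeze(2) != lang_ids.unsqueeze(1)).float()
        scores = scores + self.cross_lingual_bias.view(1, -1, 1, 1) * cross.unsqueeze(1)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, s, d)
        return self.out(out)
```

Initializing the bias at zero makes the module behave like vanilla attention at the start of fine-tuning; training can then push the bias positive on heads where attending across the language boundary helps.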