Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
- URL: http://arxiv.org/abs/2510.27269v1
- Date: Fri, 31 Oct 2025 08:17:59 GMT
- Title: Why Do Multilingual Reasoning Gaps Emerge in Reasoning Language Models?
- Authors: Deokhyung Kang, Seonjeong Hwang, Daehui Kim, Hyounghun Kim, Gary Geunbae Lee
- Abstract summary: Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still suffer from a multilingual reasoning gap. This paper shows that the multilingual reasoning gap largely stems from failures in language understanding. We propose Selective Translation, a simple yet effective strategy that translates the multilingual input into English only when an understanding failure is detected.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning language models (RLMs) achieve strong performance on complex reasoning tasks, yet they still suffer from a multilingual reasoning gap, performing better in high-resource languages than in low-resource ones. While recent efforts have reduced this gap, its underlying causes remain largely unexplored. In this paper, we address this by showing that the multilingual reasoning gap largely stems from failures in language understanding: the model's inability to represent the meaning of the multilingual input in the dominant language (i.e., English) within its reasoning trace. This motivates us to examine whether understanding failures can be detected, as this ability could help mitigate the multilingual reasoning gap. To this end, we evaluate a range of detection methods and find that understanding failures can indeed be identified, with supervised approaches performing best. Building on this, we propose Selective Translation, a simple yet effective strategy that translates the multilingual input into English only when an understanding failure is detected. Experimental results show that Selective Translation bridges the multilingual reasoning gap, achieving near full-translation performance while using translation for only about 20% of inputs. Together, our work demonstrates that understanding failures are the primary cause of the multilingual reasoning gap and that they can be detected and selectively mitigated, providing key insight into its origin and a promising path toward more equitable multilingual reasoning. Our code and data are publicly available at https://github.com/deokhk/RLM_analysis.
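To make the proposed pipeline concrete, the sketch below shows one way Selective Translation could be wired together: an understanding-failure detector (the paper finds supervised detectors work best) gates a translation step, so the reasoning model receives an English translation only when a failure is predicted. The helper names (detect_understanding_failure, translate_to_english, generate_reasoning) are hypothetical placeholders assumed for illustration; the authors' actual implementation is in the linked repository.

```python
# A minimal sketch of Selective Translation as described in the abstract, not the
# authors' released code. The three callables are hypothetical stand-ins for a
# supervised understanding-failure detector, a machine-translation system, and the
# reasoning language model itself.
from typing import Callable


def selective_translation_answer(
    question: str,
    detect_understanding_failure: Callable[[str], bool],
    translate_to_english: Callable[[str], str],
    generate_reasoning: Callable[[str], str],
) -> str:
    """Answer a (possibly non-English) question, translating only when needed."""
    if detect_understanding_failure(question):
        # A failure is predicted: reason over the English translation instead,
        # since the gap largely stems from misrepresenting the input's meaning.
        question = translate_to_english(question)
    # Otherwise keep the original input and skip the translation step entirely
    # (the paper reports translation is triggered for only about 20% of inputs).
    return generate_reasoning(question)


# Toy usage with stub components, for illustration only.
answer = selective_translation_answer(
    "Combien font 17 fois 23 ?",
    detect_understanding_failure=lambda q: True,   # e.g. a supervised classifier
    translate_to_english=lambda q: "What is 17 times 23?",
    generate_reasoning=lambda q: f"[reasoning trace over: {q}]",
)
print(answer)
```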
Related papers
- Align to the Pivot: Dual Alignment with Self-Feedback for Multilingual Math Reasoning (arXiv, 2026-01-25)
  We present Pivot-Aligned Self-Feedback Multilingual Reasoning (PASMR). This approach designates the model's primary language as the pivot language and establishes a cross-lingual self-feedback mechanism without relying on external correct answers or reward models.
- Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners (arXiv, 2026-01-06)
  Large reasoning models (LRMs) achieve strong performance on mathematical reasoning tasks, and they often arrive at the correct answer before completing their textual reasoning steps. This phenomenon has been explored in English, but its multilingual behavior remains largely unknown.
- Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning (arXiv, 2025-07-07)
  M2A is a novel method that combines multi-scale multilingual alignment with language-consistency rewards on machine-translated questions. We introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark, together with reasoning traces in five languages. Our results show that M2A significantly enhances multilingual reasoning fidelity in both mathematical and factual reasoning tasks.
- The Translation Barrier Hypothesis: Multilingual Generation with Large Language Models Suffers from Implicit Translation Failure (arXiv, 2025-06-28)
  We show that failure of the translation stage, despite task-solving success, is an important culprit for the observed low quality of final outputs. We quantify the extent to which either stage is responsible for final failure on a word translation task across 108 language pairs.
- Cross-lingual Collapse: How Language-Centric Foundation Models Shape Reasoning in Large Language Models (arXiv, 2025-06-06)
  We identify Cross-lingual Collapse, a systematic drift in which a multilingual language model reverts to its dominant pre-training language. Our experiments reveal three key findings, including: (i) GRPO rapidly amplifies pre-training language imbalances, leading to the erosion of low-resource languages within just a few hundred updates; and (ii) a language-consistency reward mitigates this drift, but at the cost of an accuracy drop of almost 5-10 percentage points (an illustrative sketch of such a reward appears after this list).
- When Models Reason in Your Language: Controlling Thinking Language Comes at the Cost of Accuracy (arXiv, 2025-05-28)
  Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. Reasoning in the user's own language is as important as answer accuracy for real-world applications, because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages.
- MMATH: A Multilingual Benchmark for Mathematical Reasoning (arXiv, 2025-05-25)
  We introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. We observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models.
- Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? (arXiv, 2025-05-23)
  Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time. Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
- When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners (arXiv, 2025-05-21)
  We show that language-specific ablation consistently boosts multilingual reasoning performance. Compared to post-training, our training-free ablation achieves comparable or superior results with minimal computational overhead.
- Crosslingual Reasoning through Test-Time Scaling (arXiv, 2025-05-08)
  We find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages. While English-centric RLMs' CoTs are naturally predominantly English, the models consistently follow a quote-and-think pattern to reason about quoted non-English inputs. We also observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even in English.
- Could Thinking Multilingually Empower LLM Reasoning? (arXiv, 2025-04-16)
  We explore the upper bound of harnessing multilingualism in reasoning tasks. We find that multilingual reasoning promises significantly (by nearly 10 Acc@$k$ points) and robustly (with tolerance for variations in translation quality and language choice) higher upper bounds than English-only reasoning.
- Lost in Multilinguality: Dissecting Cross-lingual Factual Inconsistency in Transformer Language Models (arXiv, 2025-04-05)
  We find that multilingual language models struggle to provide consistent factual responses to semantically equivalent prompts in different languages. We propose a linear shortcut method that bypasses computations in the final layers, enhancing both prediction accuracy and cross-lingual consistency.
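Several entries above (e.g., Cross-lingual Collapse and Learn Globally, Speak Locally) mention language-consistency rewards that penalize reasoning traces drifting away from the input language. The snippet below is a minimal, hypothetical sketch of that general idea, assuming the off-the-shelf langdetect package for language identification; it is not the reward definition used in any of the cited papers.

```python
# Illustrative language-consistency reward: return 1.0 when the reasoning trace is
# identified as being in the same language as the input, else 0.0. A sketch of the
# general concept only, not the exact reward from the papers listed above.
from langdetect import detect  # pip install langdetect


def language_consistency_reward(input_text: str, reasoning_trace: str) -> float:
    """Return 1.0 if the trace appears to share the input's language, else 0.0."""
    try:
        input_lang = detect(input_text)
        trace_lang = detect(reasoning_trace)
    except Exception:
        # Very short or noisy text can make language identification fail; no reward.
        return 0.0
    return 1.0 if input_lang == trace_lang else 0.0


# Example: a French question answered with an English-language reasoning trace.
print(language_consistency_reward("Combien font 17 fois 23 ?", "17 times 23 equals 391."))
```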