The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI
- URL: http://arxiv.org/abs/2510.20647v1
- Date: Thu, 23 Oct 2025 15:22:00 GMT
- Title: The Reasoning Lingua Franca: A Double-Edged Sword for Multilingual AI
- Authors: Alan Saji, Raj Dabre, Anoop Kunchukuttan, Ratish Puduppully
- Abstract summary: Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances.
- Score: 25.42472949919922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Reasoning Models (LRMs) achieve strong performance on mathematical, scientific, and other question-answering tasks, but their multilingual reasoning abilities remain underexplored. When presented with non-English questions, LRMs often default to reasoning in English, raising concerns about interpretability and the handling of linguistic and cultural nuances. We systematically compare an LRM's reasoning in English versus the language of the question. Our evaluation spans two tasks: MGSM and GPQA Diamond. Beyond measuring answer accuracy, we also analyze cognitive attributes in the reasoning traces. We find that English reasoning traces exhibit a substantially higher presence of these cognitive behaviors, and that reasoning in English generally yields higher final-answer accuracy, with the performance gap increasing as tasks become more complex. However, this English-centric strategy is susceptible to a key failure mode - getting "Lost in Translation," where translation steps introduce errors that would have been avoided by reasoning in the question's language.
Related papers
- Large Reasoning Models Are (Not Yet) Multilingual Latent Reasoners [48.68444770923683]
Large reasoning models (LRMs) achieve strong performance on mathematical reasoning tasks. LRMs often arrive at the correct answer before completing these textual reasoning steps. This phenomenon has been explored in English, but its multilingual behavior remains largely unknown.
arXiv Detail & Related papers (2026-01-06T13:20:17Z) - Think Natively: Unlocking Multilingual Reasoning with Consistency-Enhanced Reinforcement Learning [85.7304930030649]
We propose M-Thinker, which is trained with a Language Consistency reward and a Cross-lingual Thinking Alignment reward. M-Thinker achieves nearly 100% language consistency and superior performance on two multilingual benchmarks.
arXiv Detail & Related papers (2025-10-08T17:55:02Z) - MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs [56.87573414161703]
We introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark to assess Large Language Models (LLMs). MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance. For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions by manual translation from native speakers fluent in English.
arXiv Detail & Related papers (2025-07-23T12:56:31Z) - When Models Reason in Your Language: Controlling Thinking Language Comes at the Cost of Accuracy [16.897177356930104]
Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. This capability is as important as answer accuracy for real-world applications, because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages.
arXiv Detail & Related papers (2025-05-28T21:44:12Z) - MMATH: A Multilingual Benchmark for Mathematical Reasoning [94.05289799605957]
We introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. We observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models.
arXiv Detail & Related papers (2025-05-25T12:47:39Z) - Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time. Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z) - Crosslingual Reasoning through Test-Time Scaling [51.55526326294275]
We find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages. While English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. We observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English.
arXiv Detail & Related papers (2025-05-08T16:50:06Z) - Could Thinking Multilingually Empower LLM Reasoning? [41.62726542483646]
We explore the upper bound of harnessing multilingualism in reasoning tasks. We find that multilingual reasoning promises significantly (by nearly 10 Acc@$k$ points) and robustly (with tolerance for variations in translation quality and language choice) higher upper bounds than English-only reasoning.
arXiv Detail & Related papers (2025-04-16T07:45:10Z)
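The Acc@$k$ metric cited above is commonly computed as best-of-$k$ accuracy: a question counts as solved if at least one of $k$ sampled responses (e.g. reasoning attempts in different languages) is correct. A minimal sketch of that reading, with a hypothetical helper and illustrative data:

```python
def acc_at_k(samples_correct, k):
    """Acc@k under a best-of-k reading: a question is solved if any of
    its first k sampled responses is correct. `samples_correct` is a
    list of per-question lists of booleans (illustrative, not the
    paper's exact protocol)."""
    hits = sum(any(correct[:k]) for correct in samples_correct)
    return hits / len(samples_correct)

# Hypothetical data: each inner list is the correctness of k sampled
# answers for one question, e.g. one attempt per reasoning language.
data = [
    [False, True, False],   # solved only on the 2nd attempt
    [True, True, True],     # solved on every attempt
    [False, False, False],  # never solved
]
print(acc_at_k(data, 1))  # 1/3: only the single first attempt counts
print(acc_at_k(data, 3))  # 2/3: any of three attempts may count
```

Under this reading, the gap between Acc@1 and Acc@$k$ is exactly the headroom that multilingual sampling could unlock, which is how an "upper bound" over language choices can exceed English-only accuracy.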
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.