Could Thinking Multilingually Empower LLM Reasoning?
- URL: http://arxiv.org/abs/2504.11833v1
- Date: Wed, 16 Apr 2025 07:45:10 GMT
- Title: Could Thinking Multilingually Empower LLM Reasoning?
- Authors: Changjiang Gao, Xu Huang, Wenhao Zhu, Shujian Huang, Lei Li, Fei Yuan,
- Abstract summary: We explore the upper bound of harnessing multilingualism in reasoning tasks.<n>We find that multilingual reasoning promises significantly (by nearly 10 Acc@$k$ points) and robustly (tolerance for variations in translation quality and language choice) higher upper bounds than English-only reasoning.
- Score: 41.62726542483646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous work indicates that large language models exhibit a significant "English bias", i.e. they often perform better when tasks are presented in English. Interestingly, we have observed that using certain other languages in reasoning tasks can yield better performance than English. However, this phenomenon remains under-explored. In this paper, we explore the upper bound of harnessing multilingualism in reasoning tasks, suggesting that multilingual reasoning promises significantly (by nearly 10 Acc@$k$ points) and robustly (tolerance for variations in translation quality and language choice) higher upper bounds than English-only reasoning. Besides analyzing the reason behind the upper bound and challenges in reaching it, we also find that common answer selection methods cannot achieve this upper bound, due to their limitations and biases. These insights could pave the way for future research aimed at fully harnessing the potential of multilingual reasoning in LLMs.
Related papers
- MultiNRC: A Challenging and Native Multilingual Reasoning Evaluation Benchmark for LLMs [56.87573414161703]
We introduce the Multilingual Native Reasoning Challenge (MultiNRC), a benchmark to assess Large Language Models (LLMs)<n>MultiNRC covers four core reasoning categories: language-specific linguistic reasoning, wordplay & riddles, cultural/tradition reasoning, and math reasoning with cultural relevance.<n>For cultural/tradition reasoning and math reasoning with cultural relevance, we also provide English equivalent translations of the multilingual questions by manual translation from native speakers fluent in English.
arXiv Detail & Related papers (2025-07-23T12:56:31Z) - Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning [38.52080213211765]
We introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages.<n>We propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning.<n>Our results show that BRIDGE significantly enhances multilingual reasoning fidelity.
arXiv Detail & Related papers (2025-07-07T19:04:36Z) - EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning [12.511775058257328]
We investigate whether English is the most token-efficient language for reasoning.<n>We find that reasoning in non-English languages not only reduces token usage, but also preserves accuracy.<n>The extent of improvement depends on the models multilingual strength.
arXiv Detail & Related papers (2025-06-30T20:29:52Z) - When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy [9.021965237274244]
Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks.<n>This capability is as important as answer accuracy for real world applications because users may find the reasoning trace useful for oversight only when it is expressed in their own language.<n>We evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages.
arXiv Detail & Related papers (2025-05-28T21:44:12Z) - Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time.<n>Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z) - When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners [111.50503126693444]
We show that language-specific ablation consistently boosts multilingual reasoning performance.<n>Compared to post-training, our training-free ablation achieves comparable or superior results with minimal computational overhead.
arXiv Detail & Related papers (2025-05-21T08:35:05Z) - Crosslingual Reasoning through Test-Time Scaling [51.55526326294275]
We find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages.<n>While English-centric RLM's CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs.<n>We observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English.
arXiv Detail & Related papers (2025-05-08T16:50:06Z) - Assessing Large Language Models in Agentic Multilingual National Bias [31.67058518564021]
Cross-language disparities in reasoning-based recommendations remain largely unexplored.<n>This study is the first to address this gap.<n>We investigate multilingual bias in state-of-the-art LLMs by analyzing their responses to decision-making tasks across multiple languages.
arXiv Detail & Related papers (2025-02-25T08:07:42Z) - Demystifying Multilingual Chain-of-Thought in Process Reward Modeling [71.12193680015622]
We tackle the challenge of extending process reward models (PRMs) to multilingual settings.
We train multilingual PRMs on a dataset spanning seven languages, which is translated from English.
Our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data.
arXiv Detail & Related papers (2025-02-18T09:11:44Z) - The Multilingual Mind : A Survey of Multilingual Reasoning in Language Models [18.399229357408043]
Multilingual reasoning requires language models to handle logical reasoning across languages.<n>This survey provides the first in-depth review of multilingual reasoning in Language Models.
arXiv Detail & Related papers (2025-02-13T16:25:16Z) - Understanding and Mitigating Language Confusion in LLMs [76.96033035093204]
We evaluate 15 typologically diverse languages with existing and newly-created English and multilingual prompts.
We find that Llama Instruct and Mistral models exhibit high degrees of language confusion.
We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning.
arXiv Detail & Related papers (2024-06-28T17:03:51Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - LogicBench: Towards Systematic Evaluation of Logical Reasoning Ability of Large Language Models [52.03659714625452]
Recently developed large language models (LLMs) have been shown to perform remarkably well on a wide range of language understanding tasks.
But, can they really "reason" over the natural language?
This question has been receiving significant research attention and many reasoning skills such as commonsense, numerical, and qualitative have been studied.
arXiv Detail & Related papers (2024-04-23T21:08:49Z) - Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models [79.46179534911019]
Large language models (LLMs) have demonstrated multilingual capabilities; yet, they are mostly English-centric due to imbalanced training corpora.
This work extends the evaluation from NLP tasks to real user queries.
For culture-related tasks that need deep language understanding, prompting in the native language tends to be more promising.
arXiv Detail & Related papers (2024-03-15T12:47:39Z) - Could We Have Had Better Multilingual LLMs If English Was Not the Central Language? [4.655168524016426]
Large Language Models (LLMs) demonstrate strong machine translation capabilities on languages they are trained on.
Our study delves into Llama2's translation capabilities.
Our experiments show that the 7B Llama2 model yields above 10 BLEU when translating into all languages it has seen.
arXiv Detail & Related papers (2024-02-21T16:32:38Z) - Breaking the Language Barrier: Improving Cross-Lingual Reasoning with
Structured Self-Attention [18.439771003766026]
We study whether multilingual language models (MultiLMs) can transfer logical reasoning abilities to other languages when they are fine-tuned for reasoning in a different language.
We demonstrate that although MultiLMs can transfer reasoning ability across languages in a monolingual setting, they struggle to transfer reasoning abilities in a code-switched setting.
Following this observation, we propose a novel attention mechanism that uses a dedicated set of parameters to encourage cross-lingual attention in code-switched sequences.
arXiv Detail & Related papers (2023-10-23T18:06:38Z) - Large Language Models are In-Context Semantic Reasoners rather than
Symbolic Reasoners [75.85554779782048]
Large Language Models (LLMs) have excited the natural language and machine learning community over recent years.
Despite of numerous successful applications, the underlying mechanism of such in-context capabilities still remains unclear.
In this work, we hypothesize that the learned textitsemantics of language tokens do the most heavy lifting during the reasoning process.
arXiv Detail & Related papers (2023-05-24T07:33:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.