EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning
- URL: http://arxiv.org/abs/2507.00246v1
- Date: Mon, 30 Jun 2025 20:29:52 GMT
- Title: EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning
- Authors: Sanchit Ahuja, Praneetha Vaddamanu, Barun Patra,
- Abstract summary: We investigate whether English is the most token-efficient language for reasoning. We find that reasoning in non-English languages not only reduces token usage but also preserves accuracy. The extent of improvement depends on the model's multilingual strength.
- Score: 12.511775058257328
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite recent advances in Language Reasoning Models (LRMs), most research focuses solely on English, even though many models are pretrained on multilingual data. In this work, we investigate: Is English the most token-efficient language for reasoning? We evaluate three open-source LRMs: DeepSeek R1, Qwen 2.5, and Qwen 3, across four math datasets and seven typologically diverse languages. We find that reasoning in non-English languages not only reduces token usage but also preserves accuracy. These gains persist even after translating the reasoning traces into English, suggesting genuine shifts in reasoning behavior rather than surface-level linguistic effects. The extent of improvement, however, depends on the model's multilingual strength. Our findings motivate a broader view of reasoning in language models, highlighting the potential of multilingual reasoning and the importance of strong multilingual foundations. The code for our work can be found at: https://github.com/microsoft/EfficientXLang.
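The core measurement in the abstract above — comparing the length of reasoning traces across languages — can be sketched as follows. This is a minimal, hypothetical illustration: it uses whitespace word counts as a stand-in for a real tokenizer (the actual study would use each model's own tokenizer), and the trace strings are invented.

```python
def count_tokens(text: str) -> int:
    """Crude proxy for a tokenizer: whitespace word count.
    A real evaluation would use the model's own tokenizer instead."""
    return len(text.split())

def avg_trace_length(traces: dict[str, list[str]]) -> dict[str, float]:
    """Average token count of reasoning traces, keyed by language."""
    return {lang: sum(count_tokens(t) for t in texts) / len(texts)
            for lang, texts in traces.items()}

# Hypothetical reasoning traces for the same problem in two languages.
traces = {
    "en": ["First we add two and three which gives five so the answer is five"],
    "de": ["Zwei plus drei ergibt fünf also ist die Antwort fünf"],
}
print(avg_trace_length(traces))
```

A lower average for a non-English language, at equal answer accuracy, is the kind of token-efficiency gain the paper reports.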
Related papers
- The Impact of Language Mixing on Bilingual LLM Reasoning [4.495689119099099]
We study language switching in Chinese-English bilingual reasoning models. Enforcing monolingual decoding reduces accuracy by 5.6 percentage points on math reasoning tasks. A lightweight probe can be trained to predict whether a potential language switch would benefit or harm reasoning.
arXiv Detail & Related papers (2025-07-21T17:56:09Z)
- Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning [38.52080213211765]
We introduce GeoFact-X, a geography-based multilingual factual reasoning benchmark with annotated reasoning traces in five languages. We propose BRIDGE, a novel training method that guides supervised fine-tuning and test-time reinforcement learning. Our results show that BRIDGE significantly enhances multilingual reasoning fidelity.
arXiv Detail & Related papers (2025-07-07T19:04:36Z)
- When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy [9.021965237274244]
Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. This capability is as important as answer accuracy for real-world applications, because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages.
arXiv Detail & Related papers (2025-05-28T21:44:12Z)
- MMATH: A Multilingual Benchmark for Mathematical Reasoning [94.05289799605957]
We introduce MMATH, a benchmark for multilingual complex reasoning spanning 374 high-quality math problems across 10 typologically diverse languages. We observe that even advanced models like DeepSeek R1 exhibit substantial performance disparities across languages and suffer from a critical off-target issue: generating responses in unintended languages. Our findings offer new insights and practical strategies for advancing the multilingual reasoning capabilities of large language models.
arXiv Detail & Related papers (2025-05-25T12:47:39Z)
- Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time. Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z)
- When Less Language is More: Language-Reasoning Disentanglement Makes LLMs Better Multilingual Reasoners [111.50503126693444]
We show that language-specific ablation consistently boosts multilingual reasoning performance. Compared to post-training, our training-free ablation achieves comparable or superior results with minimal computational overhead.
arXiv Detail & Related papers (2025-05-21T08:35:05Z)
- Crosslingual Reasoning through Test-Time Scaling [51.55526326294275]
We find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages. While English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. We observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English.
arXiv Detail & Related papers (2025-05-08T16:50:06Z)
- Could Thinking Multilingually Empower LLM Reasoning? [41.62726542483646]
We explore the upper bound of harnessing multilingualism in reasoning tasks. We find that multilingual reasoning promises significantly (by nearly 10 Acc@k points) and robustly (with tolerance for variations in translation quality and language choice) higher upper bounds than English-only reasoning.
arXiv Detail & Related papers (2025-04-16T07:45:10Z)
- Dictionary Insertion Prompting for Multilingual Reasoning on Multilingual Large Language Models [52.00446751692225]
We present a novel, simple yet effective method called Dictionary Insertion Prompting (DIP).
When given a non-English prompt, DIP looks up a word dictionary and inserts the words' English counterparts into the prompt for LLMs.
This enables better translation into English and better English reasoning steps, which leads to noticeably better results.
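The dictionary-insertion step described above can be sketched as follows. This is a hypothetical illustration: the toy German-English dictionary entries are invented, and a real DIP setup would use a full bilingual dictionary for the source language.

```python
import re

# Toy bilingual dictionary (hypothetical entries for illustration only).
DE_EN = {"summe": "sum", "zahlen": "numbers", "berechne": "compute"}

def dictionary_insertion_prompt(prompt: str, dictionary: dict[str, str]) -> str:
    """Insert English counterparts after known non-English words,
    following the DIP idea: the augmented prompt nudges the LLM
    toward translating and reasoning in English."""
    def annotate(match: re.Match) -> str:
        word = match.group(0)
        gloss = dictionary.get(word.lower())
        return f"{word} ({gloss})" if gloss else word
    return re.sub(r"\w+", annotate, prompt)

print(dictionary_insertion_prompt("Berechne die Summe der Zahlen", DE_EN))
# → "Berechne (compute) die Summe (sum) der Zahlen (numbers)"
```

The annotated string would then be sent to the LLM in place of the original non-English prompt.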
arXiv Detail & Related papers (2024-11-02T05:10:50Z)
- Could We Have Had Better Multilingual LLMs If English Was Not the Central Language? [4.655168524016426]
Large Language Models (LLMs) demonstrate strong machine translation capabilities on languages they are trained on.
Our study delves into Llama2's translation capabilities.
Our experiments show that the 7B Llama2 model yields above 10 BLEU when translating into all languages it has seen.
arXiv Detail & Related papers (2024-02-21T16:32:38Z)
- Question Translation Training for Better Multilingual Reasoning [108.10066378240879]
Large language models show compelling performance on reasoning tasks, but they tend to perform much worse in languages other than English.
A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training.
In this paper we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data.
arXiv Detail & Related papers (2024-01-15T16:39:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.