Long Chain-of-Thought Reasoning Across Languages
- URL: http://arxiv.org/abs/2508.14828v2
- Date: Thu, 09 Oct 2025 05:36:20 GMT
- Title: Long Chain-of-Thought Reasoning Across Languages
- Authors: Josh Barua, Seun Eisape, Kayo Yin, Alane Suhr
- Abstract summary: We investigate four key stages of model development: scaling, pretraining, post-training, and inference. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training.
- Score: 14.79632337642471
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world's languages. In this work, we systematically investigate four key stages of model development--scaling, pretraining, post-training, and inference--to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs, but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs such as mathematical reasoning. Shifting to pretraining, we find that adding a specialized reasoning stage enhances En-CoT performance but degrades Target-CoT, whereas broad multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We demonstrate that fine-tuning on reasoning traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes in CoTs. We release models, datasets, and code to foster further research.
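The two reasoning settings the abstract compares can be sketched as a small prompt-construction helper. This is an illustrative sketch only: it assumes the question is already in the target language, and the instruction wording is invented here, not taken from the paper.

```python
# Hypothetical sketch of the En-CoT vs Target-CoT settings described above.
# In both settings the input question is in the target language; they differ
# only in the language the model is asked to reason in.

def make_prompt(question: str, target_lang: str, setting: str) -> str:
    """Build a prompt for En-CoT (reason in English about a target-language
    question) or Target-CoT (reason in the target language as well)."""
    if setting == "en-cot":
        instruction = "Think step by step in English, then give the final answer."
    elif setting == "target-cot":
        instruction = f"Think step by step in {target_lang}, then give the final answer."
    else:
        raise ValueError(f"unknown setting: {setting}")
    return f"{instruction}\n\nQuestion: {question}"

print(make_prompt("Combien font 2 + 3 ?", "French", "target-cot"))
```

Under this framing, the paper's finding is that the same model scores higher when prompted with the first template than with the second, especially on multi-step math.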
Related papers
- A Comprehensive Evaluation of Multilingual Chain-of-Thought Reasoning: Performance, Consistency, and Faithfulness Across Languages [48.68444770923683]
We present the first comprehensive study of multilingual Chain-of-Thought (CoT) reasoning. We measure language compliance, answer accuracy, and answer consistency when LRMs are prompt-hacked to think in a target language. We find that the quality and effectiveness of thinking traces vary substantially depending on the prompt language.
arXiv Detail & Related papers (2025-10-10T17:06:50Z) - EfficientXLang: Towards Improving Token Efficiency Through Cross-Lingual Reasoning [12.511775058257328]
We investigate whether English is the most token-efficient language for reasoning. We find that reasoning in non-English languages not only reduces token usage, but also preserves accuracy. The extent of improvement depends on the model's multilingual strength.
arXiv Detail & Related papers (2025-06-30T20:29:52Z) - Cross-Lingual Pitfalls: Automatic Probing Cross-Lingual Weakness of Multilingual Large Language Models [55.14276067678253]
This paper introduces a novel methodology for efficiently identifying inherent cross-lingual weaknesses in Large Language Models (LLMs). We construct a new dataset of over 6,000 bilingual pairs across 16 languages using this methodology, demonstrating its effectiveness in revealing weaknesses even in state-of-the-art models. Further experiments investigate the relationship between linguistic similarity and cross-lingual weaknesses, revealing that linguistically related languages share similar performance patterns.
arXiv Detail & Related papers (2025-05-24T12:31:27Z) - Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models? [59.970391602080205]
Despite multilingual training, LRMs tend to default to reasoning in high-resource languages at test time. Cultural reasoning degrades performance on reasoning tasks but benefits cultural tasks, while safety evaluations exhibit language-specific behavior.
arXiv Detail & Related papers (2025-05-23T02:46:18Z) - Tracing Multilingual Factual Knowledge Acquisition in Pretraining [62.95057983661562]
Large Language Models (LLMs) are capable of recalling multilingual factual knowledge present in their pretraining data. We trace how factual recall and crosslingual consistency evolve during pretraining, focusing on OLMo-7B. We find that both accuracy and consistency improve over time for most languages.
arXiv Detail & Related papers (2025-05-20T18:39:56Z) - Crosslingual Reasoning through Test-Time Scaling [51.55526326294275]
We find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages. While English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. We observe poor out-of-domain reasoning generalization, in particular from STEM to cultural commonsense knowledge, even for English.
arXiv Detail & Related papers (2025-05-08T16:50:06Z) - Scaling Test-time Compute for Low-resource Languages: Multilingual Reasoning in LLMs [3.9530780161144667]
We investigate the mechanism by which Large Language Models internally operate in a latent space biased toward their dominant language. Given input in a low-resource language, we train models to generate the chain-of-thought (CoT) in English while outputting the final response in the target language. Our experiments demonstrate that this approach, named English-Pivoted CoT Training, outperforms other baselines, with up to 28.33% improvement.
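The training-example format this summary describes (target-language input, English reasoning, target-language answer) can be sketched as follows. The field names and the `<think>` delimiter are illustrative assumptions, not details from the paper.

```python
# Hedged sketch of an "English-Pivoted CoT" training example: the question
# arrives in the low-resource target language, the chain-of-thought is kept
# in English, and the final answer is emitted back in the target language.
# Field names and tags are hypothetical.

def build_pivot_example(question_tgt: str, cot_en: str, answer_tgt: str) -> dict:
    """Assemble one supervised fine-tuning pair in the pivoted format."""
    target = f"<think>{cot_en}</think>\n{answer_tgt}"
    return {"input": question_tgt, "target": target}

ex = build_pivot_example(
    question_tgt="Combien font 2 + 3 ?",  # target-language input
    cot_en="2 plus 3 equals 5.",          # English reasoning pivot
    answer_tgt="La réponse est 5.",       # target-language answer
)
print(ex["target"])
```

Fine-tuning on pairs in this shape is what lets the model keep its (stronger) English reasoning while still serving users in the input language.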
arXiv Detail & Related papers (2025-04-02T16:58:36Z) - Demystifying Multilingual Chain-of-Thought in Process Reward Modeling [86.98098988779809]
We tackle the challenge of extending process reward models (PRMs) to multilingual settings. We train multilingual PRMs on a dataset spanning seven languages, translated from English. Our results highlight the sensitivity of multilingual PRMs to both the number of training languages and the volume of English data.
arXiv Detail & Related papers (2025-02-18T09:11:44Z) - AdaMCoT: Rethinking Cross-Lingual Factual Reasoning through Adaptive Multilingual Chain-of-Thought [40.16140566668239]
We introduce AdaMCOT, a framework that enhances multilingual factual reasoning. AdaMCOT dynamically routes thought processes through intermediary "thinking languages" before generating target-language responses. Our evaluation demonstrates substantial improvements in both factual reasoning quality and cross-lingual consistency.
arXiv Detail & Related papers (2025-01-27T15:48:57Z) - mCoT: Multilingual Instruction Tuning for Reasoning Consistency in Language Models [21.616940026409818]
Large language models (LLMs) with Chain-of-thought (CoT) have recently emerged as a powerful technique for eliciting reasoning to improve downstream tasks.
We study multilingual reasoning consistency across multiple languages, using popular open-source LLMs.
We introduce multilingual CoT instruction tuning to boost reasoning capability across languages, thereby improving model consistency.
arXiv Detail & Related papers (2024-06-04T13:30:45Z) - Could We Have Had Better Multilingual LLMs If English Was Not the Central Language? [4.655168524016426]
Large Language Models (LLMs) demonstrate strong machine translation capabilities on languages they are trained on.
Our study delves into Llama2's translation capabilities.
Our experiments show that the 7B Llama2 model yields above 10 BLEU when translating into all languages it has seen.
arXiv Detail & Related papers (2024-02-21T16:32:38Z) - Question Translation Training for Better Multilingual Reasoning [108.10066378240879]
Large language models show compelling performance on reasoning tasks but they tend to perform much worse in languages other than English.
A typical solution is to translate instruction data into all languages of interest, and then train on the resulting multilingual data, which is called translate-training.
In this paper we explore the benefits of question alignment, where we train the model to translate reasoning questions into English by finetuning on X-English parallel question data.
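The question-alignment stage described above fine-tunes on X-English parallel question pairs. A minimal sketch of building one such pair follows; the prompt/completion field names and instruction wording are illustrative assumptions, not the paper's exact format.

```python
# Hypothetical sketch of one question-alignment training pair: the model is
# taught to translate a reasoning question from language X into English,
# which aligns non-English questions with the model's English reasoning.

def make_alignment_pair(question_x: str, question_en: str, lang: str) -> dict:
    """Build an (X question -> English question) fine-tuning pair."""
    prompt = f"Translate the following {lang} question into English:\n{question_x}"
    return {"prompt": prompt, "completion": question_en}

pair = make_alignment_pair("¿Cuánto es 7 por 8?", "What is 7 times 8?", "Spanish")
print(pair["prompt"])
```

In contrast to translate-training, only the questions (not full instruction data) need parallel supervision here.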
arXiv Detail & Related papers (2024-01-15T16:39:10Z) - xCoT: Cross-lingual Instruction Tuning for Cross-lingual Chain-of-Thought Reasoning [36.34986831526529]
Chain-of-thought (CoT) has emerged as a powerful technique to elicit reasoning in large language models.
We propose a cross-lingual instruction fine-tuning framework (xCOT) to transfer knowledge from high-resource languages to low-resource languages.
arXiv Detail & Related papers (2024-01-13T10:53:53Z) - GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words that have long-range dependencies or are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
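The syntactic distances this summary refers to can be computed directly from a dependency parse. Below is a hedged sketch, not GATE's implementation: it derives pairwise tree distances from head indices via BFS, which a model could then use to bias self-attention between words at different syntactic distances.

```python
# Illustrative sketch: pairwise syntactic (dependency-tree) distances from a
# head-index parse. heads[i] is the index of token i's head; the root points
# to itself. A self-attention layer could use these distances as a bias.
from collections import deque

def syntactic_distances(heads: list[int]) -> list[list[int]]:
    """Return the matrix of tree distances between all token pairs."""
    n = len(heads)
    adj = [[] for _ in range(n)]
    for i, h in enumerate(heads):
        if h != i:  # skip the root's self-loop
            adj[i].append(h)
            adj[h].append(i)
    dist = []
    for src in range(n):  # BFS from each token over the undirected tree
        d = [-1] * n
        d[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if d[v] == -1:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist.append(d)
    return dist

# "She ate lunch": "ate" (index 1) is the root, heading both other tokens.
print(syntactic_distances([1, 1, 1]))  # → [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```

Because the distances come from the (language-agnostic) dependency tree rather than surface word order, the same bias applies across languages.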
arXiv Detail & Related papers (2020-10-06T20:30:35Z) - Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation [61.88012735215636]
Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs.
UNMT can only translate between a single language pair and cannot produce translation results for multiple language pairs at the same time.
In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder.
arXiv Detail & Related papers (2020-04-21T17:26:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.