Fugu-MT Paper Translation (Abstract): Long Chain-of-Thought Reasoning Across Languages

Paper overview: Long Chain-of-Thought Reasoning Across Languages

arxiv url: http://arxiv.org/abs/2508.14828v2
Date: Thu, 09 Oct 2025 05:36:20 GMT
Status: Translation complete
Last updated in system: 2025-10-10 15:34:28.63953
Title: Long Chain-of-Thought Reasoning Across Languages
Title (reference translation): Long-Chain Reasoning Across Languages
Authors: Josh Barua, Seun Eisape, Kayo Yin, Alane Suhr
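Abstract summary: We examine four key stages of model development: scaling, pretraining, post-training, and inference. Scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training.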
Reference score (independently computed attention score): 14.79632337642471
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world's languages. In this work, we systematically investigate four key stages of model development--scaling, pretraining, post-training, and inference--to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs, but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs such as mathematical reasoning. Shifting to pretraining, we find that adding a specialized reasoning stage enhances En-CoT performance but degrades Target-CoT, whereas broad multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We demonstrate that fine-tuning on reasoning traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes in CoTs. We release models, datasets, and code to foster further research.
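Abstract (reference translation): Large reasoning models have shown a remarkable ability to generate long chains-of-thought (CoTs) in English, but how these long-form reasoning abilities transfer to most of the world's languages is not yet understood. In this work, we systematically examine four key stages of model development (scaling, pretraining, post-training, and inference) to understand how far long CoT capabilities extend beyond English. We compare two reasoning settings: En-CoT, where the model processes target-language inputs but reasons in English, and Target-CoT, where the model processes target-language inputs and generates long CoTs in the target language. Scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks that require long, multi-step CoTs, such as mathematical reasoning. Turning to pretraining, we find that adding a specialized reasoning stage improves En-CoT performance but degrades Target-CoT, whereas multilingual pretraining improves both modes simultaneously. Given the scarcity of high-quality reasoning traces in languages other than English, we explore synthetic data curation approaches for post-training. We show that fine-tuning on reasoning traces automatically translated from gold English traces outperforms fine-tuning on target-language traces distilled from large reasoning models. Finally, we report disparities in inference efficiency between languages and uncover language-specific failure modes in CoTs. We release models, datasets, and code to foster further research.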
