Fugu-MT Paper Translation (Abstract): Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

Paper overview: Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

arxiv url: http://arxiv.org/abs/2602.02477v1
Date: Mon, 02 Feb 2026 18:54:54 GMT
Status: Translation complete
System update date: 2026-02-03 19:28:34.386945
Title: Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability
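Title (reference translation): Training LLMs for Divide-and-Conquer Reasoning Improves Test-Time Scalability
Authors: Xiao Liang, Zhong-Zhi Li, Zhenghao Lin, Eric Hancheng Jiang, Hengyuan Zhang, Yelong Shen, Kai-Wei Chang, Ying Nian Wu, Yeyun Gong, Weizhu Chen
Abstract summary: Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. This paper proposes an end-to-end reinforcement learning (RL) framework to enhance DAC-style reasoning capacity.
Reference score (independently computed attention score): 129.1296673737603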
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model's capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs' reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.
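Abstract (reference translation): Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Even so, at the limits of model capability CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model's capacity to fully exploit this potential. To bridge this gap and fully unlock LLMs' reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and then addresses the original problem conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework gives the model a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.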
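The decompose, solve, and combine step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `Policy` interface, the prompt wording, the one-subproblem-per-line output format, the `max_subproblems` cap, and the choice to show each subproblem the earlier sub-solutions are all assumptions made for illustration. The RL training loop is not shown.

```python
from typing import Callable, List

# Hypothetical interface: a policy maps a prompt string to a completion string.
Policy = Callable[[str], str]


def dac_solve(problem: str, policy: Policy, max_subproblems: int = 4) -> str:
    """One divide-and-conquer step: decompose, solve sequentially, then combine."""
    # 1. Divide: ask the policy for subproblems, one per line (assumed format).
    raw = policy(f"Decompose into subproblems, one per line:\n{problem}")
    subproblems = [line.strip() for line in raw.splitlines() if line.strip()]
    subproblems = subproblems[:max_subproblems]

    # 2. Conquer: solve subproblems in order. Passing earlier sub-solutions
    #    as context is an assumption, not something the abstract specifies.
    solved: List[str] = []
    for sub in subproblems:
        context = "\n".join(solved)
        answer = policy(f"Known results:\n{context}\nSolve:\n{sub}")
        solved.append(f"{sub} -> {answer}")

    # 3. Combine: answer the original problem conditioned on the sub-solutions.
    context = "\n".join(solved)
    return policy(
        f"Known results:\n{context}\nNow solve the original problem:\n{problem}"
    )
```

Any callable that takes a prompt and returns text can stand in for `policy`, so the control flow can be exercised with a stub before wiring in a real model.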

List of related papers