Fugu-MT Paper Translation (Summary): Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning

Paper summary: Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning

arxiv url: http://arxiv.org/abs/2601.22297v1
Date: Thu, 29 Jan 2026 20:21:44 GMT
Status: Translation complete
Last updated in system: 2026-02-02 18:28:15.047957
Title: Prepare Reasoning Language Models for Multi-Agent Debate with Self-Debate Reinforcement Learning
Authors: Chenxi Liu, Yanshuo Chen, Ruibo Chen, Tianyi Xiong, Tong Zheng, Heng Huang
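Abstract summary: Self-Debate Reinforcement Learning (SDRL) is a training framework that equips a single large language model with strong standalone problem-solving ability. We show that SDRL improves overall Multi-Agent Debate (MAD) performance while simultaneously strengthening single-model reasoning.
Reference score (independently computed attention score): 49.99694105650486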
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The reasoning abilities of large language models (LLMs) have been substantially improved by reinforcement learning with verifiable rewards (RLVR). At test time, collaborative reasoning through Multi-Agent Debate (MAD) has emerged as a promising approach for enhancing LLM performance. However, current RLVR methods typically train LLMs to solve problems in isolation, without explicitly preparing them to synthesize and benefit from different rationales that arise during debate. In this work, we propose Self-Debate Reinforcement Learning (SDRL), a training framework that equips a single LLM with strong standalone problem-solving ability and the capability to learn from diverse reasoning trajectories in MAD. Given a prompt, SDRL first samples multiple candidate solutions, then constructs a debate context with diverse reasoning paths and generates second-turn responses conditioned on this context. Finally, SDRL jointly optimizes both the initial and debate-conditioned responses, yielding a model that is effective as both a standalone solver and a debate participant. Experiments across multiple base models and reasoning benchmarks show that SDRL improves overall MAD performance while simultaneously strengthening single model reasoning.
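The data flow described in the abstract (sample first-turn solutions, build a debate context, sample debate-conditioned responses, optimize both together) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not specify the sampler, how diverse reasoning paths are selected, the reward, or the optimizer. The names `build_debate_context`, `group_advantages`, `sdrl_batch`, the prompt wording, the "keep distinct responses" selection rule, and the group-normalized advantage are all assumptions introduced here for illustration.

```python
from statistics import mean, pstdev
from typing import Callable, List, Tuple


def build_debate_context(question: str, candidates: List[str]) -> str:
    """Build a second-turn prompt that shows several first-turn solutions.

    The prompt wording is hypothetical; the abstract does not give a template.
    """
    shown = "\n\n".join(f"Agent {i + 1}: {c}" for i, c in enumerate(candidates))
    return (
        f"Question: {question}\n\n"
        f"Solutions from other agents:\n{shown}\n\n"
        "Considering these solutions, give your updated answer."
    )


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one group of samples.

    Assumption: a group-relative baseline, chosen here only for illustration.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # no learning signal if all rewards agree
    return [(r - mu) / sigma for r in rewards]


def sdrl_batch(
    question: str,
    sample: Callable[[str], str],    # hypothetical policy: prompt -> response
    verify: Callable[[str], float],  # hypothetical verifiable reward
    n_first: int = 4,
    n_context: int = 2,
    n_second: int = 4,
) -> List[Tuple[str, str, float]]:
    """Return (prompt, response, advantage) triples for both turns."""
    # Turn 1: sample multiple candidate solutions for the raw question.
    first = [sample(question) for _ in range(n_first)]
    first_adv = group_advantages([verify(r) for r in first])

    # Debate context: keep distinct responses as a crude stand-in for
    # "diverse reasoning paths" (an assumed selection rule).
    distinct = list(dict.fromkeys(first))[:n_context]
    context = build_debate_context(question, distinct)

    # Turn 2: sample responses conditioned on the debate context.
    second = [sample(context) for _ in range(n_second)]
    second_adv = group_advantages([verify(r) for r in second])

    # Joint batch: both turns would feed the same policy-gradient update.
    batch = [(question, r, a) for r, a in zip(first, first_adv)]
    batch += [(context, r, a) for r, a in zip(second, second_adv)]
    return batch
```

In a real training loop, `sample` would be the language model being trained and the returned triples would be passed to whichever policy-gradient objective is in use; the sketch stops at batch construction because the abstract does not describe the update rule.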

Related papers