Fugu-MT Paper Translation (Abstract): Better LLM Reasoning via Dual-Play

Paper abstract: Better LLM Reasoning via Dual-Play

arxiv url: http://arxiv.org/abs/2511.11881v2
Date: Wed, 19 Nov 2025 01:20:49 GMT
Status: Translation complete
Last updated in system: 2025-11-20 13:41:21.087007
Title: Better LLM Reasoning via Dual-Play
Title (reference translation): Improving LLM Reasoning via Dual-Play
Authors: Zhengxin Zhang, Chengyu Huang, Aochong Oliver Li, Claire Cardie
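Abstract summary: We introduce PasoDoble, a novel dual-play framework for large language models. PasoDoble adversarially trains two models initialized from the same base model. Experimental results show that PasoDoble can improve the reasoning performance of LLMs.
Reference score (independently computed attention score): 13.152283780379278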
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have achieved remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR), yet still rely heavily on external supervision (e.g., curated labels). Adversarial learning, particularly through self-play, offers a promising alternative that enables models to iteratively learn from themselves - thus reducing reliance on external supervision. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, adapting dual-play training to LLMs remains limited, largely due to their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model: a Proposer, which generates challenging questions with ground-truth answers, and a Solver, which attempts to solve them. We enrich the Proposer with knowledge from a pre-training dataset to ensure the questions' quality and diversity. To avoid reward hacking, the Proposer is rewarded for producing only valid questions that push the Solver's limit, while the Solver is rewarded for solving them correctly, and both are updated jointly. To further enhance training stability, we introduce an optional offline paradigm that decouples Proposer and Solver updates, alternately updating each for several steps while holding the other fixed. Notably, PasoDoble operates without supervision during training. Experimental results show that PasoDoble can improve the reasoning performance of LLMs. Our project page is available at https://hcy123902.github.io/PasoDoble.
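Abstract (reference translation): Large language models (LLMs) have made remarkable progress through Reinforcement Learning with Verifiable Rewards (RLVR). Adversarial learning, particularly through self-play, offers a promising alternative that lets models learn iteratively from themselves. Dual-play extends adversarial learning by assigning specialized roles to two models and training them against each other, fostering sustained competition and mutual evolution. Despite its promise, adapting dual-play training to LLMs remains limited, largely because of their susceptibility to reward hacking and training instability. In this paper, we introduce PasoDoble, a novel LLM dual-play framework. PasoDoble adversarially trains two models initialized from the same base model. To ensure the quality and diversity of the questions, we enrich the Proposer with knowledge from a pre-training dataset. To avoid reward hacking, the Proposer is rewarded only for producing valid questions that push the Solver's limit, while the Solver is rewarded for solving them correctly, and both are updated jointly. To further improve training stability, we introduce an optional offline paradigm that decouples Proposer and Solver updates. Notably, PasoDoble operates without supervision during training. Experimental results show that PasoDoble can improve the reasoning performance of LLMs. Our project page is available at https://hcy123902.github.io/PasoDoble.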
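The reward structure and the optional offline schedule described in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation. The abstract does not give the reward formulas, so the validity check (at least one Solver attempt matches the Proposer's answer), the difficulty reward (one minus Solver accuracy), and all function and parameter names below are assumptions made for illustration only.

```python
def solver_reward(prediction: int, truth: int) -> float:
    """Solver is rewarded for answering correctly (assumed 0/1 reward)."""
    return 1.0 if prediction == truth else 0.0


def proposer_reward(solver_attempts: list[int], truth: int) -> float:
    """Hypothetical Proposer reward: zero for questions judged invalid,
    otherwise higher the more the question challenges the Solver."""
    accuracy = sum(a == truth for a in solver_attempts) / len(solver_attempts)
    if accuracy == 0.0:
        # Assumed validity check: no attempt reproduced the Proposer's
        # answer, so the question is treated as invalid and earns nothing.
        return 0.0
    # Assumed difficulty term: trivially easy questions also earn nothing.
    return 1.0 - accuracy


def offline_schedule(total_steps: int, steps_per_phase: int):
    """Yield which model is updated at each step while the other is held
    fixed, alternating every `steps_per_phase` steps (assumed parameter)."""
    for step in range(total_steps):
        phase = (step // steps_per_phase) % 2
        yield "proposer" if phase == 0 else "solver"


if __name__ == "__main__":
    print(proposer_reward([4, 4, 5, 6], truth=4))
    print(list(offline_schedule(total_steps=6, steps_per_phase=2)))
```

In this sketch a question that every Solver attempt gets right and a question that no attempt gets right both yield zero Proposer reward, which is one simple way to discourage trivial or unanswerable questions. The actual criteria used by PasoDoble may differ.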

Related papers