Fugu-MT Paper Translation (Abstract): Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

Paper overview: Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

arxiv url: http://arxiv.org/abs/2508.14029v1
Date: Tue, 19 Aug 2025 17:42:45 GMT
Status: Translation complete
Last updated in system: 2025-08-20 15:36:32.036026
Title: Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR
Authors: Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen
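Abstract summary: The authors propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training. The strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR.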
Reference score (independently computed attention metric): 102.05010188302428
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.
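The abstract contrasts Pass@1 with Pass@k but does not say how Pass@k is estimated. As a minimal sketch, assuming the commonly used unbiased estimator (draw n samples per problem, count the c correct ones, and compute the probability that a random subset of k samples contains at least one correct answer), the metric could be computed as follows. The function name `pass_at_k` and the sample counts are illustrative and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one problem from n samples, c of which are correct.

    Assumption: the standard unbiased estimator 1 - C(n-c, k) / C(n, k);
    the abstract does not state which estimator the authors use.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # C(n-c, k) / C(n, k) is the chance that a size-k subset has no correct sample.
    # math.comb returns 0 when k > n - c, so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 64 samples for one problem, only 2 of them correct.
low_k = pass_at_k(64, 2, 1)
high_k = pass_at_k(64, 2, 32)
print(f"Pass@1 = {low_k:.4f}, Pass@32 = {high_k:.4f}")
```

With these hypothetical counts, Pass@1 is about 0.03 while Pass@32 is about 0.75: a problem the policy rarely solves still counts toward Pass@k as long as a few diverse samples succeed. This is one way to read the abstract's concern that reduced generation diversity limits Pass@k even when Pass@1 improves.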

Related papers