Fugu-MT Paper Translation (Abstract): Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

Paper overview: Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification

arxiv url: http://arxiv.org/abs/2606.03608v1
Date: Tue, 02 Jun 2026 13:11:09 GMT
Status: Translation complete
Last updated in system: 2026-06-03 22:00:05.0164
Title: Exploiting Verification-Generation Gap: Test-Time Reinforcement Learning with Confidence-Conditioned Verification
Authors: Jiahui Li, Jianfeng Shan, Wenpei Chen, Shunyu Wu, Jian Lou, Wenjie Feng, Dan Li, See-Kiong Ng
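Abstract summary: Test-time reinforcement learning has emerged as a promising paradigm for enhancing the reasoning abilities of large language models. This paper proposes TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification).
Reference score (independently calculated attention score): 41.12429744792745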
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Test-time reinforcement learning has emerged as a promising paradigm for enhancing the complex reasoning abilities of large language models in a completely label-free manner. Existing studies focus on Pass@1 performance, while optimizing Pass@k, which measures generation coverage for sustained exploration, remains under-explored yet critical in label-free settings. Optimizing Pass@k in a label-free setting is highly non-trivial, as directly applying the Pass@k advantage designs that are effective for RLVR yields unsatisfactory performance. Through in-depth empirical analysis, we discover the root causes hindering performance: pseudo-label estimates for low-confidence samples have a high probability of being incorrect, while candidate answers for high-confidence samples suffer from severe diversity collapse. To overcome these hurdles, we propose TTRL-CoCoV (Test-Time Reinforcement Learning with Confidence-Conditioned Verification), a novel confidence-adaptive framework that expands Pass@k coverage and improves Pass@1 performance. Based on our key insight that verification capability generally leads generation capability, TTRL-CoCoV employs a confidence-conditioned mechanism: for high-confidence samples, it bootstraps the verifier and applies an exploration-enhancing reward to prevent diversity collapse; for low-confidence samples, it delegates pseudo-label selection to the verifier to filter incorrect pseudo-labels; and for medium-confidence samples, it bypasses verification entirely. Extensive experiments demonstrate that TTRL-CoCoV outperforms the best competing methods across 6 widely recognized benchmarks, achieves average absolute gains of +9.8% in Pass@1 and +18.7% in Pass@16 over TTRL, and even achieves absolute Pass@1 improvements of up to +5.0% across multiple reasoning benchmarks when compared against fully supervised RL methods. Our code repository: https://github.com/shanjf666/CoCoV.
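
Illustrative sketch (not taken from the paper or its repository): the three-way confidence routing described in the abstract could look roughly like the following. The abstract does not say how confidence is computed or where the regime boundaries lie, so this sketch assumes confidence is the majority-vote share among sampled answers, and the thresholds LOW_CONF and HIGH_CONF, as well as the function names, are hypothetical.

```python
from collections import Counter

# Hypothetical thresholds; the abstract does not state the actual values.
LOW_CONF = 0.3
HIGH_CONF = 0.8


def majority_confidence(answers):
    """Return the most common answer and its share of all sampled answers.

    Assumption: confidence is the majority-vote share. The abstract does
    not define how confidence is measured.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    label, count = Counter(answers).most_common(1)[0]
    return label, count / len(answers)


def route(answers, low=LOW_CONF, high=HIGH_CONF):
    """Assign a prompt to one of the three regimes named in the abstract.

    "high"   -> bootstrap the verifier, apply an exploration-enhancing reward
    "low"    -> let the verifier select the pseudo-label
    "medium" -> bypass verification
    """
    label, conf = majority_confidence(answers)
    if conf >= high:
        regime = "high"
    elif conf < low:
        regime = "low"
    else:
        regime = "medium"
    return label, conf, regime
```

For example, route(["7", "7", "8", "9"]) places a prompt whose majority answer holds half of the samples in the "medium" regime under these assumed thresholds. What happens inside each regime (the verifier, the rewards) is left out because the abstract gives no details.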

