Fugu-MT Paper Translation (Abstract): Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

Paper abstract: Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

arxiv url: http://arxiv.org/abs/2606.09380v1
Date: Mon, 08 Jun 2026 11:57:17 GMT
Status: Translation complete
System update date: 2026-06-09 14:42:06.95568
Title: Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short
Authors: Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang,
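Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models. The paper proposes Reasoning Arena, an adaptive training framework that routes non-diverse reward groups to a judge system. Reasoning Arena outperforms the RLVR baseline by 7.6% on average on competition mathematics and coding benchmarks.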
Reference score (independently computed attention score): 51.667769734342635
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
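The two mechanisms the abstract describes, zero advantage under identical rewards and Bradley-Terry scoring from an incomplete comparison graph, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a GRPO-style standardization for the group-relative advantage, a standard minorization-maximization (MM) solver for the Bradley-Terry fit, and a small virtual-comparison prior (the hypothetical `prior` parameter) so that a trace with no wins keeps a finite score. The function names and the comparison list are hypothetical, and the judge that would produce the win/loss outcomes is not modeled.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one prompt's group (assumed GRPO-style form)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def fit_bradley_terry(n_items, comparisons, n_iters=200, prior=0.1):
    """Fit Bradley-Terry log-strengths from sparse (winner, loser) pairs.

    Uses MM updates. Each item also gets `prior` virtual wins and `prior`
    virtual losses against a dummy opponent of fixed strength 1. This
    regularizer is an assumption of the sketch, not taken from the paper.
    """
    wins = [0.0] * n_items
    for winner, _ in comparisons:
        wins[winner] += 1.0

    p = [1.0] * n_items
    for _ in range(n_iters):
        # Each real comparison (i, j) contributes 1 / (p_i + p_j) to both items.
        denom = [0.0] * n_items
        for winner, loser in comparisons:
            inv = 1.0 / (p[winner] + p[loser])
            denom[winner] += inv
            denom[loser] += inv
        # Add the two virtual comparisons against the dummy, then update.
        p = [
            (wins[i] + prior) / (denom[i] + 2.0 * prior / (p[i] + 1.0))
            for i in range(n_items)
        ]
    return [math.log(x) for x in p]


# All four traces get the same verifiable reward: every advantage is zero.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

# Hypothetical judge outcomes on 4 of the 6 possible pairs (incomplete graph).
comparisons = [(0, 1), (0, 2), (1, 2), (2, 3)]
scores = fit_bradley_terry(4, comparisons)

# Standardizing the fitted scores yields a non-degenerate training signal.
advantages = group_relative_advantages(scores)
print(flat)
print(advantages)
```

The sketch takes the comparison list as given. How the anchor pool is selected and updated, and how the fitted scores enter the RL objective, are left to the paper.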

Related papers