Fugu-MT Paper Translation (Abstract): Poly-EPO: Training Exploratory Reasoning Models

Paper overview: Poly-EPO: Training Exploratory Reasoning Models

arxiv url: http://arxiv.org/abs/2604.17654v1
Date: Sun, 19 Apr 2026 22:54:19 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.621599
Title: Poly-EPO: Training Exploratory Reasoning Models
Authors: Ifdita Hasan Orney, Jubayer Ibn Hamid, Shreya S Ramanujam, Shirley Wu, Hengyuan Hu, Noah Goodman, Dorsa Sadigh, Chelsea Finn,
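Abstract summary: This paper presents a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. It proposes Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation.
Reference score (independently computed attention score): 62.82992914206963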
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Exploration is a cornerstone of learning from experience: it enables agents to find solutions to complex problems, generalize to novel ones, and scale performance with test-time compute. In this paper, we present a framework for post-training language models (LMs) that explicitly encourages optimistic exploration and promotes a synergy between exploration and exploitation. The central idea is to train the LM to generate sets of responses that are collectively accurate under the reward function and exploratory in their reasoning strategies. We first develop a general recipe for optimizing LMs with set reinforcement learning (set RL) under arbitrary objective functions, showing how standard RL algorithms can be adapted to this setting through a modification to the advantage computation. We then propose Polychromic Exploratory Policy Optimization (Poly-EPO), which instantiates this framework with an objective that explicitly synergizes exploration and exploitation. Across a range of reasoning benchmarks, we show that Poly-EPO improves generalization, as evidenced by higher pass@$k$ coverage, preserves greater diversity in model generations, and effectively scales with test-time compute.
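For readers unfamiliar with the pass@$k$ metric mentioned in the abstract, here is a minimal sketch of one widely used way to estimate it: draw n responses per problem, count the c correct ones, and compute the probability that a random size-k subset contains at least one correct response. The abstract does not say which estimator the authors use, so this choice is an assumption, and the function name `pass_at_k` is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled responses, c of which are correct.

    Illustrative helper (not from the paper): returns the probability that
    at least one of k responses drawn without replacement is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # With fewer than k incorrect responses, every size-k subset
    # necessarily contains a correct one.
    if n - c < k:
        return 1.0
    # 1 - (number of all-incorrect subsets) / (number of all subsets)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this quantity over all problems in a benchmark gives a benchmark-level pass@$k$ score; higher values at large k indicate that the sampled responses cover more of the solvable problems.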

Related papers