Fugu-MT Paper Translation (Abstract): RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM-Generated Reward Hypotheses

Paper abstract: RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM-Generated Reward Hypotheses

arxiv url: http://arxiv.org/abs/2604.28056v1
Date: Thu, 30 Apr 2026 16:01:51 GMT
Status: Translation complete
System update date: 2026-05-01 16:31:54.189538
Title: RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM-Generated Reward Hypotheses
Authors: Feiyu Wu, Xu Zheng, Zhuocheng Wang, Yi ming Dai, Hui Li
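Abstract summary: Large language models (LLMs) make reward design in reinforcement learning substantially more scalable, but generated rewards are not automatically reliable training objectives. The paper studies the deployment-time problem of when generated rewards can be verified and deployed, treating them as reward hypotheses whose utility depends on the competence of the current policy. It proposes RHyVE, a competence-aware verification and phase-aware deployment protocol.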
Reference score (independently computed attention measure): 7.123785374544969
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) make reward design in reinforcement learning substantially more scalable, but generated rewards are not automatically reliable training objectives. Existing work has focused primarily on generating, evolving, or selecting reward candidates, while paying less attention to when such candidates can be verified and deployed during policy optimization. We study this deployment-time problem by treating generated rewards as reward hypotheses whose utility depends on the competence of the current policy and the phase of training. We propose RHyVE, a competence-aware verification and phase-aware deployment protocol that compares small sets of reward hypotheses from shared policy checkpoints using short-horizon fork verification. Our experiments show that reward rankings are unreliable at low competence but become informative after task-dependent thresholds. On a sparse manipulation task, phase-aware deployment improves peak and retained performance under a locked protocol. Updated LLM-generated reward-candidate experiments show candidate-family-dependent behavior: generated pools can exhibit phase-dependent winner changes, but no fixed warm-up schedule is universally optimal. Held-out schedule selection, conservative selector baselines, compute-matched controls, and scale controls further show that RHyVE is best understood as a verification-informed deployment protocol rather than a universal scheduler. Dense and all-failure boundary experiments delimit the scope of the method. Together, these results suggest that reward generation and reward deployment should be studied as coupled problems: generated rewards must be verified and deployed under changing policy competence.


Related papers