Fugu-MT Paper Translation (Summary): Learning to Hint for Reinforcement Learning

Paper summary: Learning to Hint for Reinforcement Learning

arxiv url: http://arxiv.org/abs/2604.00698v1
Date: Wed, 01 Apr 2026 09:58:08 GMT
Status: Translation complete
Last updated in system: 2026-04-02 16:44:31.932634
Title: Learning to Hint for Reinforcement Learning
Title (reference translation): Learning to Hint for Reinforcement Learning
Authors: Yu Xia, Canwen Xu, Zhewei Yao, Julian McAuley, Yuxiong He
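Abstract summary: Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards. GRPO often suffers from advantage collapse when all rollouts in a group receive the same reward. The paper proposes Hint Learning for Reinforcement Learning (HiLL).
Reference score (independently computed attention score): 51.46328710610512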
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when all rollouts in a group receive the same reward, the group yields zero relative advantage and thus no learning signal. For example, if a question is too hard for the reasoner, all sampled rollouts can be incorrect and receive zero reward. Recent work addresses this issue by adding hints or auxiliary scaffolds to such hard questions so that the reasoner produces mixed outcomes and recovers a non-zero update. However, existing hints are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. To this end, we propose Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online conditioned on the current reasoner's incorrect rollout, allowing hint generation to adapt to the reasoner's evolving errors. We further introduce hint reliance, which measures how strongly correct hinted trajectories depend on the hint. We derive a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and we use this result to define a transfer-weighted reward for training the hinter. Therefore, HiLL favors hints that not only recover informative GRPO groups, but also produce signals that are more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive and transfer-aware hint learning for RL. The code is available at https://github.com/Andree-9/HiLL.
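Abstract (reference translation): Group Relative Policy Optimization (GRPO) is widely used for reinforcement learning with verifiable rewards, but it often suffers from advantage collapse: when every rollout in a group receives the same reward, the group has zero relative advantage and contributes no learning signal. For example, if a question is too hard for the reasoner, every sampled rollout may be incorrect and receive zero reward. Recent work addresses this by adding hints or auxiliary scaffolds to such hard questions, so that the reasoner produces mixed outcomes and a non-zero update is recovered. Existing hints, however, are usually fixed rather than adapted to the current reasoner, and a hint that creates learning signal under the hinted input does not necessarily improve the no-hint policy used at test time. The paper therefore proposes Hint Learning for Reinforcement Learning (HiLL), a framework that jointly trains a hinter policy and a reasoner policy during RL. For each hard question, the hinter generates hints online, conditioned on an incorrect rollout from the current reasoner, so that hint generation adapts to the reasoner's evolving errors. The paper also introduces hint reliance, which measures how strongly correct hinted trajectories depend on the hint. It derives a transferability result showing that lower hint reliance implies stronger transfer from hinted success to no-hint success, and uses this result to define a transfer-weighted reward for training the hinter. HiLL thus favors hints that not only recover informative GRPO groups but also produce signals more likely to improve the original no-hint policy. Experiments across multiple benchmarks show that HiLL consistently outperforms GRPO and prior hint-based baselines, demonstrating the value of adaptive, transfer-aware hint learning for RL. The code is available at https://github.com/Andree-9/HiLL.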
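
The advantage collapse described in the abstract can be illustrated with a small sketch. It assumes the commonly used GRPO-style normalization, in which each rollout's reward is standardized against the mean and standard deviation of its own group; the abstract does not state the exact formula used in the paper, and the function names, the `eps` constant, and the reward values below are hypothetical and not taken from the HiLL code.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward against its own group.

    Assumed formulation: A_i = (r_i - mean(r)) / (std(r) + eps).
    `eps` is an illustrative constant that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards):
    """A group is informative only if its rollouts received different rewards."""
    return len(set(rewards)) > 1


# Hypothetical hard question: every rollout is wrong, so every reward is 0.
hard_group = [0.0, 0.0, 0.0, 0.0]
# Hypothetical hinted version of the same question: outcomes are mixed.
mixed_group = [0.0, 1.0, 0.0, 1.0]

collapsed = group_relative_advantages(hard_group)
informative = group_relative_advantages(mixed_group)

print(collapsed)    # every advantage is zero: no gradient signal
print(informative)  # correct rollouts get positive advantage, incorrect ones negative
```

In this sketch the all-incorrect group yields only zero advantages, which is the case hints are meant to repair. A group in which every rollout is correct collapses in the same way, since only the spread of rewards within the group matters.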


Related papers