Fugu-MT Paper Translation (Abstract): Label-Free Reinforcement Learning via Cross-Model Entropy

Paper summary: Label-Free Reinforcement Learning via Cross-Model Entropy

arxiv url: http://arxiv.org/abs/2605.29009v1
Date: Wed, 27 May 2026 19:04:35 GMT
Status: Translation complete
System update date: 2026-05-30 02:45:55.33369
Title: Label-Free Reinforcement Learning via Cross-Model Entropy
Authors: Matt Gorbett, Hossein Shirazi
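Abstract summary: Post-training large language models with reinforcement learning is bottlenecked by the reward signal. The paper proposes Cross-Model Entropy (CME) as a label-free reward signal for RL post-training. CME is continuous, training-free, and grounded in the principle that responses a verifier finds unsurprising are likely correct or high quality.
Reference score (independently computed attention score): 4.404496835736175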
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, restricting training to domains with automatic correctness checks (e.g., mathematics, code execution), or human preference labels, which are expensive to collect and prone to reward hacking. Recent label-free methods replace ground-truth verifiers with self-referential signals like majority voting or token entropy over a model's own outputs, but risk reinforcing a model's own errors. In this work we propose Cross-Model Entropy (CME), the mean log-likelihood of a generator's response under a separate verifier model, as a label-free reward signal for RL post-training. CME is continuous, training-free, and grounded in the principle that responses a verifier finds unsurprising are likely correct or high quality. Because the verifier is independent of the generator, the signal cannot be gamed through self-consistency. We integrate CME into GRPO with no other changes to the training loop, extending label-free RL to open-ended instruction following -- a regime where self-referential signals are inapplicable or poorly suited. On open-ended instruction following (UltraFeedback prompts, evaluated on AlpacaEval 2.0), CME rewards beat the untrained base in head-to-head LLM-as-Judge comparisons across four model families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), with tie-adjusted win rates ranging from 52.5% to 71.4%. Code will be released upon publication.
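The abstract defines CME as the mean log-likelihood of a generator's response under a separate verifier model, and says it is plugged into GRPO as the reward. The following is a minimal sketch of that idea, not the authors' implementation (their code is not yet released). It assumes the verifier's per-token log-probabilities for each response are already available as plain lists of floats, and it uses the commonly described GRPO group normalization (reward minus group mean, divided by group standard deviation). The function names `cme_reward` and `group_relative_advantages`, the `eps` parameter, and the toy numbers are all hypothetical.

```python
from statistics import mean, pstdev


def cme_reward(verifier_token_logprobs):
    """Mean log-likelihood of one response under the verifier.

    Assumption: the caller has already scored the generator's response
    with a separate verifier model and passes one log-probability per
    response token.
    """
    if not verifier_token_logprobs:
        raise ValueError("response has no tokens")
    return sum(verifier_token_logprobs) / len(verifier_token_logprobs)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization within one group of sampled responses.

    Each reward is centered on the group mean and scaled by the group
    (population) standard deviation; eps guards against division by zero
    when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group: three sampled responses to the same prompt, with made-up
# verifier log-probabilities per token (illustrative values only).
group_logprobs = [
    [-0.2, -0.1, -0.3],        # verifier finds this response unsurprising
    [-1.5, -2.0, -1.0, -1.5],  # verifier finds this response surprising
    [-0.8, -0.6],
]

rewards = [cme_reward(lp) for lp in group_logprobs]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.3f} advantage={a:+.3f}")
```

Averaging over tokens rather than summing keeps the reward comparable across responses of different lengths, which matches the "mean log-likelihood" wording in the abstract. In a real training loop the log-probabilities would come from a forward pass of the verifier over the prompt and response.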


Related papers