Fugu-MT paper translation (abstract): RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

Paper overview: RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

arxiv url: http://arxiv.org/abs/2509.21319v1
Date: Thu, 25 Sep 2025 16:19:06 GMT
Status: Translation complete
Last updated in system: 2025-09-26 20:58:13.033542
Title: RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
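Title (reference translation): RLBFF: Binary Flexible Feedback as a Bridge Between Human Feedback and Verifiable Rewards
Authors: Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev
Abstract summary: The paper proposes Reinforcement Learning with Binary Flexible Feedback (RLBFF). RLBFF combines the versatility of human-driven preferences with the precision of rule-based verification. Reward Models trained in this manner are shown to outperform Bradley-Terry models when matched for data.
Reference score (attention score computed independently by this site): 29.53129965767002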
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost).
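Abstract (reference translation): Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, and each has clear advantages. However, RLHF relies on human judgments that usually lack explicit criteria, so it struggles with interpretability and reward hacking, while RLVR is limited in scope because it focuses on correctness-based verifiers. This paper proposes Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification so that reward models can capture subtle aspects of response quality beyond mere correctness. RLBFF extracts, from natural language feedback, principles that can be answered in binary form (for example, accuracy of information: yes, or code readability: no). These principles can then ground Reward Model training as an entailment task (the response does or does not satisfy an arbitrary principle). Reward Models trained this way are shown to outperform Bradley-Terry models when matched for data, and to reach top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on the leaderboard as of September 24, 2025). In addition, unlike with Bradley-Terry models, users can specify principles of interest at inference time to customize the focus of the reward model. Finally, the paper presents a fully open source recipe (including data) for aligning Qwen3-32B with RLBFF and the Reward Model, matching or exceeding the performance of o3-mini and DeepSeek R1 on the general alignment benchmarks MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost).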
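
The contrast the abstract draws between Bradley-Terry reward models and principle-grounded entailment can be sketched in a few lines of Python. This is a minimal illustration only, not the authors' implementation: the abstract does not state the training loss, so the binary cross-entropy objective, the `PrincipleExample` schema, and the mean aggregation in `reward_for_principles` are all assumptions introduced here, and the logits stand in for the output of a real reward model.

```python
import math
from dataclasses import dataclass


@dataclass
class PrincipleExample:
    """One binary-feedback example (hypothetical schema, not from the paper)."""
    prompt: str
    response: str
    principle: str   # e.g. "accuracy of information"
    satisfied: bool  # yes/no label extracted from natural language feedback


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Standard Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)."""
    return -math.log(sigmoid(score_chosen - score_rejected))


def entailment_loss(logit_yes: float, satisfied: bool) -> float:
    """Binary cross-entropy on 'does the response satisfy the principle?'.

    ASSUMPTION: the abstract only says training is framed as an entailment
    task; cross-entropy is one plausible objective for that framing.
    """
    p_yes = sigmoid(logit_yes)
    return -math.log(p_yes) if satisfied else -math.log(1.0 - p_yes)


def reward_for_principles(logits_by_principle: dict[str, float],
                          principles: list[str]) -> float:
    """Average P(yes) over the principles a user selects at inference time.

    ASSUMPTION: mean aggregation is a placeholder; the abstract does not say
    how multiple principles are combined.
    """
    if not principles:
        raise ValueError("at least one principle is required")
    probs = [sigmoid(logits_by_principle[p]) for p in principles]
    return sum(probs) / len(probs)


# Usage: the same response scored under different user-chosen principles.
example = PrincipleExample(
    prompt="Write a sort function.",
    response="def s(a): return sorted(a)",
    principle="code readability",
    satisfied=False,
)
logits = {"accuracy of information": 2.0, "code readability": -1.0}
accuracy_only = reward_for_principles(logits, ["accuracy of information"])
readability_only = reward_for_principles(logits, ["code readability"])
loss = entailment_loss(logits[example.principle], example.satisfied)
```

The design point is that a Bradley-Terry model yields one scalar per response and is trained only on which of two responses was preferred, whereas the principle-conditioned scorer takes the principle as an input, so the caller can change what is being judged without retraining.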
