Fugu-MT Paper Translation (Abstract): BPPO: Binary Prefix Policy Optimization for Efficient GRPO-Style Reasoning RL with Concise Responses

Paper overview: BPPO: Binary Prefix Policy Optimization for Efficient GRPO-Style Reasoning RL with Concise Responses

arxiv url: http://arxiv.org/abs/2605.28028v1
Date: Wed, 27 May 2026 06:34:17 GMT
Status: Translation complete
System update date: 2026-05-28 17:38:55.804108
Title: BPPO: Binary Prefix Policy Optimization for Efficient GRPO-Style Reasoning RL with Concise Responses
Title (reference translation): BPPO: Binary Prefix Policy Optimization for Efficient GRPO-Style Reasoning RL with Concise Responses
Authors: Qingfei Zhao, Huan Song, Shuyu Tian, Jiawei Shao, Xuelong Li
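Abstract summary: We study whether all completions provide equally useful update signals in GRPO-style reasoning RL. Our gradient-similarity analysis shows that, within the same prompt group, same-class completions often induce highly similar update directions. We propose Binary Prefix Policy Optimization (BPPO), which uses the shortest correct completion and the shortest incorrect completion as a compact update unit.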
Reference score (independently computed attention level): 48.550535291129584
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Group Relative Policy Optimization (GRPO) is widely used for training reasoning models, but updating all sampled completions in each group incurs substantial cost and can reinforce verbose reasoning trajectories. In this paper, we study whether all completions provide equally useful update signals in GRPO-style reasoning RL. Our gradient-similarity analysis shows that, within the same prompt group, same-class completions often induce highly similar update directions, whereas correct-incorrect pairs provide more distinct contrastive signals. Motivated by this observation, we propose Binary Prefix Policy Optimization (BPPO), which uses the shortest correct completion and the shortest incorrect completion as a compact update unit while preserving full-group advantage normalization. BPPO further improves efficiency with adaptive completion scheduling and prefix-focused optimization; by updating only response prefixes, it avoids reinforcing redundant suffixes and encourages more concise responses. Experiments on GSM8K, MATH, and Geo3K show that BPPO achieves up to 6.08x speedup over GRPO while maintaining competitive accuracy, and reduces mean response length by approximately 30-50% without modifying the reward with an explicit length penalty.
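Abstract (reference translation): Group Relative Policy Optimization (GRPO) is widely used to train reasoning models, but updating every completion sampled in each group is costly and can reinforce verbose reasoning trajectories. This paper examines whether all completions provide equally useful update signals in GRPO-style reasoning RL. Our gradient-similarity analysis shows that, within the same prompt group, same-class completions often induce highly similar update directions, whereas correct-incorrect pairs provide more distinct contrastive signals. Motivated by this observation, we propose Binary Prefix Policy Optimization (BPPO), which uses the shortest correct completion and the shortest incorrect completion as a compact update unit while preserving full-group advantage normalization. BPPO further improves efficiency through adaptive completion scheduling and prefix-focused optimization; by updating only response prefixes, it avoids reinforcing redundant suffixes and encourages more concise responses. In experiments on GSM8K, MATH, and Geo3K, BPPO achieves up to 6.08x speedup over GRPO while maintaining competitive accuracy, and reduces mean response length by approximately 30-50% without adding an explicit length penalty to the reward.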
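The selection step described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `r > 0` correctness threshold, and the `prefix_ratio` parameter are assumptions made for illustration, and the abstract does not say how the prefix length is chosen or how adaptive completion scheduling works (the latter is omitted here). The sketch only shows advantages standardized over the full group, followed by picking the shortest correct and shortest incorrect completion.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: rewards standardized over the FULL group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shortest(indices, lengths):
    """Index of the shortest completion among `indices`, or None if empty."""
    return min(indices, key=lambda i: lengths[i]) if indices else None


def select_binary_pair(rewards, lengths):
    """Pick (shortest correct, shortest incorrect) completion indices.

    Assumption: a completion counts as correct when its reward is > 0.
    """
    correct = [i for i, r in enumerate(rewards) if r > 0]
    incorrect = [i for i, r in enumerate(rewards) if r <= 0]
    return shortest(correct, lengths), shortest(incorrect, lengths)


def prefix_mask(length, prefix_ratio=0.5):
    """1 for prefix tokens that would be updated, 0 for the suffix.

    Assumption: a fixed ratio; the abstract does not specify the rule.
    """
    keep = min(length, max(1, int(length * prefix_ratio)))
    return [1] * keep + [0] * (length - keep)


# Hypothetical group of four sampled completions for one prompt.
rewards = [1, 0, 1, 0]
lengths = [120, 80, 90, 200]

adv = group_advantages(rewards)                 # normalized over all four
pos, neg = select_binary_pair(rewards, lengths)
update_unit = [(pos, adv[pos]), (neg, adv[neg])]  # only these two are updated
```

Note that the advantages are computed before selection, so the two retained completions keep the values they would have had under full-group normalization rather than being re-normalized as a pair.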


Related papers