Fugu-MT Paper Translation (Abstract): FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Paper summary: FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

arxiv url: http://arxiv.org/abs/2603.19835v2
Date: Tue, 24 Mar 2026 03:56:41 GMT
Status: translation complete
Last updated in system: 2026-03-25 12:42:17.583011
Title: FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization
Authors: Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou
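Abstract summary: We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. FIPO incorporates discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0%.
Reference score (attention score computed independently by this site): 84.58281577727566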
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present Future-KL Influenced Policy Optimization (FIPO), a reinforcement learning algorithm designed to overcome reasoning bottlenecks in large language models. While GRPO-style training scales effectively, it typically relies on outcome-based rewards (ORM) that distribute a global advantage uniformly across every token in a trajectory. We argue that this coarse-grained credit assignment imposes a performance ceiling by failing to distinguish critical logical pivots from trivial tokens. FIPO addresses this by incorporating discounted future-KL divergence into the policy update, creating a dense advantage formulation that re-weights tokens based on their influence on subsequent trajectory behavior. Empirically, FIPO enables models to break through the length stagnation seen in standard baselines. Evaluated on Qwen2.5-32B, FIPO extends the average chain-of-thought length from roughly 4,000 to over 10,000 tokens and increases AIME 2024 Pass@1 accuracy from 50.0% to a peak of 58.0% (converging at approximately 56.0%). This outperforms both DeepSeek-R1-Zero-Math-32B (around 47.0%) and o1-mini (approximately 56.0%). Our results suggest that establishing dense advantage formulations is a vital path for evolving ORM-based algorithms to unlock the full reasoning potential of base models. We open-source our training system, built on the verl framework.
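
Illustrative sketch (not from the paper): the abstract contrasts a uniform, outcome-based advantage with a dense, per-token one. The Python below shows that contrast under loudly stated assumptions. The group-normalized advantage follows the commonly described GRPO recipe. Everything in dense_token_advantages is hypothetical: using the per-token log-probability ratio as a KL proxy, the discount gamma, and the exp-and-clip mapping from future signal to weight are placeholders chosen for illustration, since the abstract does not give FIPO's actual formula.

```python
import math
from statistics import mean, pstdev


def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style outcome advantage: one scalar per sampled response,
    normalized within the group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def discounted_future_sum(values, gamma):
    """s_t = sum over k >= t of gamma**(k - t) * values[k], right to left."""
    out = [0.0] * len(values)
    running = 0.0
    for t in range(len(values) - 1, -1, -1):
        running = values[t] + gamma * running
        out[t] = running
    return out


def dense_token_advantages(advantage, logp_new, logp_old,
                           gamma=0.9, clip=(0.5, 2.0)):
    """HYPOTHETICAL dense advantage: scale the shared outcome advantage
    by a per-token weight derived from a discounted future signal.
    The KL proxy, gamma, and the exp/clip mapping are assumptions."""
    # Per-token log-ratio between current and old policy (assumed KL proxy).
    log_ratio = [n - o for n, o in zip(logp_new, logp_old)]
    future = discounted_future_sum(log_ratio, gamma)
    lo, hi = clip
    weights = [min(max(math.exp(f), lo), hi) for f in future]
    return [advantage * w for w in weights]


if __name__ == "__main__":
    # Four sampled responses with binary outcome rewards.
    advs = group_normalized_advantages([1.0, 0.0, 0.0, 1.0])
    # Uniform baseline: every token of response 0 shares advs[0].
    # Dense variant: tokens get different weights when the policy shifts.
    dense = dense_token_advantages(
        advs[0],
        logp_new=[-1.0, -0.5, -2.0],
        logp_old=[-1.2, -0.5, -1.9],
    )
    print(advs)
    print(dense)
```

When the new and old log-probabilities are identical, every weight is 1 and the sketch reduces to the uniform outcome advantage, which makes the re-weighting easy to compare against the baseline.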

Related papers