Fugu-MT Paper Translation (Abstract): Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

Paper overview: Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning

arxiv url: http://arxiv.org/abs/2508.09726v1
Date: Wed, 13 Aug 2025 11:43:49 GMT
Status: Translation complete
Last updated in system: 2025-08-14 20:42:00.873113
Title: Sample More to Think Less: Group Filtered Policy Optimization for Concise Reasoning
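Title (reference translation): Group Filtered Policy Optimization for Concise Reasoning
Authors: Vaishnavi Shrivastava, Ahmed Awadallah, Vidhisha Balachandran, Shivam Garg, Harkirat Behl, Dimitris Papailiopoulos
Abstract summary: Group Filtered Policy Optimization (GFPO) curbs the response-length explosion by sampling larger groups per problem during training. GFPO cuts GRPO's length inflation by 46-71% across challenging STEM and coding benchmarks. Adaptive Difficulty GFPO allocates more training resources to harder problems based on real-time difficulty estimates.
Reference score (attention measure computed by this site): 7.260825775935882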
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models trained with reinforcement learning with verifiable rewards tend to trade accuracy for length--inflating response lengths to achieve gains in accuracy. While longer answers may be warranted for harder problems, many tokens are merely "filler": repetitive, verbose text that makes no real progress. We introduce GFPO (Group Filtered Policy Optimization), which curbs this length explosion by sampling larger groups per problem during training and filtering responses to train on based on two key metrics: (1) response length and (2) token efficiency: reward per token ratio. By sampling more at training time, we teach models to think less at inference time. On the Phi-4-reasoning model, GFPO cuts GRPO's length inflation by 46-71% across challenging STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while maintaining accuracy. Optimizing for reward per token further increases reductions in length inflation to 71-85%. We also propose Adaptive Difficulty GFPO, which dynamically allocates more training resources to harder problems based on real-time difficulty estimates, improving the balance between computational efficiency and accuracy especially on difficult questions. GFPO demonstrates that increased training-time compute directly translates to reduced test-time compute--a simple yet effective trade-off for efficient reasoning.
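Abstract (reference translation): Large language models trained with reinforcement learning with verifiable rewards tend to trade accuracy for length, inflating response lengths to gain accuracy. Longer answers may be warranted for harder problems, but many tokens are merely "filler". GFPO (Group Filtered Policy Optimization) curbs this length explosion by sampling larger groups per problem during training and filtering the responses to train on based on two key metrics: (1) response length and (2) token efficiency, the reward-per-token ratio. By sampling more at training time, the model is taught to think less at inference time. On the Phi-4-reasoning model, GFPO cuts GRPO's length inflation by 46-71% on STEM and coding benchmarks (AIME 24/25, GPQA, Omni-MATH, LiveCodeBench) while maintaining accuracy. Optimizing for reward per token further increases the reduction in length inflation to 71-85%. We also propose Adaptive Difficulty GFPO, which dynamically allocates more training resources to harder problems based on real-time difficulty estimates, improving the balance between computational efficiency and accuracy, especially on difficult questions. GFPO demonstrates that increased training-time compute translates directly into reduced test-time compute.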
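
The filtering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract states only that a larger group is sampled per problem and that the responses to train on are selected by response length or by reward per token. Everything else here is an assumption made for illustration: the function name `gfpo_advantages`, the parameter `k` (the number of responses retained), the GRPO-style mean/std normalization computed over the retained subset, and the choice to give rejected responses an advantage of zero.

```python
from statistics import mean, pstdev


def gfpo_advantages(lengths, rewards, k, metric="length", eps=1e-6):
    """Hypothetical sketch of group filtering for one problem.

    lengths: token counts of the sampled responses (all > 0)
    rewards: scalar rewards of the same responses
    k:       number of responses to retain (assumed hyperparameter)
    metric:  "length" keeps the k shortest responses;
             "token_efficiency" keeps the k with highest reward per token
    Returns one advantage per response; rejected responses get 0.0.
    """
    if len(lengths) != len(rewards):
        raise ValueError("lengths and rewards must have the same size")
    n = len(lengths)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and the group size")
    if any(length <= 0 for length in lengths):
        raise ValueError("response lengths must be positive")

    if metric == "length":
        def sort_key(i):
            return lengths[i]  # shorter responses rank first
    elif metric == "token_efficiency":
        def sort_key(i):
            return -rewards[i] / lengths[i]  # higher reward per token ranks first
    else:
        raise ValueError(f"unknown metric: {metric}")

    # Keep the k best-ranked responses of the sampled group.
    kept = sorted(range(n), key=sort_key)[:k]

    # Assumption: normalize rewards within the retained subset only,
    # in the style of a group-relative advantage.
    kept_rewards = [rewards[i] for i in kept]
    mu = mean(kept_rewards)
    sigma = pstdev(kept_rewards)

    advantages = [0.0] * n  # rejected responses contribute no gradient signal
    for i in kept:
        advantages[i] = (rewards[i] - mu) / (sigma + eps)
    return advantages


if __name__ == "__main__":
    lengths = [100, 400, 200, 800]
    rewards = [1.0, 1.0, 0.0, 1.0]
    print(gfpo_advantages(lengths, rewards, k=2, metric="length"))
    print(gfpo_advantages(lengths, rewards, k=2, metric="token_efficiency"))
```

In this toy group, length filtering keeps the 100-token and 200-token responses, so the long correct answers receive no training signal. Token-efficiency filtering instead keeps the two responses with the highest reward per token. Under these assumptions, a larger sampled group gives the filter more short or efficient candidates to choose from, which is the training-time cost the abstract trades for shorter outputs at inference time.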

Related papers