Fugu-MT Paper Translation (Abstract): Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models

Paper overview: Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models

arxiv url: http://arxiv.org/abs/2510.03805v1
Date: Sat, 04 Oct 2025 13:24:26 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.26237
Title: Beyond Token Length: Step Pruner for Efficient and Accurate Reasoning in Large Language Models
Authors: Canhui Wu, Qiong Cao, Chang Li, Zhenfang Wang, Chao Xue, Yuwei Fan, Wei Xi, Xiaodong He
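Abstract summary: Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks but often suffer from excessive verbosity. We introduce Step Pruner (SP), an RL framework that steers LRMs toward more efficient reasoning by favoring compact reasoning steps. Our step-aware reward function prioritizes correctness while penalizing redundant steps, and withholds rewards for incorrect responses to prevent the reinforcement of erroneous reasoning.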
Reference score (independently computed attention score): 26.88030285500965
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Reasoning Models (LRMs) demonstrate strong performance on complex tasks but often suffer from excessive verbosity, known as "overthinking." Existing solutions via reinforcement learning (RL) typically penalize generated tokens to promote conciseness. However, these methods encounter two challenges: responses with fewer tokens do not always correspond to fewer reasoning steps, and models may develop hacking behavior in later stages of training by discarding reasoning steps to minimize token usage. In this work, we introduce Step Pruner (SP), an RL framework that steers LRMs toward more efficient reasoning by favoring compact reasoning steps. Our step-aware reward function prioritizes correctness while imposing penalties for redundant steps, and withholds rewards for incorrect responses to prevent the reinforcement of erroneous reasoning. Moreover, we propose a dynamic stopping mechanism: when the length of any output step exceeds the upper limit, we halt updates to prevent hacking behavior caused by merging steps. Extensive experiments across four reasoning benchmarks demonstrate that SP achieves state-of-the-art accuracy while significantly reducing response length. For instance, on AIME24, SP reduces token usage by 69.7%.
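The abstract gives no formulas, so the following is only a minimal sketch of the two ideas it names: a step-aware reward and a dynamic stopping check. Every concrete choice is an assumption made for illustration and is not taken from the paper: steps are delimited by blank lines, step length is counted in whitespace-separated words, the penalty is linear in the step count, and the names `split_steps`, `step_aware_reward`, `should_halt_updates`, `step_penalty`, `min_correct_reward`, and `max_step_tokens` are hypothetical.

```python
def split_steps(response: str) -> list[str]:
    """Split a response into reasoning steps.

    Assumption: steps are separated by blank lines. The paper may
    segment steps differently.
    """
    return [s.strip() for s in response.split("\n\n") if s.strip()]


def step_aware_reward(
    response: str,
    is_correct: bool,
    step_penalty: float = 0.05,       # hypothetical per-step penalty
    min_correct_reward: float = 0.1,  # hypothetical floor for correct answers
) -> float:
    """Reward correct answers, with a penalty that grows with step count.

    Incorrect responses get no reward at all, so a short but wrong
    answer is never reinforced. The floor keeps every correct answer
    strictly above every incorrect one (correctness comes first).
    """
    if not is_correct:
        return 0.0
    n_steps = len(split_steps(response))
    return max(min_correct_reward, 1.0 - step_penalty * n_steps)


def should_halt_updates(responses: list[str], max_step_tokens: int) -> bool:
    """Return True if any single step exceeds the length limit.

    Assumption: length is measured in whitespace-separated words. An
    overlong step suggests the model is merging steps to lower its
    step count, which is the hacking behavior the check guards against.
    """
    return any(
        len(step.split()) > max_step_tokens
        for response in responses
        for step in split_steps(response)
    )
```

Under these assumptions, a correct two-step answer scores higher than a correct four-step answer, an incorrect answer scores zero regardless of its length, and a batch containing one very long merged step would trigger the halt condition.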
