Fugu-MT paper translation (abstract): Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

Paper summary: Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization

arxiv url: http://arxiv.org/abs/2508.10164v1
Date: Wed, 13 Aug 2025 20:00:09 GMT
Status: translation complete
Last updated in system: 2025-08-15 22:24:48.105065
Title: Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization
Title (reference translation): Pruning Long Chain-of-Thought of Large Reasoning Models via Small-Scale Preference Optimization
Authors: Bin Hong, Jiayu Liu, Zhenya Huang, Kai Zhang, Mengdi Zhang,
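Abstract summary: Large Reasoning Models (LRMs) show strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. Their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. This paper investigates efficient methods to reduce the generation length of LRMs.
Reference score (independently computed attention score): 26.462701299259248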
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Large Reasoning Models (LRMs) have demonstrated strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, their lengthy outputs increase computational costs and may lead to overthinking, raising challenges in balancing reasoning effectiveness and efficiency. Current methods for efficient reasoning often compromise reasoning quality or require extensive resources. This paper investigates efficient methods to reduce the generation length of LRMs. We analyze generation path distributions and filter generated trajectories through difficulty estimation. Subsequently, we analyze the convergence behaviors of the objectives of various preference optimization methods under a Bradley-Terry loss-based framework. Based on the analysis, we propose Length Controlled Preference Optimization (LCPO) that directly balances the implicit reward related to NLL loss. LCPO can effectively learn length preference with limited data and training. Extensive experiments demonstrate that our approach significantly reduces the average output length by over 50% across multiple benchmarks while maintaining the reasoning performance. Our work highlights the potential for computationally efficient approaches in guiding LRMs toward efficient reasoning.
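Abstract (reference translation): Recent progress in Large Reasoning Models (LRMs) shows strong performance on complex tasks through long Chain-of-Thought (CoT) reasoning. However, long outputs raise computational cost and can lead to overthinking, which makes it hard to balance the effectiveness and efficiency of reasoning. Current methods for efficient reasoning often degrade reasoning quality or require extensive resources. This paper examines efficient methods for reducing the generation length of LRMs. It analyzes generation path distributions and filters generated trajectories by difficulty estimation. It then analyzes the convergence behavior of the objectives of various preference optimization methods within a framework based on the Bradley-Terry loss. Building on this analysis, it proposes Length Controlled Preference Optimization (LCPO), which directly balances the implicit reward related to the NLL loss. LCPO can learn a length preference effectively with limited data and training. Experiments show that the proposed approach reduces average output length by more than 50% across multiple benchmarks while maintaining reasoning performance. The work highlights the potential of computationally efficient approaches for guiding LRMs toward efficient reasoning.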
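
For readers unfamiliar with the Bradley-Terry loss mentioned in the abstract, the following is a minimal sketch of a generic pairwise preference loss. It is not the LCPO objective, which the abstract does not specify. The function names, the beta parameter, the use of the summed token log-likelihood as the reward, and the example log-probabilities are all assumptions made for illustration.

```python
import math


def sequence_log_likelihood(token_logprobs):
    """Sum of per-token log-probabilities, i.e. the negative of the NLL loss."""
    return sum(token_logprobs)


def bradley_terry_loss(reward_chosen, reward_rejected, beta=1.0):
    """Negative log-probability that the chosen response beats the rejected one.

    Under the Bradley-Terry model,
    P(chosen beats rejected) = sigmoid(beta * (reward_chosen - reward_rejected)).
    """
    margin = beta * (reward_chosen - reward_rejected)
    # -log(sigmoid(x)) equals log(1 + exp(-x)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical per-token log-probabilities for two traces of the same problem:
# a short trace treated as preferred and a longer trace treated as rejected.
short_trace = [-0.2, -0.1, -0.3]
long_trace = [-0.2, -0.1, -0.3, -0.4, -0.2, -0.5]

loss = bradley_terry_loss(
    sequence_log_likelihood(short_trace),
    sequence_log_likelihood(long_trace),
)
print(f"pairwise loss: {loss:.4f}")
```

The loss shrinks as the reward margin in favor of the preferred (here, shorter) response grows, and equals log 2 when the two rewards are tied. Preference optimization methods differ mainly in how they define the reward fed into this loss.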

Related papers