Fugu-MT Paper Translation (Abstract): STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

Paper abstract: STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes

arxiv url: http://arxiv.org/abs/2605.13165v1
Date: Wed, 13 May 2026 08:28:05 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:27.91268
Title: STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
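Title (reference translation): STOP: Structured On-Policy Pruning of Long-Form Reasoning in Low-Data Regimes
Authors: Chenjun Xu, Zhennan Zhou, Zhan Su, Bill Howe, Lucy Lu Wang, Bingbing Wen
Abstract summary: Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking. We propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces.
Reference score (independently computed attention score): 13.293115227628775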
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also induces overthinking: models often generate low-yield reasoning that increases inference cost and latency. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. To address this, we propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. Then it maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments on DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B across GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Beyond efficiency, our analyses show that STOP induces much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of generated reasoning, and reallocates reasoning effort away from redundant verification and backtracking toward more productive exploration.
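Abstract (reference translation): Long chain-of-thought (Long CoT) reasoning improves performance on multi-step problems, but it also causes overthinking. This inefficiency is especially problematic in low-data fine-tuning regimes, where real applications adapt reasoning models with limited supervision and cannot rely on large-scale teacher distillation or heavy test-time control. We therefore propose STOP (Structured On-policy Pruning), an on-policy algorithm for analyzing and pruning long-form reasoning traces. STOP constructs self-distilled traces from the model. It then maps each trace into a structured reasoning interface through node segmentation, taxonomy annotation, and reasoning-tree construction. On top of this interface, we introduce ECN (Earliest Correct Node), which retains the shortest prefix ending at the earliest node that both functions as an answering conclusion and yields the correct final answer, removing redundant post-solution reasoning while preserving semantic continuity. Experiments with DeepSeek-R1-Distill-Qwen-7B and DeepSeek-R1-Distill-LLaMA-3-8B on GSM8K, Math 500, and AIME 2024 show that STOP reduces generated tokens by 19.4-42.4% while largely preserving accuracy in low-data fine-tuning. Our analyses further show that STOP induces a much smaller distributional shift than teacher-guided pruning, improves the structural efficiency of the generated reasoning, and shifts reasoning effort away from redundant verification and backtracking toward more productive exploration.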
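
The ECN rule described in the abstract (keep the shortest prefix ending at the earliest node that is both an answering conclusion and correct) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Node` class, the label strings such as "conclusion", the exact-match answer check, and the fallback of leaving a trace unpruned are all assumptions made for illustration, and the trace is treated as a flat list of nodes rather than the reasoning tree the paper builds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    text: str               # one segmented reasoning step
    label: str              # taxonomy label (label names here are hypothetical)
    answer: Optional[str]   # answer stated by this node, if any


def ecn_prune(nodes: List[Node], gold_answer: str) -> List[Node]:
    """Keep the shortest prefix ending at the earliest correct answering node."""
    for i, node in enumerate(nodes):
        # Assumed criterion: the node is a conclusion and its answer matches exactly.
        if node.label == "conclusion" and node.answer == gold_answer:
            return nodes[: i + 1]
    # Assumption: if no correct concluding node exists, return the trace unpruned.
    return nodes


trace = [
    Node("Let x be the number of apples per person.", "setup", None),
    Node("So x = 12 / 3 = 5.", "conclusion", "5"),          # wrong answer, skipped
    Node("Recompute: 12 / 3 = 4.", "derivation", None),
    Node("Therefore the answer is 4.", "conclusion", "4"),  # earliest correct node
    Node("Double-check: 4 * 3 = 12.", "verification", None),
    Node("Yes, the answer is 4.", "conclusion", "4"),
]

pruned = ecn_prune(trace, gold_answer="4")
for node in pruned:
    print(node.label, "|", node.text)
```

In this toy trace the verification step and the repeated conclusion after the first correct answer are dropped, which mirrors the "redundant post-solution reasoning" the abstract says ECN removes.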

Related papers