Fugu-MT paper translation (abstract): Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

Paper overview: Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models

arxiv url: http://arxiv.org/abs/2512.02185v1
Date: Mon, 01 Dec 2025 20:27:05 GMT
Status: Translation complete
Last updated in system: 2025-12-03 21:04:45.595452
Title: Think Before You Prune: Self-Reflective Structured Pruning for Reasoning Language Models
Authors: Ziyan Wang, Enmao Diao, Qi Le, Pu Wang, Guanchu Wang, Minwoo Lee, Shu-ping Yeh, Li Yang,
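Abstract summary: Reasoning LLMs (RLMs) achieve strong multi-step reasoning through chain-of-thought generation. Their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. The authors introduce RESP, a structured pruning framework that aligns pruning decisions with the model's reasoning dynamics.
Reference score (attention score computed independently by this site): 31.422773877490613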
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reasoning LLMs (RLMs) such as OpenAI o1, DeepSeek-R1, and Qwen3 deliver strong multi-step reasoning through chain-of-thought generation, but their large model sizes and lengthy decode-time outputs make them costly to deploy and unsuitable for resource-constrained settings. To reduce computing and memory cost, pruning offers a promising solution by removing unimportant parameters. However, despite their success on standard LLMs, existing pruning methods severely damage RLMs, as even moderate sparsity (e.g., 20%) can collapse accuracy and completely disrupt the model's reasoning coherence. We begin by analyzing why existing pruning pipelines fail on reasoning LLMs and find that their brittleness largely stems from a mismatch between the calibration data, the pruning objective, and the model's decode-time reasoning behavior. Our study further shows that the most reliable calibration signal comes not from human-written labels but from the model's own self-generated reasoning traces, which more accurately reflect its inference distribution. Guided by these insights, we introduce RESP, a self-reflective structured pruning framework that aligns pruning decisions with the model's reasoning dynamics through self-generated calibration, decode-only gradient-based importance estimation, and progressive regeneration that maintains calibration fidelity as sparsity increases. Experiments on Qwen3-8B demonstrate that RESP markedly outperforms existing structured pruning methods on both GSM8K and MathQA, preserving near-dense accuracy at 20-30% sparsity and substantially mitigating performance collapse at higher sparsity levels. At 40% sparsity, RESP attains 81.3% accuracy on GSM8K and 59.6% on MathQA, surpassing the strongest baselines by 66.87% and 47%, respectively.
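The abstract names three ingredients: calibration on self-generated traces, decode-only gradient-based importance, and progressive regeneration. The sketch below illustrates how these could fit together on a toy linear model. It is not the RESP implementation. The squared-error loss stands in for a language-modeling loss, "decode-only" is assumed here to mean that prompt tokens are masked out of the gradient, and all names (`decode_only_grad`, `group_importance`, `progressive_prune`, `make_calibration`) are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# One calibration token: (input features, target value, is_decode_token).
Token = Tuple[List[float], float, bool]


def decode_only_grad(weights: Sequence[float], tokens: Sequence[Token]) -> List[float]:
    """Gradient of squared error w.r.t. weights, summed over decode tokens only."""
    grad = [0.0] * len(weights)
    for x, target, is_decode in tokens:
        if not is_decode:  # assumption: prompt tokens are excluded from the loss
            continue
        pred = sum(w * xi for w, xi in zip(weights, x))
        for j, xi in enumerate(x):
            grad[j] += 2.0 * (pred - target) * xi
    return grad


def group_importance(weights: Sequence[float], grad: Sequence[float],
                     groups: Sequence[Sequence[int]]) -> List[float]:
    """First-order Taylor score per structured group: sum of |w * dL/dw|."""
    return [sum(abs(weights[j] * grad[j]) for j in g) for g in groups]


def progressive_prune(weights: Sequence[float],
                      groups: Sequence[Sequence[int]],
                      make_calibration: Callable[[List[float]], Sequence[Token]],
                      sparsity_steps: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Prune whole groups in stages, rebuilding calibration data at each stage."""
    weights = list(weights)
    pruned: set = set()
    for target in sparsity_steps:
        # Regenerate calibration tokens from the current (already pruned) model.
        tokens = make_calibration(weights)
        grad = decode_only_grad(weights, tokens)
        scores = group_importance(weights, grad, groups)
        n_total = int(round(target * len(groups)))
        alive = sorted((i for i in range(len(groups)) if i not in pruned),
                       key=lambda i: scores[i])
        for i in alive[: max(0, n_total - len(pruned))]:
            pruned.add(i)
            for j in groups[i]:
                weights[j] = 0.0  # structured: zero the entire group
    return weights, sorted(pruned)


# Toy usage: one decode token touching group 0, one prompt token touching group 1.
toy_weights = [1.0, 1.0, 0.5, 0.5]
toy_groups = [[0, 1], [2, 3]]
toy_tokens = [([1.0, 0.0, 0.0, 0.0], 0.0, True),
              ([0.0, 0.0, 1.0, 1.0], 5.0, False)]
new_weights, removed = progressive_prune(
    toy_weights, toy_groups, lambda w: toy_tokens, [0.5])
```

In this toy case the prompt token would have produced a large gradient on group 1, but because it is masked out, group 1 gets zero importance and is the one removed. In a real setting `make_calibration` would presumably sample reasoning traces from the partially pruned model rather than return a fixed list.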
