Fugu-MT Paper Translation (Abstract): SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration

Paper overview: SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration

arxiv url: http://arxiv.org/abs/2602.05499v1
Date: Thu, 05 Feb 2026 10:02:00 GMT
Status: Translation complete
Last updated in system: 2026-02-06 18:49:08.875547
Title: SDFP: Speculative Decoding with FIT-Pruned Models for Training-Free and Plug-and-Play LLM Acceleration
Authors: Hanyu Wei, Zunhai Su, Peng Lu, Chao Li, Spandan Tiwari, Ashish Sirasao, Yuhan Dong
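Abstract summary: Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation. Speculative decoding reduces latency using a lightweight draft model, but deployment is often limited by the cost and complexity of acquiring, tuning, and maintaining an effective draft model. The authors propose SDFP, a fully training-free and plug-and-play framework that builds the draft model via Fisher Information Trace (FIT)-based layer pruning of an LLM.
Reference score (independently computed attention score): 13.369324372222735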
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) underpin interactive multimedia applications such as captioning, retrieval, recommendation, and creative content generation, yet their autoregressive decoding incurs substantial latency. Speculative decoding reduces latency using a lightweight draft model, but deployment is often limited by the cost and complexity of acquiring, tuning, and maintaining an effective draft model. Recent approaches usually require auxiliary training or specialization, and even training-free methods incur costly search or optimization. We propose SDFP, a fully training-free and plug-and-play framework that builds the draft model via Fisher Information Trace (FIT)-based layer pruning of a given LLM. Using layer sensitivity as a proxy for output perturbation, SDFP removes low-impact layers to obtain a compact draft while preserving compatibility with the original model for standard speculative verification. SDFP needs no additional training, hyperparameter tuning, or separately maintained drafts, enabling rapid, deployment-friendly draft construction. Across benchmarks, SDFP delivers 1.32x-1.5x decoding speedup without altering the target model's output distribution, supporting low-latency multimedia applications.
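The two steps the abstract describes, dropping low-sensitivity layers to form a draft and then verifying draft tokens against the original model, can be illustrated with a minimal sketch. This is not the authors' implementation: the per-layer sensitivity scores are taken as given (how SDFP computes FIT is not stated here), the function names `select_draft_layers` and `verify_greedy` are hypothetical, and the verification shown is the simple greedy variant of speculative decoding, which may differ from the scheme used in the paper.

```python
from typing import List, Sequence


def select_draft_layers(sensitivity: Sequence[float], num_to_drop: int) -> List[int]:
    """Return the indices of layers kept after dropping the least sensitive ones.

    `sensitivity` is an assumed per-layer score (for example a FIT estimate);
    computing it is outside the scope of this sketch.
    """
    if not 0 <= num_to_drop < len(sensitivity):
        raise ValueError("num_to_drop must leave at least one layer")
    # Rank layers from least to most sensitive; ties are broken by layer index.
    ranked = sorted(range(len(sensitivity)), key=lambda i: (sensitivity[i], i))
    dropped = set(ranked[:num_to_drop])
    # Keep the surviving layers in their original order.
    return [i for i in range(len(sensitivity)) if i not in dropped]


def verify_greedy(draft_tokens: Sequence[int], target_tokens: Sequence[int]) -> List[int]:
    """Greedy speculative verification (an assumed, simplified variant).

    `target_tokens[i]` is the target model's greedy token at position i given
    the prefix followed by `draft_tokens[:i]`, so it has one more entry than
    `draft_tokens`.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have exactly one more entry")
    accepted: List[int] = []
    for drafted, expected in zip(draft_tokens, target_tokens):
        if drafted != expected:
            # First mismatch: emit the target's token and discard the rest.
            accepted.append(expected)
            return accepted
        accepted.append(drafted)
    # Every draft token matched, so the target's extra token comes for free.
    accepted.append(target_tokens[-1])
    return accepted


kept = select_draft_layers([0.9, 0.1, 0.5, 0.05], num_to_drop=2)
partial = verify_greedy([5, 7, 9], [5, 7, 8, 4])
full = verify_greedy([1, 2], [1, 2, 3])
```

Under greedy verification every emitted token is one the target model would have chosen itself, so the quality of the pruned draft affects only how many tokens are accepted per step, not what is generated.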


Related papers