Fugu-MT paper translation (abstract): SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

Paper overview: SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning

arxiv url: http://arxiv.org/abs/2508.06447v1
Date: Fri, 08 Aug 2025 16:42:38 GMT
Status: translation complete
Last updated in system: 2025-08-11 20:39:06.308758
Title: SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
Title (reference translation): SlimInfer: Speeding up long-context LLM inference with dynamic token pruning
Authors: Lingkun Long, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang,
Abstract summary: SlimInfer aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. SlimInfer achieves up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct.
Reference score (attention score computed independently by this site): 3.502168555273189
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, an innovative framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: As information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain their semantic integrity when excessive tokens, even including these critical ones, are pruned in hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant tokens of hidden state at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches required token blocks without complex predictors, reducing both memory usage and I/O costs. Extensive experiments show that SlimInfer can achieve up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench. Our code will be released upon acceptance.
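Abstract (reference translation): Long-context inference for large language models (LLMs) is severely limited by high computational demands. Several existing methods optimize attention computation, but they still process the full set of hidden states at each layer, which limits overall efficiency. This work proposes SlimInfer, a framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. The key insight is an information diffusion phenomenon: as information from critical tokens propagates through the layers, it becomes distributed across the entire sequence. This diffusion suggests that LLMs can maintain semantic integrity even when excess tokens, including these critical ones, are pruned from the hidden states. Motivated by this, SlimInfer introduces a dynamic, fine-grained pruning mechanism that accurately removes redundant hidden-state tokens at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches the required token blocks without complex predictors, reducing both memory usage and I/O costs. SlimInfer achieves up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090. The code will be released upon acceptance.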
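
To make the idea of pruning hidden-state tokens at an intermediate layer concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how SlimInfer scores tokens or chooses how many to keep, so the function name `prune_tokens`, the `scores` input, the `keep_ratio` parameter, and the rule of always keeping the last token are all illustrative assumptions.

```python
import numpy as np


def prune_tokens(hidden, scores, keep_ratio, protect_last=1):
    """Keep the highest-scoring tokens of one layer's hidden states.

    hidden:       (seq_len, d_model) array of hidden states.
    scores:       (seq_len,) importance per token. Hypothetical input: the
                  abstract does not specify SlimInfer's scoring rule.
    keep_ratio:   fraction of tokens to keep (assumed hyperparameter).
    protect_last: number of trailing tokens that are always kept (assumption).
    Returns (pruned_hidden, kept_indices), with original token order preserved.
    """
    seq_len = hidden.shape[0]
    n_keep = int(np.ceil(seq_len * keep_ratio))
    n_keep = min(seq_len, max(1, protect_last, n_keep))

    scores = np.asarray(scores, dtype=float).copy()
    if protect_last > 0:
        # Force the trailing tokens into the kept set.
        scores[-protect_last:] = np.inf

    # Stable sort by descending score, then restore sequence order.
    top = np.argsort(-scores, kind="stable")[:n_keep]
    kept = np.sort(top)
    return hidden[kept], kept


# Toy example: 6 tokens with 2-dimensional hidden states.
hidden = np.arange(12, dtype=float).reshape(6, 2)
scores = np.array([0.9, 0.1, 0.5, 0.05, 0.7, 0.2])
pruned, kept = prune_tokens(hidden, scores, keep_ratio=0.5)
print(kept)          # [0 4 5]
print(pruned.shape)  # (3, 2)
```

Later layers would then operate on `pruned` only, which is where the compute saving would come from. The `kept` indices also identify which tokens' KV entries the following layers still need, which is presumably the kind of information the asynchronous KV cache manager described in the abstract could use for prefetching.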
