SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
- URL: http://arxiv.org/abs/2508.06447v1
- Date: Fri, 08 Aug 2025 16:42:38 GMT
- Title: SlimInfer: Accelerating Long-Context LLM Inference via Dynamic Token Pruning
- Authors: Lingkun Long, Rubing Yang, Yushi Huang, Desheng Hui, Ao Zhou, Jianlei Yang,
- Abstract summary: SlimInfer aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. We show that SlimInfer can achieve up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct.
- Score: 3.502168555273189
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-context inference for Large Language Models (LLMs) is heavily limited by high computational demands. While several existing methods optimize attention computation, they still process the full set of hidden states at each layer, limiting overall efficiency. In this work, we propose SlimInfer, an innovative framework that aims to accelerate inference by directly pruning less critical prompt tokens during the forward pass. Our key insight is an information diffusion phenomenon: As information from critical tokens propagates through layers, it becomes distributed across the entire sequence. This diffusion process suggests that LLMs can maintain semantic integrity even when many tokens, including the critical ones themselves, are pruned from the hidden states. Motivated by this, SlimInfer introduces a dynamic fine-grained pruning mechanism that accurately removes redundant hidden-state tokens at intermediate layers. This layer-wise pruning naturally enables an asynchronous KV cache manager that prefetches required token blocks without complex predictors, reducing both memory usage and I/O costs. Extensive experiments show that SlimInfer can achieve up to $\mathbf{2.53\times}$ time-to-first-token (TTFT) speedup and $\mathbf{1.88\times}$ end-to-end latency reduction for LLaMA3.1-8B-Instruct on a single RTX 4090, without sacrificing performance on LongBench. Our code will be released upon acceptance.
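The layer-wise pruning the abstract describes can be illustrated with a minimal sketch. The scoring rule below (L2 norm of each hidden vector) and the `keep_recent` window are hypothetical stand-ins for illustration only, not SlimInfer's actual importance criterion:

```python
import math

def prune_hidden_states(hidden, keep_ratio=0.5, keep_recent=4):
    """Toy sketch of layer-wise token pruning.

    hidden: list of token hidden vectors (each a list of floats).
    Keeps the top `keep_ratio` fraction of tokens by a proxy
    importance score, always retaining the `keep_recent` most
    recent tokens. Returns (pruned hidden states, kept indices).
    """
    seq_len = len(hidden)
    n_keep = max(int(seq_len * keep_ratio), keep_recent)
    # Hypothetical importance proxy: L2 norm of the hidden vector.
    scores = [math.sqrt(sum(x * x for x in vec)) for vec in hidden]
    for i in range(seq_len - keep_recent, seq_len):
        scores[i] = float("inf")  # recent tokens are never pruned
    # Take the n_keep highest-scoring positions, in original order.
    kept = sorted(sorted(range(seq_len), key=lambda i: scores[i])[-n_keep:])
    return [hidden[i] for i in kept], kept

# Toy example: 16 tokens, keep 25% of them plus the 2 most recent.
hidden = [[float(i)] * 2 for i in range(16)]
pruned, kept = prune_hidden_states(hidden, keep_ratio=0.25, keep_recent=2)
# kept == [12, 13, 14, 15]: highest-norm tokens plus the recent window
```

Because subsequent layers only see the kept tokens, the attention and FFN cost at those layers shrinks proportionally; in the paper this is what also allows the KV cache manager to prefetch only the surviving token blocks.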
Related papers
- Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing [18.405286688847827]
Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. We present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference.
arXiv Detail & Related papers (2026-02-02T14:36:10Z)
- Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching [10.315266731366123]
We present a window-based token pruning and caching method for inference. Experiments on LLaDA and Dream show that, under matched compute budgets, our method achieves up to $99\times$ inference speedup.
arXiv Detail & Related papers (2026-01-28T07:49:20Z)
- Behavior-Equivalent Token: Single-Token Replacement for Long Prompts in LLMs [55.827877498548965]
We propose a lightweight training framework that learns a single prompt-specific Behavior-Equivalent token ([BE]). The framework first trains [BE] to encode the natural-language content of the original system prompt via reconstruction, and then distills the prompt's downstream behavior into this single token. Empirical evaluations on three datasets show that a single [BE] token achieves up to a 3000x reduction in prompt length, while retaining about 98% of the downstream performance of the original system prompts.
arXiv Detail & Related papers (2025-11-28T15:22:52Z)
- TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs [57.217593337454026]
TokenSqueeze is a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. We show that TokenSqueeze reduces token usage while maintaining accuracy on the MATH500 benchmark.
arXiv Detail & Related papers (2025-11-17T10:38:56Z)
- FrugalPrompt: Reducing Contextual Overhead in Large Language Models via Token Attribution [3.4666771782038652]
Large language models (LLMs) owe much of their stellar performance to expansive input contexts, yet such verbosity inflates monetary costs, carbon footprint, and inference-time latency. We introduce FrugalPrompt, a novel prompt compression framework for LLMs, which retains only the most semantically significant tokens. We evaluate the approach across four NLP tasks: Sentiment Analysis, Commonsense QA, Summarization, and Mathematical Reasoning.
arXiv Detail & Related papers (2025-10-18T10:22:13Z)
- Attention Is All You Need for KV Cache in Diffusion LLMs [36.94369617373333]
Elastic-Cache performs adaptive, layer-aware cache updates for diffusion large language models. Our method achieves significantly higher throughput ($6.8\times$ on GSM8K) than existing confidence-based approaches.
arXiv Detail & Related papers (2025-10-16T17:59:48Z)
- Sparse-dLLM: Accelerating Diffusion LLMs with Dynamic Cache Eviction [58.044803442346115]
Diffusion Large Language Models (dLLMs) enable breakthroughs in reasoning and parallel decoding but suffer from prohibitive computational complexity and memory overhead during inference. We propose Sparse-dLLM, the first training-free framework integrating dynamic cache eviction with sparse attention via delayed bidirectional sparse caching.
arXiv Detail & Related papers (2025-08-04T16:14:03Z)
- InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU [48.105361428245736]
We introduce InfiniteHiP, an inference framework for large language models (LLMs). We dynamically eliminate irrelevant context tokens through a modular hierarchical token pruning algorithm. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training.
arXiv Detail & Related papers (2025-02-13T02:52:01Z) - FIRP: Faster LLM inference via future intermediate representation prediction [54.897493351694195]
FIRP generates multiple tokens instead of one at each decoding step.
We conduct extensive experiments, showing a speedup ratio of 1.9x-3x in several models and datasets.
arXiv Detail & Related papers (2024-10-27T15:53:49Z) - HSR-Enhanced Sparse Attention Acceleration [19.776342074253435]
We introduce a novel approach to accelerate attention computation in Large Language Models (LLMs). We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention. Our method only introduces provably negligible error for Softmax attention.
arXiv Detail & Related papers (2024-10-14T05:18:02Z) - Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z) - A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models. HiP reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length. We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z) - FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent achieving remarkable success in language understanding and generation.
To mitigate overload incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, an input-adaptive feed-forward skipping strategy.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.