Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference
- URL: http://arxiv.org/abs/2601.13155v1
- Date: Mon, 19 Jan 2026 15:34:29 GMT
- Title: Probe and Skip: Self-Predictive Token Skipping for Efficient Long-Context LLM Inference
- Authors: Zimeng Wu, Donghao Wang, Chaozhe Jin, Jiaxin Chen, Yunhong Wang,
- Abstract summary: Token-oriented methods, such as pruning and skipping, have shown promise in reducing inference latency.<n>We propose SPTS (Self-Predictive Token Skipping), a training-free framework for efficient long-context inference.
- Score: 29.81657023400426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long-context inference enhances the reasoning capability of Large Language Models (LLMs) while incurring significant computational overhead. Token-oriented methods, such as pruning and skipping, have shown promise in reducing inference latency, but still suffer from inherently limited acceleration potential, outdated proxy signals, and redundancy interference, thus yielding suboptimal speed-accuracy trade-offs. To address these challenges, we propose SPTS (Self-Predictive Token Skipping), a training-free framework for efficient long-context LLM inference. Specifically, motivated by the thought of probing the influence of targeted skipping layers, we design two component-specific strategies for selective token skipping: Partial Attention Probing (PAP) for multi-head attention, which selects informative tokens by performing partial forward attention computation, and Low-rank Transformation Probing (LTP) for feed forward network, which constructs a low-rank proxy network to predict token transformations. Furthermore, a Multi-Stage Delayed Pruning (MSDP) strategy reallocates the skipping budget and progressively prunes redundant tokens across layers. Extensive experiments demonstrate the effectiveness of our method, achieving up to 2.46$\times$ and 2.29$\times$ speedups for prefilling and end-to-end generation, respectively, while maintaining state-of-the-art model performance. The source code will be publicly available upon paper acceptance.
Related papers
- Training-free Context-adaptive Attention for Efficient Long Context Modeling [57.703159205740185]
Training-free Context-adaptive Attention (TCA-Attention) is a training-free sparse attention mechanism that selectively attends to only the informative tokens for efficient long-context inference.<n>TCA-Attention achieves a 2.8$times$ speedup and reduces KV cache by 61% at 128K context length while maintaining performance comparable to full attention.
arXiv Detail & Related papers (2025-12-10T01:54:57Z) - Arbitrage: Efficient Reasoning via Advantage-Aware Speculation [71.45710345765528]
Speculative Decoding accelerates inference by employing a fast but inaccurate draft model to autoregressively propose tokens.<n>But due to unnecessary rejections caused by token mismatches in semantically equivalent steps, traditional token-level Speculative Decoding struggles in reasoning tasks.<n>We propose Arbitrage, a novel step-level speculative generation framework that routes generation dynamically based on the relative advantage between draft and target models.
arXiv Detail & Related papers (2025-12-04T17:50:53Z) - TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs [57.217593337454026]
TokenSqueeze is a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data.<n>We show that TokenSqueeze reduces token usage while maintaining accuracy on the MATH500 benchmark.
arXiv Detail & Related papers (2025-11-17T10:38:56Z) - DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning [6.468843780300177]
We present textbfDELTA, a training-free sparse attention mechanism that achieves computational efficiency without sacrificing model accuracy.<n>Our results show that selective reuse of intermediate attention maps offers a robust path toward efficient long-context reasoning.
arXiv Detail & Related papers (2025-10-10T21:37:49Z) - Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference [7.690958366125321]
This paper introduces informed routing, a new paradigm that proactively addresses these issues.<n>We propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit's output before routing decisions are made.<n>Experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs.
arXiv Detail & Related papers (2025-10-10T09:59:36Z) - READER: Retrieval-Assisted Drafter for Efficient LLM Inference [0.0386965802948046]
Autoregressive Language Models instantiate a factorized likelihood over token sequences, yet their strictly sequential decoding process imposes an intrinsic lower bound on latency inference.<n>This bottleneck has emerged as a central obstacle to the scalable deployment of large-scale generative models.<n>We present READER, a speculative decoding framework that bypasses the training of the auxiliary draft model.
arXiv Detail & Related papers (2025-08-12T16:47:48Z) - R-Stitch: Dynamic Trajectory Stitching for Efficient Reasoning [80.104336426172]
Chain-of-thought (CoT) enhances problem-solving ability of large language models.<n>CoT incurs substantial inference cost due to long autoregressive trajectories.<n>We introduce R-Stitch, a training-free hybrid decoding framework.
arXiv Detail & Related papers (2025-07-23T08:14:36Z) - Can Prompt Difficulty be Online Predicted for Accelerating RL Finetuning of Reasoning Models? [65.18157595903124]
This work investigates iterative approximate evaluation for arbitrary prompts.<n>It introduces Model Predictive Prompt Selection (MoPPS), a Bayesian risk-predictive framework.<n>MoPPS reliably predicts prompt difficulty and accelerates training with significantly reduced rollouts.
arXiv Detail & Related papers (2025-07-07T03:20:52Z) - L-MTP: Leap Multi-Token Prediction Beyond Adjacent Context for Large Language Models [95.53699156138435]
We propose leap multi-token prediction(L-MTP), an innovative token prediction method.<n>Unlike conventional MTP, L-MTP strategically skips over intermediate tokens, predicting non-sequential ones in a single forward pass.<n>We theoretically demonstrate the benefit of L-MTP in improving inference efficiency.
arXiv Detail & Related papers (2025-05-23T05:59:46Z) - Saliency-driven Dynamic Token Pruning for Large Language Models [32.903622070917194]
Saliency-driven Dynamic Token Pruning (SDTP)<n>A lightweight saliency-driven prediction module is designed to estimate the importance score of each token with its hidden state.<n>A ranking-based optimization strategy is proposed to minimize the ranking divergence of the saliency score and the predicted importance score.
arXiv Detail & Related papers (2025-04-06T15:15:07Z) - FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing [17.01412432658081]
Large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws.<n>We propose a fine-grained token-wise pruning approach for the LLMs, which presents a learnable router to adaptively identify the less important tokens.<n>Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods.
arXiv Detail & Related papers (2024-12-16T07:09:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.