Related papers: TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation

URL: http://arxiv.org/abs/2502.18890v2
Date: Wed, 09 Jul 2025 02:35:21 GMT
Title: TokenSwift: Lossless Acceleration of Ultra Long Sequence Generation
Authors: Tong Wu, Junzhe Shen, Zixia Jia, Yuxuan Wang, Zilong Zheng,
Abstract summary: TOKENSWIFT is designed to substantially accelerate the generation process of ultra-long sequences.<n>It achieves over 3 times speedup across models of varying scales.<n>It translates to hours of time savings for ultra-long sequence generation.
Score: 26.79477846621806
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Generating ultra-long sequences with large language models (LLMs) has become increasingly crucial but remains a highly time-intensive task, particularly for sequences up to 100K tokens. While traditional speculative decoding methods exist, simply extending their generation limits fails to accelerate the process and can be detrimental. Through an in-depth analysis, we identify three major challenges hindering efficient generation: frequent model reloading, dynamic key-value (KV) management and repetitive generation. To address these issues, we introduce TOKENSWIFT, a novel framework designed to substantially accelerate the generation process of ultra-long sequences while maintaining the target model's inherent quality. Experimental results demonstrate that TOKENSWIFT achieves over 3 times speedup across models of varying scales (1.5B, 7B, 8B, 14B) and architectures (MHA, GQA). This acceleration translates to hours of time savings for ultra-long sequence generation, establishing TOKENSWIFT as a scalable and effective solution at unprecedented lengths. Code can be found at https://github.com/bigai-nlco/TokenSwift.

Related papers

TriSpec: Ternary Speculative Decoding via Lightweight Proxy Verification [63.65902785448346]
Speculative decoding offers significant speed-ups through its lightweight drafting and parallel verification mechanism.<n>We propose TriSpec, a novel ternary SD framework that introduces a lightweight proxy to significantly reduce computational cost.<n>Experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35% speedup over standard SD.
arXiv Detail & Related papers (2026-01-30T17:04:18Z)
HiFi-Mesh: High-Fidelity Efficient 3D Mesh Generation via Compact Autoregressive Dependence [36.403921772528236]
We introduce the Latent Autoregressive Network (LANE), which incorporates compact autoregressive dependencies in the generation process.<n>LANE achieves a $6times$ improvement in maximum sequence length compared to existing methods.
arXiv Detail & Related papers (2026-01-29T06:22:26Z)
SpecPV: Improving Self-Speculative Decoding for Long-Context Generation via Partial Verification [11.366541829206199]
Speculative decoding is one of the most direct and effective approaches for accelerating generation.<n>We introduce SpecPV, a self-speculative decoding approach that performs fast verification using partial key-value states.<n>We validate SpecPV across multiple long-context benchmarks and models, including LLaMA-3.1-8B-Instruct and Qwen3-series.
arXiv Detail & Related papers (2025-12-02T02:15:33Z)
Fast Inference via Hierarchical Speculative Decoding [65.40448210801763]
We introduce Hierarchical Speculative Decoding (HSD), an algorithm that stacks draft models into a hierarchy, where each model proposes tokens, and the next larger model verifies them in a single forward pass.<n>HSD gives up to 1.2x speed-up over the best single-draft baseline.
arXiv Detail & Related papers (2025-10-22T15:56:19Z)
Beyond Fixed: Variable-Length Denoising for Diffusion Large Language Models [74.15250326312179]
Diffusion Large Language Models offer efficient parallel generation and capable global modeling.<n>The dominant application ofDLLMs is hindered by the need for a statically predefined generation length.<n>We introduce DAEDAL, a novel training-free denoising strategy that enables Dynamic Adaptive Length Expansion.
arXiv Detail & Related papers (2025-08-01T17:56:07Z)
SpecExtend: A Drop-in Enhancement for Speculative Decoding of Long Sequences [4.268504966623081]
We introduce SpecExtend, a drop-in enhancement that improves the performance of speculative decoding on long sequences.<n>SpecExtend integrates efficient attention mechanisms such as FlashAttention and Hybrid Tree Attention into both the draft and target models.<n>To improve draft accuracy and speed, we propose Cross-model Retrieval, a novel KV cache update strategy.
arXiv Detail & Related papers (2025-05-27T06:30:00Z)
DuoDecoding: Hardware-aware Heterogeneous Speculative Decoding with Dynamic Multi-Sequence Drafting [59.57151419673759]
Speculative decoding presents a draft-then-verify framework that reduces generation latency while maintaining output distribution fidelity. We propose DuoDecoding, a novel approach that strategically deploys the draft and target models on the CPU and GPU respectively. Our method incorporates a hardware-aware optimal draft budget to minimize idle times and employs dynamic multi-sequence drafting to enhance draft quality.
arXiv Detail & Related papers (2025-03-02T08:27:48Z)
Sliding Window Attention Training for Efficient Large Language Models [55.56483740523027]
We introduce SWAT, which enables efficient long-context handling via Sliding Window Attention Training.<n>This paper first attributes the inefficiency of Transformers to the attention sink phenomenon.<n>We replace softmax with the sigmoid function and utilize a balanced ALiBi and Rotary Position Embedding for efficient information compression and retention.
arXiv Detail & Related papers (2025-02-26T05:31:44Z)
LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification [42.54363549922909]
Speculative decoding has become a promising technique to mitigate the high inference latency of autoregressive decoding in Large Language Models.<n>Despite its promise, the effective application of speculative decoding in LLMs still confronts three key challenges.<n>We enhance the performance of speculative decoding in long-context settings by addressing these challenges.
arXiv Detail & Related papers (2025-02-24T18:53:31Z)
Apollo-Forecast: Overcoming Aliasing and Inference Speed Challenges in Language Models for Time Series Forecasting [16.177920916883565]
Anti-Aliasing Quantization Module (AAQM) and Race Decoding (RD) technique are presented.<n>AAQM adeptly encodes sequences into tokens while mitigating high-frequency noise in the original signals.<n>RD employs a draft model to enable parallel processing and results integration, which markedly accelerates the inference speed for long-term predictions.
arXiv Detail & Related papers (2024-12-16T11:01:20Z)
RefreshKV: Updating Small KV Cache During Long-form Generation [54.00118604124301]
We propose a new inference method, RefreshKV, that flexibly alternates between full context attention and attention over a subset of input tokens during generation. Applying our method to off-the-shelf LLMs achieves comparable speedup to eviction-based methods while improving performance for various long-form generation tasks.
arXiv Detail & Related papers (2024-11-08T18:57:07Z)
LANTERN: Accelerating Visual Autoregressive Models with Relaxed Speculative Decoding [30.630803933771865]
Experimental results demonstrate the efficacy of our method in providing a substantial speed-up over speculative decoding.<n> LANTERN increases speed-ups by $mathbf1.75times$ and $mathbf1.82times$, as compared to greedy decoding and random sampling.
arXiv Detail & Related papers (2024-10-04T12:21:03Z)
CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states. Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
arXiv Detail & Related papers (2024-06-17T18:34:58Z)
Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely textithidden transfer, which decodes multiple successive tokens simultaneously in a single forward pass. In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
Self-attention mechanism's computational cost limits its practicality for long sequences. We propose a new method called LongVQ to compress the global abstraction as a length-fixed codebook. LongVQ effectively maintains dynamic global and local patterns, which helps to complement the lack of long-range dependency issues.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep. We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
Don't Take It Literally: An Edit-Invariant Sequence Loss for Text Generation [109.46348908829697]
We propose a novel Edit-Invariant Sequence Loss (EISL), which computes the matching loss of a target n-gram with all n-grams in the generated sequence. We conduct experiments on three tasks: machine translation with noisy target sequences, unsupervised text style transfer, and non-autoregressive machine translation.
arXiv Detail & Related papers (2021-06-29T03:59:21Z)
Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting [25.417560221400347]
Long sequence time-series forecasting (LSTF) demands a high prediction capacity. Recent studies have shown the potential of Transformer to increase the prediction capacity. We design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics.
arXiv Detail & Related papers (2020-12-14T11:43:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.