DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference
- URL: http://arxiv.org/abs/2507.19608v1
- Date: Fri, 25 Jul 2025 18:23:18 GMT
- Title: DeltaLLM: A Training-Free Framework Exploiting Temporal Sparsity for Efficient Edge LLM Inference
- Authors: Jiawen Qi, Chang Gao, Zhaochun Ren, Qinyu Chen
- Abstract summary: We present DeltaLLM, a training-free framework that exploits temporal sparsity in attention patterns to enable efficient LLM inference on resource-constrained edge devices. We evaluate our framework on the edge-device-friendly BitNet-b1.58-2B-4T model and Llama3.2-1B-Instruct model across diverse language tasks.
- Score: 19.987309147268586
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deploying Large Language Models (LLMs) on edge devices remains challenging because their computation grows quadratically with sequence length. Existing approaches to dynamic attention pruning are designed for hardware with massively parallel computation capabilities, such as GPUs or TPUs, and target long context lengths (e.g., 64K), making them unsuitable for edge scenarios. We present DeltaLLM, a training-free framework that exploits temporal sparsity in attention patterns to enable efficient LLM inference across both the prefilling and decoding stages on resource-constrained edge devices. DeltaLLM introduces an accuracy- and memory-aware delta matrix construction strategy that induces temporal sparsity, and a context-aware hybrid attention mechanism that combines full attention in a local context window with delta approximation outside it to increase accuracy. We evaluate our framework on the edge-device-friendly BitNet-b1.58-2B-4T and Llama3.2-1B-Instruct models across diverse language tasks. The results show that on BitNet, our framework increases attention sparsity from 0% to 60% during the prefilling stage with a slight accuracy improvement on the WG task, and from 0% to 57% across both the prefilling and decoding stages while raising the F1 score from 29.63 to 30.97 on the SQuAD-v2 task. On the Llama model, it also achieves up to 60% sparsity during the prefilling stage and around 57% across both stages with a negligible accuracy drop. These results demonstrate that DeltaLLM offers a promising solution for efficient edge deployment, requiring no fine-tuning and integrating seamlessly with existing inference pipelines.
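The abstract names two mechanisms: a delta matrix that keeps only the entries which changed meaningfully since the previous step (temporal sparsity), and a hybrid attention that stays exact inside a local context window while approximating older positions from cached scores plus the sparsified delta. The NumPy sketch below illustrates that combination under stated assumptions; the function names, the fixed threshold, and the score-update rule are illustrative choices, not the paper's accuracy- and memory-aware construction.

```python
import numpy as np

def delta_sparsify(prev, curr, threshold=1e-2):
    """Keep only the entries that changed by more than `threshold` since the last step."""
    delta = curr - prev
    mask = np.abs(delta) >= threshold          # temporal sparsity: small changes are dropped
    return np.where(mask, delta, 0.0), mask

def hybrid_attention_scores(q, prev_q, K, prev_scores, window=64, threshold=1e-2):
    """Approximate q @ K.T / sqrt(d) over the t cached key positions.

    q, prev_q   : current / previous query vectors, shape (d,)
    K           : cached key matrix, shape (t, d)
    prev_scores : scores kept from the previous step (assumed to cover at least
                  the positions that fall outside the local window)
    """
    t, d = K.shape
    scores = np.empty(t)
    scale = 1.0 / np.sqrt(d)

    # Local context window: recompute exactly for the most recent positions.
    start = max(0, t - window)
    scores[start:] = (K[start:] @ q) * scale

    # Outside the window: reuse last step's scores and correct them with the
    # sparsified query delta, so dimensions that barely changed cost nothing.
    if start > 0:
        dq, mask = delta_sparsify(prev_q, q, threshold)
        scores[:start] = prev_scores[:start] + (K[:start][:, mask] @ dq[mask]) * scale

    return scores
```

In a decoding loop one would cache the returned scores and the current query for the next step, and periodically fall back to a full recomputation so that drift in the approximated region stays bounded; how and when DeltaLLM does this is governed by its accuracy- and memory-aware strategy rather than the fixed threshold used here.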
Related papers
- Inter2Former: Dynamic Hybrid Attention for Efficient High-Precision Interactive Segmentation [58.0729162588429]
Interactive segmentation improves annotation efficiency by segmenting target regions from user prompts. Current approaches face a critical trade-off: dense-token methods achieve superior accuracy but suffer from prohibitively slow processing on CPU devices. We propose Inter2Former to address this challenge by optimizing computation allocation in dense-token processing.
arXiv Detail & Related papers (2025-07-13T12:33:37Z) - Hybrid Systolic Array Accelerator with Optimized Dataflow for Edge Large Language Model Inference [8.475319961845903]
Edge accelerators should achieve high area efficiency and minimize external memory access. This paper proposes an edge LLM inference accelerator featuring a hybrid systolic array architecture. Our solution achieves 247/117 (tokens/s/mm2) while running a 1.3B LLM on long-input/long-output scenarios.
arXiv Detail & Related papers (2025-07-11T20:27:30Z) - Scaling Attention to Very Long Sequences in Linear Time with Wavelet-Enhanced Random Spectral Attention (WERSA) [1.7622426179653563]
Transformer models are computationally costly on long sequences since regular attention has quadratic $O(n^2)$ time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel mechanism with linear $O(n)$ time complexity. By significantly reducing computational loads without compromising accuracy, WERSA makes possible more practical, more affordable long-context models.
arXiv Detail & Related papers (2025-07-11T14:40:40Z) - Learning Adaptive Parallel Reasoning with Language Models [70.1745752819628]
We propose Adaptive Parallel Reasoning (APR), a novel reasoning framework that enables language models to orchestrate both serialized and parallel computations end-to-end. APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. A key innovation is our end-to-end reinforcement learning strategy, optimizing both parent and child inference threads to enhance the task success rate without requiring predefined reasoning structures. (A toy sketch of the spawn()/join() control flow appears after this list.)
arXiv Detail & Related papers (2025-04-21T22:29:02Z) - Fragile Mastery: Are Domain-Specific Trade-Offs Undermining On-Device Language Models? [0.0]
The Generalized Edge Model (GEM) aims to balance robustness and generalization in a harmonious manner. GEM employs a Sparse Cross-Attention Router (SCAR) to dynamically allocate computation across a variable number of computing resources. Compared to GPT-4 Lite, GEM improves general-task performance by 7% while maintaining parity in domain-specific performance.
arXiv Detail & Related papers (2025-03-16T18:30:26Z) - ParallelComp: Parallel Long-Context Compressor for Length Extrapolation [51.68913021512016]
Extrapolating ultra-long contexts (text length >128K) remains a major challenge for large language models (LLMs). In this work, we propose ParallelComp, a parallel long-context compression method that effectively overcomes the memory bottleneck. We achieve a 1.76x improvement in chunk throughput, thereby achieving a 23.50x acceleration in the prefill stage with negligible performance loss.
arXiv Detail & Related papers (2025-02-20T07:10:43Z) - CMoE: Converting Mixture-of-Experts from Dense to Accelerate LLM Inference [33.871080938643566]
We present CMoE, a framework that rapidly transforms dense language models into mixture-of-experts (MoEs) without training. Experiments demonstrate that, with an activation ratio of 75%, it achieves remarkable perplexity results. A CMoE configuration activating just 25% of parameters reduces end-to-end latency by 1.5x while preserving usable perplexity without additional training.
arXiv Detail & Related papers (2025-02-06T14:05:30Z) - Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance. Existing direct preference learning algorithms are originally designed for the single-turn chat task. We introduce a multi-turn direct preference learning framework tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z) - EdgeOL: Efficient in-situ Online Learning on Edge Devices [51.86178757050963]
We propose EdgeOL, an edge online learning framework that optimizes inference accuracy, fine-tuning execution time, and energy efficiency. Experimental results show that, on average, EdgeOL reduces overall fine-tuning execution time by 64% and energy consumption by 52%, and improves average inference accuracy by 1.75% over the immediate online learning strategy.
arXiv Detail & Related papers (2024-01-30T02:41:05Z) - DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
arXiv Detail & Related papers (2021-10-30T03:29:47Z) - Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
arXiv Detail & Related papers (2021-06-07T11:37:03Z) - Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)
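The Adaptive Parallel Reasoning entry above names spawn() and join() operations for adaptive multi-threaded inference; the toy scaffold below (referenced from that entry) only illustrates that control flow. The names are assumptions, generate() is a stand-in for an arbitrary LLM call, and the paper's reinforcement-learned spawning policy is not modelled.

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)

def generate(prompt: str) -> str:
    """Stand-in for a language-model call (hypothetical, not a real API)."""
    return f"partial answer for: {prompt}"

def spawn(sub_prompt: str) -> Future:
    """Launch a child reasoning thread on a sub-problem."""
    return _pool.submit(generate, sub_prompt)

def join(children):
    """Block until all child threads finish and collect their results."""
    return [child.result() for child in children]

def parent_reasoner(question: str) -> str:
    # The parent decomposes the question, fans the sub-problems out in parallel,
    # then merges the partial answers with one final serialized call.
    children = [spawn(f"{question} -- subproblem {i}") for i in range(3)]
    partials = join(children)
    return generate("combine: " + " | ".join(partials))

print(parent_reasoner("Outline two independent proofs of the claim."))
```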
This list is automatically generated from the titles and abstracts of the papers on this site.