Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency
- URL: http://arxiv.org/abs/2503.08524v1
- Date: Tue, 11 Mar 2025 15:15:54 GMT
- Title: Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency
- Authors: Siqi Fan, Xuezhi Fang, Xingrun Xing, Peng Han, Shuo Shang, Yequan Wang
- Abstract summary: A token-position-aware layer-skipping framework is proposed that reduces inference operations by roughly 1.5x while maintaining performance. Experiments on large language models with $7 \sim 70$ billion parameters show that $D^3$ achieves an average 1.5x speedup over the full-inference pipeline.
- Score: 26.173523821684306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to their large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which requires retraining, recent dynamic computation methods show that not all components are required for inference, enabling a training-free pipeline. In this paper, we focus on the dynamic depth of LLM generation. We propose a token-position-aware layer-skipping framework that reduces inference operations by roughly 1.5x while maintaining performance. We first observe that tokens predicted later in the sequence have lower perplexity and thus require less computation. We then propose a training-free algorithm called Position-Aware Depth Decay Decoding ($D^3$), which uses a power-law decay function, $\left\lfloor L \times (\alpha^i) \right\rfloor$, to determine the number of layers to retain when generating token $T_i$. Remarkably, without any retraining, $D^3$ is the first such method to succeed across a wide range of generation tasks. Experiments on large language models (i.e., the Llama family) with $7 \sim 70$ billion parameters show that $D^3$ achieves an average 1.5x speedup over the full-inference pipeline with almost no performance drop ($<1\%$) on the GSM8K and BBH benchmarks.
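To make the decay schedule concrete, here is a minimal Python sketch of the layer-retention rule quoted above, $\left\lfloor L \times (\alpha^i) \right\rfloor$. The value of $\alpha$ and the floor of one retained layer are illustrative assumptions, not values taken from the paper; which layers are dropped and how the cache for skipped layers is handled are design choices of the paper that this sketch does not capture.

```python
import math

def d3_layers_to_keep(total_layers: int, token_index: int, alpha: float = 0.99) -> int:
    """Number of transformer layers retained when generating token T_i,
    following the power-law decay floor(L * alpha^i) from the abstract.
    alpha=0.99 and the minimum of one layer are illustrative assumptions."""
    return max(1, math.floor(total_layers * (alpha ** token_index)))

if __name__ == "__main__":
    L = 32  # e.g. a 7B-class Llama decoder has 32 layers
    for i in (0, 10, 50, 100):
        print(f"token {i}: run {d3_layers_to_keep(L, i)} of {L} layers")
```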
Related papers
- Towards Efficient Automatic Self-Pruning of Large Language Models [55.90119819642064]
Post-training structured pruning is a promising solution that prunes Large Language Models without the need for retraining. We argue that the key to mitigating this issue lies in accurately determining the pruning rate for each layer. We introduce $\textbf{Self-Pruner}$, an end-to-end automatic self-pruning framework for LLMs that efficiently searches for layer-wise pruning rates.
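For intuition only, the sketch below shows what applying a layer-wise pruning rate can look like, using plain magnitude pruning as the criterion; the criterion and the example rates are my assumptions, and the paper's actual contribution, the search procedure that finds the rates, is not reproduced here.

```python
import numpy as np

def magnitude_prune(weight: np.ndarray, rate: float) -> np.ndarray:
    """Zero out the smallest-magnitude `rate` fraction of entries in one
    layer's weight matrix (illustrative criterion, not the paper's)."""
    k = int(weight.size * rate)
    if k == 0:
        return weight.copy()
    cutoff = np.partition(np.abs(weight).ravel(), k - 1)[k - 1]
    return np.where(np.abs(weight) <= cutoff, 0.0, weight)

# Hypothetical per-layer rates, standing in for the output of a search procedure.
layer_rates = [0.1, 0.3, 0.5, 0.3]
layers = [np.random.randn(16, 16) for _ in layer_rates]
pruned = [magnitude_prune(w, r) for w, r in zip(layers, layer_rates)]
```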
arXiv Detail & Related papers (2025-02-20T09:59:50Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose $\textbf{LESA}$, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Scaling Embedding Layers in Language Models [52.47659840377581]
SCONE enables two new scaling strategies: increasing the number of cached $n$-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B-parameter baseline across diverse corpora, while using only half the inference-time FLOPS.
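As a rough illustration of the cached $n$-gram embedding idea in the summary above (the table keying and the additive combination are assumptions on my part, not SCONE's exact design):

```python
from typing import Dict, List, Tuple
import numpy as np

def embed_with_ngram_cache(
    token_ids: List[int],
    token_emb: np.ndarray,                            # [vocab, d] ordinary embedding table
    ngram_cache: Dict[Tuple[int, ...], np.ndarray],   # precomputed n-gram -> [d] vectors
    n: int = 2,
) -> np.ndarray:
    """Add a cached embedding for the n-gram ending at each position to the
    token's own embedding. The cache is only looked up at inference time, so
    growing it adds capacity without adding inference-time FLOPS."""
    d = token_emb.shape[1]
    out = np.zeros((len(token_ids), d))
    for i, tok in enumerate(token_ids):
        key = tuple(token_ids[max(0, i - n + 1): i + 1])
        out[i] = token_emb[tok] + ngram_cache.get(key, np.zeros(d))
    return out
```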
arXiv Detail & Related papers (2025-02-03T18:59:32Z) - SWAN: SGD with Normalization and Whitening Enables Stateless LLM Training [16.037614012166063]
Stochastic Gradient Descent (SGD) is stateless and scalable, as it does not track state variables during training. In this work, we show that pre-processing the gradients in a stateless manner allows SGD to match the performance of Adam for LLM training. We show that normalization stabilizes the gradients and whitening counteracts the local curvature of the loss landscape. The result is SWAN (SGD with Whitening And Normalization), an approach that eliminates the need to store any optimizer states.
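A minimal numpy sketch of the stateless pre-processing idea as I read this summary; the per-row normalization and the eigendecomposition-based whitening below are assumptions, not the paper's exact recipe.

```python
import numpy as np

def swan_like_update(weight: np.ndarray, grad: np.ndarray,
                     lr: float = 1e-3, eps: float = 1e-8) -> np.ndarray:
    """One SGD step with stateless gradient pre-processing: normalize, then
    whiten, then apply a plain SGD update. Nothing is carried between steps,
    unlike Adam's first/second-moment buffers."""
    # Normalization (assumed form): scale each row of the gradient to unit norm.
    g = grad / (np.linalg.norm(grad, axis=1, keepdims=True) + eps)
    # Whitening (assumed form): multiply by (g g^T)^{-1/2}, recomputed from
    # scratch every step so no optimizer state is stored.
    evals, evecs = np.linalg.eigh(g @ g.T)
    inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, eps))) @ evecs.T
    return weight - lr * (inv_sqrt @ g)
```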
arXiv Detail & Related papers (2024-12-17T18:13:18Z) - HSR-Enhanced Sparse Attention Acceleration [19.776342074253435]
We introduce a novel approach to accelerate attention computation in Large Language Models (LLMs).
We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and ReLU attention.
Our method only introduces provably negligible error for Softmax attention.
arXiv Detail & Related papers (2024-10-14T05:18:02Z) - A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models.
HiP reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length.
We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z) - Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference [19.167604927651073]
Auto-regressive decoding of Large Language Models (LLMs) imposes significant overheads on hardware performance.
We propose a novel parallel prompt decoding method that requires only $0.0002\%$ trainable parameters, enabling efficient training on a single A100-40GB GPU in just 16 hours.
Our approach demonstrates up to a 2.49$\times$ speedup while maintaining a minimal memory overhead of just $0.0004\%$.
arXiv Detail & Related papers (2024-05-28T22:19:30Z) - Stacking Your Transformers: A Closer Look at Model Growth for Efficient LLM Pre-Training [42.89066583603415]
This work identifies three critical $\textit{O}$bstacles: ($\textit{O}$1) lack of comprehensive evaluation, ($\textit{O}$2) untested viability for scaling, and ($\textit{O}$3) lack of empirical guidelines.
We show that a depthwise stacking operator, called $G_{\text{stack}}$, exhibits remarkable acceleration in training, leading to decreased loss and improved overall performance.
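A toy sketch of a depthwise stacking operator in the spirit of $G_{\text{stack}}$; treating "stacking" as simply repeating the trained layer list is my assumption, and the paper's exact growth rule may differ.

```python
import copy
from typing import List

def depthwise_stack(trained_layers: List[object], growth_factor: int = 2) -> List[object]:
    """Initialize a deeper model by repeating the trained layer stack, so an
    L-layer checkpoint seeds a (growth_factor * L)-layer model."""
    return [copy.deepcopy(layer) for _ in range(growth_factor) for layer in trained_layers]

# e.g. grow a 12-layer checkpoint into a 24-layer initialization
deeper = depthwise_stack(list(range(12)), growth_factor=2)
assert len(deeper) == 24
```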
arXiv Detail & Related papers (2024-05-24T08:00:00Z) - Simplifying and Understanding State Space Models with Diagonal Linear RNNs [56.33053691749856]
This work disposes of the discretization step, and proposes a model based on vanilla Diagonal Linear RNNs.
We empirically show that, despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as previously proposed SSMs.
We also characterize the expressivity of SSMs and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks.
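To make "Diagonal Linear RNN" concrete, here is a minimal real-valued sketch of the recurrence; the real-valued parameterization and the shapes are assumptions (the paper's model may, for instance, use complex-valued state).

```python
import numpy as np

def diagonal_linear_rnn(x: np.ndarray, lam: np.ndarray,
                        B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """h_t = lam * h_{t-1} + B @ x_t   (element-wise, i.e. a diagonal transition)
       y_t = C @ h_t
    Shapes: x [T, d_in], lam [n], B [n, d_in], C [d_out, n] -> output [T, d_out]."""
    T = x.shape[0]
    h = np.zeros(lam.shape[0])
    ys = np.zeros((T, C.shape[0]))
    for t in range(T):
        h = lam * h + B @ x[t]
        ys[t] = C @ h
    return ys
```

Because the transition is diagonal, each state channel evolves independently, which is what makes the model both easy to analyze and cheap to run.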
arXiv Detail & Related papers (2022-12-01T18:53:06Z) - Training Multi-Layer Over-Parametrized Neural Network in Subquadratic Time [12.348083977777833]
We consider the problem of training a multi-layer over-parametrized neural network to minimize the empirical risk induced by a loss function.
In this work, we show how to reduce the training cost per iteration.
arXiv Detail & Related papers (2021-12-14T18:13:36Z) - Provably Efficient Reinforcement Learning for Discounted MDPs with Feature Mapping [99.59319332864129]
In this paper, we study reinforcement learning for discounted Markov Decision Processes (MDPs).
We propose a novel algorithm that makes use of the feature mapping and obtains a $\tilde{O}(d\sqrt{T}/(1-\gamma)^2)$ regret.
Our upper and lower bound results together suggest that the proposed reinforcement learning algorithm is near-optimal up to a $(1-\gamma)^{-0.5}$ factor.
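Reading the two claims together, the quoted upper bound and the near-optimality statement imply a lower bound of the following order; this is my reconstruction of the gap, with constants and logarithmic factors omitted.

```latex
\underbrace{\tilde{O}\!\left(\frac{d\sqrt{T}}{(1-\gamma)^{2}}\right)}_{\text{upper bound (proposed algorithm)}}
\quad\text{vs.}\quad
\underbrace{\tilde{\Omega}\!\left(\frac{d\sqrt{T}}{(1-\gamma)^{1.5}}\right)}_{\text{lower bound}},
\qquad\text{a gap of }(1-\gamma)^{-0.5}.
```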
arXiv Detail & Related papers (2020-06-23T17:08:54Z) - Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning [66.05472746340142]
This paper analyzes how multi-layer neural networks can perform hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective.
We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers.
arXiv Detail & Related papers (2020-01-13T17:28:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.