Related papers: Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy

Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy

URL: http://arxiv.org/abs/2404.06954v1
Date: Wed, 10 Apr 2024 12:12:07 GMT
Title: Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy
Authors: Yijin Liu, Fandong Meng, Jie Zhou,
Abstract summary: Dynamic computation methods have shown notable acceleration for Large Language Models (LLMs) by skipping several layers of computations. We propose a Unified Layer Skipping strategy, which selects the number of layers to skip computation based solely on the target speedup ratio. Experimental results on two common tasks, i.e., machine translation and text summarization, indicate that given a target speedup ratio, the Unified Layer Skipping strategy significantly enhances both the inference performance and the actual model throughput.
Score: 67.45518210171024
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, dynamic computation methods have shown notable acceleration for Large Language Models (LLMs) by skipping several layers of computations through elaborate heuristics or additional predictors. However, in the decoding process of existing approaches, different samples are assigned different computational budgets, which cannot guarantee a stable and precise acceleration effect. Furthermore, existing approaches generally skip multiple contiguous layers at the bottom or top of the layers, leading to a drastic change in the model's layer-wise representations, and thus a consequent performance degeneration. Therefore, we propose a Unified Layer Skipping strategy, which selects the number of layers to skip computation based solely on the target speedup ratio, and then skips the corresponding number of intermediate layer computations in a balanced manner. Since the Unified Layer Skipping strategy is independent of input samples, it naturally supports popular acceleration techniques such as batch decoding and KV caching, thus demonstrating more practicality for real-world applications. Experimental results on two common tasks, i.e., machine translation and text summarization, indicate that given a target speedup ratio, the Unified Layer Skipping strategy significantly enhances both the inference performance and the actual model throughput over existing dynamic approaches.

Related papers

LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive. Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones. We propose textbfLESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z)
Dynamic layer selection in decoder-only transformers [21.18795712840146]
We empirically examine two common dynamic inference methods for natural language generation. We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping. We also show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains.
arXiv Detail & Related papers (2024-10-26T00:44:11Z)
Towards Differentiable Multilevel Optimization: A Gradient-Based Approach [1.6114012813668932]
This paper introduces a novel gradient-based approach for multilevel optimization. Our method significantly reduces computational complexity while improving both solution accuracy and convergence speed. To the best of our knowledge, this is one of the first algorithms to provide a general version of implicit differentiation.
arXiv Detail & Related papers (2024-10-15T06:17:59Z)
A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies [51.7643024367548]
Stable Diffusion Model is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation. This study focuses on reducing redundant computation in SDM and optimizing the model through both tuning and tuning-free methods.
arXiv Detail & Related papers (2024-05-31T21:47:05Z)
You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model [37.24203191658052]
Large-scale Transformer models bring significant improvements for various downstream vision language tasks with a unified architecture. Performance improvements come with increasing model size, resulting in slow inference speed and increased cost for severing. We propose a novel early exiting strategy for unified visual language models, which allows dynamically skip the layers in encoder and decoder simultaneously.
arXiv Detail & Related papers (2022-11-21T02:32:25Z)
Layer-Wise Partitioning and Merging for Efficient and Scalable Deep Learning [16.38731019298993]
We have proposed a novel layer-wise partitioning and merging, forward and backward pass parallel framework to provide better training performance. The experimental evaluation on real use cases shows that the proposed method outperforms the state-of-the-art approaches in terms of training speed.
arXiv Detail & Related papers (2022-07-22T11:47:34Z)
Faster One-Sample Stochastic Conditional Gradient Method for Composite Convex Minimization [61.26619639722804]
We propose a conditional gradient method (CGM) for minimizing convex finite-sum objectives formed as a sum of smooth and non-smooth terms. The proposed method, equipped with an average gradient (SAG) estimator, requires only one sample per iteration. Nevertheless, it guarantees fast convergence rates on par with more sophisticated variance reduction techniques.
arXiv Detail & Related papers (2022-02-26T19:10:48Z)
Layer Pruning on Demand with Intermediate CTC [50.509073206630994]
We present a training and pruning method for ASR based on the connectionist temporal classification (CTC) We show that a Transformer-CTC model can be pruned in various depth on demand, improving real-time factor from 0.005 to 0.002 on GPU.
arXiv Detail & Related papers (2021-06-17T02:40:18Z)
Layer Reduction: Accelerating Conformer-Based Self-Supervised Model via Layer Consistency [31.572652956170252]
Transformer-based self-supervised models are trained as feature extractors and have empowered many downstream speech tasks to achieve state-of-the-art performance. We experimentally achieve 7.8X parameter reduction, 41.9% training speedup and 37.7% inference speedup while maintaining comparable performance with conventional BERT-like self-supervised methods.
arXiv Detail & Related papers (2021-04-08T08:21:59Z)
Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem. We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent. Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
BERT Loses Patience: Fast and Robust Inference with Early Exit [91.26199404912019]
We propose Patience-based Early Exit as a plug-and-play technique to improve the efficiency and robustness of a pretrained language model. Our approach improves inference efficiency as it allows the model to make a prediction with fewer layers.
arXiv Detail & Related papers (2020-06-07T13:38:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.