LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
- URL: http://arxiv.org/abs/2404.16710v2
- Date: Mon, 29 Apr 2024 15:02:36 GMT
- Title: LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
- Authors: Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu,
- Abstract summary: LayerSkip is an end-to-end solution to speed up inference of large language models (LLMs).
We show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model.
We also present a novel self-speculative decoding solution where we exit at early layers and verify and correct with remaining layers of the model.
- Score: 13.747101397628887
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, we present a novel self-speculative decoding solution where we exit at early layers and verify and correct with remaining layers of the model. Our proposed self-speculative decoding approach has a smaller memory footprint than other speculative decoding approaches and benefits from shared compute and activations of the draft and verification stages. We run experiments on different Llama model sizes on different types of training: pretraining from scratch, continual pretraining, finetuning on a specific data domain, and finetuning on a specific task. We implement our inference solution and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x on coding, and 2.0x on the TOPv2 semantic parsing task.
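The draft-then-verify loop described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the "model" below is a hypothetical stack of deterministic pseudo-random layers, and the names `layer_update`, `next_token`, `exit_layer`, and `k` are assumptions made for illustration. It also verifies drafted tokens one at a time, whereas a real system would verify them in a single batched pass through the remaining layers and reuse the draft stage's activations. The sketch only shows the control flow: draft tokens with the first few layers, check them with the full stack, and keep the longest agreeing prefix plus one corrected token.

```python
import random

VOCAB = 8        # toy vocabulary size (assumption)
NUM_LAYERS = 6   # toy model depth (assumption)


def layer_update(context, layer):
    """Hypothetical layer: a deterministic pseudo-random logit update."""
    rng = random.Random(f"{layer}|{','.join(map(str, context))}")
    scale = 1.0 / (layer + 1)  # later layers make smaller refinements
    return [scale * rng.uniform(-1, 1) for _ in range(VOCAB)]


def next_token(context, num_layers):
    """Greedy next token using only the first `num_layers` layers.

    Every exit point shares the same head (here, argmax over the logits
    accumulated so far), mirroring the shared-exit idea in the abstract.
    """
    logits = [0.0] * VOCAB
    for layer in range(num_layers):
        update = layer_update(context, layer)
        logits = [a + b for a, b in zip(logits, update)]
    return max(range(VOCAB), key=logits.__getitem__)


def greedy_decode(prompt, n_new):
    """Baseline: every token is produced by the full layer stack."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out, NUM_LAYERS))
    return out


def self_speculative_decode(prompt, n_new, exit_layer=2, k=4):
    """Draft up to k tokens at `exit_layer`, then verify with all layers."""
    out = list(prompt)
    target_len = len(prompt) + n_new
    accepted = drafted = 0
    while len(out) < target_len:
        # Draft stage: early exit after `exit_layer` layers.
        draft = []
        for _ in range(min(k, target_len - len(out))):
            draft.append(next_token(out + draft, exit_layer))
        drafted += len(draft)
        # Verify stage: compare each drafted token with the full model.
        for i, tok in enumerate(draft):
            full_tok = next_token(out + draft[:i], NUM_LAYERS)
            if full_tok != tok:
                # Keep the agreeing prefix, then the full model's correction.
                out.extend(draft[:i])
                out.append(full_tok)
                break
            accepted += 1
        else:
            out.extend(draft)  # every drafted token was accepted
    rate = accepted / drafted if drafted else 0.0
    return out, rate


if __name__ == "__main__":
    tokens, rate = self_speculative_decode([1, 2, 3], n_new=12)
    print(tokens, f"acceptance rate: {rate:.2f}")
```

Because every accepted token is checked against the full stack, the output matches plain greedy decoding with all layers; any speedup in a real system would come from how often the early-exit drafts are accepted.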
Related papers
- DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models [55.608981341747246]
We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss.
Our analysis of the adaptivity of DAISY shows that the model exits early (using fewer layers) on clean data but exits late (using more layers) on noisy data.
arXiv Detail & Related papers (2024-06-08T12:58:13Z) - Entropy Guided Extrapolative Decoding to Improve Factuality in Large Language Models [55.45444773200529]
Large language models (LLMs) exhibit impressive natural language capabilities but suffer from hallucination.
Recent work has focused on decoding techniques to improve factuality during inference.
arXiv Detail & Related papers (2024-04-14T19:45:35Z) - Towards Efficient Fine-tuning of Pre-trained Code Models: An Experimental Study and Beyond [52.656743602538825]
Fine-tuning pre-trained code models incurs a large computational cost.
We conduct an experimental study to explore what happens to layer-wise pre-trained representations and their encoded code knowledge during fine-tuning.
We propose Telly to efficiently fine-tune pre-trained code models via layer freezing.
arXiv Detail & Related papers (2023-04-11T13:34:13Z) - Boosting Low-Data Instance Segmentation by Unsupervised Pre-training with Saliency Prompt [103.58323875748427]
This work offers a novel unsupervised pre-training solution for low-data regimes.
Inspired by the recent success of the Prompting technique, we introduce a new pre-training method that boosts QEIS models.
Experimental results show that our method significantly boosts several QEIS models on three datasets.
arXiv Detail & Related papers (2023-02-02T15:49:03Z) - Improving Reliability of Fine-tuning with Block-wise Optimisation [6.83082949264991]
Finetuning can be used to tackle domain-specific tasks by transferring knowledge.
We propose a novel block-wise optimization mechanism, which adapts the weights of a group of layers of a pre-trained model.
The proposed approaches are tested on an often-used dataset, Tf_flower.
arXiv Detail & Related papers (2023-01-15T16:20:18Z) - Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth 2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances, adversarial labels.
They strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the "NTK regime".
arXiv Detail & Related papers (2022-12-05T14:47:52Z) - Learning to Weight Samples for Dynamic Early-exiting Networks [35.03752825893429]
Early exiting is an effective paradigm for improving the inference efficiency of deep networks.
Our work proposes to adopt a weight prediction network to weight the loss of different training samples at each exit.
We show that the proposed weighting mechanism consistently improves the trade-off between classification accuracy and inference efficiency.
arXiv Detail & Related papers (2022-09-17T10:46:32Z) - Embedding Recycling for Language Models [38.11465250435789]
We study how to decrease computational cost in such settings through embedding recycling (ER).
We propose caching an intermediate layer's output from a pretrained model and finetuning the remaining layers for new tasks.
We show that our method provides a 100% speedup during training and a 55-86% speedup for inference, and has negligible impacts on accuracy for text classification and entity recognition tasks in the scientific domain.
arXiv Detail & Related papers (2022-07-11T16:36:14Z) - LV-BERT: Exploiting Layer Variety for BERT [85.27287501885807]
We introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models.
We then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture.
LV-BERT model obtained by our method outperforms BERT and its variants on various downstream tasks.
arXiv Detail & Related papers (2021-06-22T13:20:14Z) - Training with Multi-Layer Embeddings for Model Reduction [0.9046327456472286]
We introduce a multi-layer embedding training architecture that trains embeddings via a sequence of linear layers.
We show that it allows reducing the embedding dimension d by 4-8x, with a corresponding improvement in memory footprint, at a given model accuracy.
arXiv Detail & Related papers (2020-06-10T02:47:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.