Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large
Language Models for Dynamic Inference
- URL: http://arxiv.org/abs/2309.08968v2
- Date: Thu, 8 Feb 2024 22:43:04 GMT
- Title: Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large
Language Models for Dynamic Inference
- Authors: Parsa Kavehzadeh, Mojtaba Valipour, Marzieh Tahaei, Ali Ghodsi, Boxing
Chen, Mehdi Rezagholizadeh
- Abstract summary: We extend SortedNet to generative NLP tasks by replacing Standard Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT).
Our approach boosts model efficiency, eliminating the need for multiple models for various scenarios during inference.
Our results show the superior performance of sub-models in comparison to Standard Fine-Tuning and SFT+ICT (Early-Exit).
- Score: 32.62084449979531
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have revolutionized natural language processing
(NLP) by excelling at understanding and generating human-like text. However,
their widespread deployment can be prohibitively expensive. SortedNet is a
recent training technique for enabling dynamic inference by leveraging the
modularity in networks and sorting sub-models based on computation/accuracy in
a nested manner. We extend SortedNet to generative NLP tasks, making large
language models dynamic without any Pre-Training and by only replacing Standard
Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT). Our approach boosts model
efficiency, eliminating the need for multiple models for various scenarios
during inference. We show that this approach can unlock the power of
intermediate layers of transformers in generating the target output. Our
sub-models remain integral components of the original model, minimizing storage
requirements and transition costs between different computational/latency
budgets. The efficacy of our proposed method was demonstrated by applying it to
tune LLaMA 2 13B on the Stanford Alpaca dataset for instruction following and
TriviaQA for closed-book question answering. Our results show the superior
performance of sub-models in comparison to Standard Fine-Tuning and SFT+ICT
(Early-Exit), all achieved with efficient tuning and without additional memory
usage during inference.
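To make the nested sub-model idea concrete, here is a minimal PyTorch sketch of sorted-style fine-tuning: one shared output head is applied after several intermediate layers and the per-exit losses are averaged, so each prefix of the layer stack remains a usable sub-model inside the full model. The toy architecture, the choice of exit layers, and the uniform loss weighting are illustrative assumptions, not the exact recipe used in the paper.

    # Illustrative sketch only: a shared LM head is attached after several
    # intermediate layers, and next-token losses from all exits are averaged.
    # Exit depths and uniform weighting are assumptions, not the paper's recipe.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinySortedLM(nn.Module):
        def __init__(self, vocab=1000, d=64, n_layers=8, exits=(2, 4, 6, 8)):
            super().__init__()
            self.embed = nn.Embedding(vocab, d)
            self.layers = nn.ModuleList(
                [nn.TransformerEncoderLayer(d, nhead=4, batch_first=True) for _ in range(n_layers)]
            )
            self.head = nn.Linear(d, vocab)   # one output head shared by every exit
            self.exits = exits

        def forward(self, tokens):
            L = tokens.size(1)
            mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)  # causal mask
            h = self.embed(tokens)
            all_logits = []
            for depth, layer in enumerate(self.layers, start=1):
                h = layer(h, src_mask=mask)
                if depth in self.exits:       # the first `depth` layers act as a sub-model
                    all_logits.append(self.head(h))
            return all_logits

    def sorted_ft_loss(model, tokens):
        # Average next-token loss over all exits so shallow and deep sub-models
        # are trained together inside one set of weights.
        inputs, targets = tokens[:, :-1], tokens[:, 1:]
        losses = [F.cross_entropy(lg.reshape(-1, lg.size(-1)), targets.reshape(-1))
                  for lg in model(inputs)]
        return torch.stack(losses).mean()

    model = TinySortedLM()
    loss = sorted_ft_loss(model, torch.randint(0, 1000, (4, 32)))
    loss.backward()

At inference time, computation in such a sketch can be cut at any of the exit depths without loading a separate model, which is the storage and transition benefit the abstract describes.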
Related papers
- Dynamic layer selection in decoder-only transformers [21.18795712840146]
We empirically examine two common dynamic inference methods for natural language generation.
We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping.
We also show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains.
arXiv Detail & Related papers (2024-10-26T00:44:11Z)
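As a rough illustration of the two dynamic-inference strategies contrasted in the entry above, the sketch below runs a generic layer stack either with early exit (stop after a fixed depth) or with layer skipping (execute only a subset of layers). The fixed-depth and skip-every-other policies are placeholders, not the selection methods studied in that paper.

    # Toy contrast between early exit and layer skipping on a generic layer stack.
    # The policies used here are placeholders, not the paper's selection methods.
    import torch
    import torch.nn as nn

    def run_with_early_exit(layers, head, h, exit_at):
        # Early exit: stop after `exit_at` layers and decode from that hidden state.
        for layer in layers[:exit_at]:
            h = layer(h)
        return head(h)

    def run_with_layer_skipping(layers, head, h, keep_every=2):
        # Layer skipping: traverse the whole stack but execute only some layers.
        for i, layer in enumerate(layers):
            if i % keep_every == 0:
                h = layer(h)
        return head(h)

    d, vocab = 64, 1000
    layers = nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.ReLU()) for _ in range(8)])
    head = nn.Linear(d, vocab)
    h = torch.randn(2, 16, d)
    print(run_with_early_exit(layers, head, h, exit_at=4).shape)        # torch.Size([2, 16, 1000])
    print(run_with_layer_skipping(layers, head, h, keep_every=2).shape)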
- COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks.
We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges.
Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
- Unlocking the Potential of Model Merging for Low-Resource Languages [66.7716891808697]
Adapting large language models to new languages typically involves continual pre-training (CT) followed by supervised fine-tuning (SFT).
We propose model merging as an alternative for low-resource languages, combining models with distinct capabilities into a single model without additional training.
Experiments based on Llama-2-7B demonstrate that model merging effectively endows LLMs for low-resource languages with task-solving abilities, outperforming CT-then-SFT in scenarios with extremely scarce data.
arXiv Detail & Related papers (2024-07-04T15:14:17Z)
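One simple instance of model merging is weighted parameter averaging across checkpoints that share an architecture; the sketch below shows only that baseline form, and the merging scheme evaluated in the entry above may differ.

    # Minimal sketch: merge two fine-tuned checkpoints of the same architecture
    # by element-wise parameter interpolation. This is a generic baseline, not
    # necessarily the merging method used in the paper above.
    import torch

    def merge_state_dicts(sd_a, sd_b, alpha=0.5):
        # Interpolate parameters with matching names and shapes.
        return {k: alpha * sd_a[k] + (1.0 - alpha) * sd_b[k] for k in sd_a}

    # Hypothetical usage with two models of identical architecture:
    # merged = merge_state_dicts(lang_model.state_dict(), task_model.state_dict(), alpha=0.6)
    # base_model.load_state_dict(merged)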
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- SortedNet: A Scalable and Generalized Framework for Training Modular Deep Neural Networks [30.069353400127046]
We propose SortedNet to harness the inherent modularity of deep neural networks (DNNs).
SortedNet enables the training of sub-models simultaneously along with the training of the main model.
It is able to train 160 sub-models at once, achieving at least 96% of the original model's performance.
arXiv Detail & Related papers (2023-09-01T05:12:25Z)
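One way sub-models can be trained simultaneously with the main model is to sample a random depth at each step and back-propagate from that depth's output; the toy classifier below sketches that idea. The sampling scheme and objective are illustrative and may not match SortedNet's exact procedure.

    # Toy version of training many nested sub-models at once: sample a depth per
    # step and compute the loss from that depth. Sampling and objective details
    # are assumptions, not SortedNet's exact procedure.
    import random
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def training_step(layers, head, x, targets, optimizer):
        depth = random.randint(1, len(layers))      # pick one nested sub-model
        h = x
        for layer in layers[:depth]:
            h = layer(h)
        loss = F.cross_entropy(head(h), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return depth, loss.item()

    d, n_classes = 32, 10
    layers = nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.ReLU()) for _ in range(6)])
    head = nn.Linear(d, n_classes)
    opt = torch.optim.SGD(list(layers.parameters()) + list(head.parameters()), lr=0.1)
    for _ in range(3):
        x, y = torch.randn(8, d), torch.randint(0, n_classes, (8,))
        print(training_step(layers, head, x, y, opt))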
- Cheaply Evaluating Inference Efficiency Metrics for Autoregressive Transformer APIs [66.30706841821123]
Large language models (LLMs) power many state-of-the-art systems in natural language processing.
LLMs are extremely computationally expensive, even at inference time.
We propose a new metric for comparing inference efficiency across models.
arXiv Detail & Related papers (2023-05-03T21:51:42Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting language models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
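The parameter-efficiency recipe summarized above (freeze the language model, train one linear projection over perceptual features plus a single learnable token) can be pictured as follows; the stand-in backbone, the dimensions, and the injection point are assumptions for illustration.

    # Sketch of the recipe above: freeze the LM, train only a projection of
    # perceptual features and one soft prompt token. The GRU is a stand-in
    # backbone; dimensions and injection point are illustrative assumptions.
    import torch
    import torch.nn as nn

    class FrozenLMWithPerception(nn.Module):
        def __init__(self, d=64, vision_dim=128, vocab=1000):
            super().__init__()
            self.lm_embed = nn.Embedding(vocab, d)
            self.lm_body = nn.GRU(d, d, batch_first=True)    # stand-in for a frozen LM
            self.lm_head = nn.Linear(d, vocab)
            for p in self.parameters():
                p.requires_grad_(False)                      # freeze everything defined so far

            # The only trainable pieces: one projection and one soft token.
            self.visual_proj = nn.Linear(vision_dim, d)
            self.soft_token = nn.Parameter(torch.zeros(1, 1, d))

        def forward(self, tokens, visual_feat):
            v = self.visual_proj(visual_feat).unsqueeze(1)        # (B, 1, d)
            t = self.soft_token.expand(tokens.size(0), -1, -1)    # (B, 1, d)
            x = torch.cat([t, v, self.lm_embed(tokens)], dim=1)   # prepend prompt + perception
            h, _ = self.lm_body(x)
            return self.lm_head(h)

    model = FrozenLMWithPerception()
    print([n for n, p in model.named_parameters() if p.requires_grad])  # visual_proj.*, soft_token
    print(model(torch.randint(0, 1000, (2, 8)), torch.randn(2, 128)).shape)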
- Revisiting Simple Neural Probabilistic Language Models [27.957834093475686]
This paper revisits the neural probabilistic language model (NPLM) of Bengio et al. (2003).
When scaled up to modern hardware, this model performs much better than expected on word-level language model benchmarks.
Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer.
arXiv Detail & Related papers (2021-04-08T02:18:47Z)
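A local concatenation layer of the kind mentioned above can be sketched as follows: each position concatenates the embeddings of itself and the previous few tokens and projects them back to the model width. The window size and projection details here are illustrative assumptions.

    # Sketch of a causal local concatenation layer: each position sees only the
    # embeddings of itself and the previous (window - 1) tokens, concatenated and
    # projected back to the model width. Window size is an assumption.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LocalConcatLayer(nn.Module):
        def __init__(self, d_model=64, window=5):
            super().__init__()
            self.window = window
            self.proj = nn.Linear(window * d_model, d_model)

        def forward(self, x):                                  # x: (B, L, d)
            B, L, d = x.shape
            padded = F.pad(x, (0, 0, self.window - 1, 0))      # left-pad along the sequence
            windows = padded.unfold(1, self.window, 1)         # (B, L, d, window)
            windows = windows.permute(0, 1, 3, 2).reshape(B, L, self.window * d)
            return torch.relu(self.proj(windows))

    layer = LocalConcatLayer()
    print(layer(torch.randn(2, 10, 64)).shape)                 # torch.Size([2, 10, 64])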
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Rethinking embedding coupling in pre-trained language models [46.11201932668366]
We re-evaluate the standard practice of sharing weights between input and output embeddings in pre-trained language models.
We show that decoupled embeddings provide increased modeling flexibility, allowing us to significantly improve the efficiency of parameter allocation.
We are able to train models that achieve strong performance on the XTREME benchmark without increasing the number of parameters at the fine-tuning stage.
arXiv Detail & Related papers (2020-10-24T07:43:00Z)
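The coupling question in the last entry is the choice between tying the input embedding matrix to the output projection or keeping them separate; a minimal contrast of the two setups is sketched below with arbitrary sizes.

    # Tied vs. decoupled input/output embeddings; sizes are arbitrary.
    import torch
    import torch.nn as nn

    vocab, d = 1000, 64
    embed = nn.Embedding(vocab, d)

    # Tied: one matrix serves as both input embedding and output projection.
    tied_head = nn.Linear(d, vocab, bias=False)
    tied_head.weight = embed.weight              # weight sharing

    # Decoupled: an independent output matrix, freeing parameter allocation.
    decoupled_head = nn.Linear(d, vocab, bias=False)

    h = torch.randn(2, 8, d)
    print(tied_head(h).shape, decoupled_head(h).shape)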