Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE
- URL: http://arxiv.org/abs/2310.18581v2
- Date: Tue, 7 Nov 2023 05:44:17 GMT
- Title: Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via Instruction Tuning with LITE
- Authors: Neeraj Varshney, Agneet Chatterjee, Mihir Parmar, Chitta Baral
- Abstract summary: Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We propose instruction tuning with additional explicit losses from the intermediate layers (LITE) and show that it enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
- Score: 62.13435256279566
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable performance across a
wide variety of natural language tasks; however, their large size makes their
inference slow and computationally expensive. Focusing on this problem, we
propose to instruction tune LLMs with additional explicit losses from the
intermediate layers (LITE) and show that it enables these layers to acquire
'good' generation ability without affecting the generation ability of the final
layer. We perform 'dynamic confidence-based early exiting' at token level from
the intermediate layers which improves the efficiency of text generation
without compromising the quality of the generation. We conduct comprehensive
experiments by instruction tuning LLaMA-2 models on the Alpaca dataset and
holistically evaluate on four different human-instruction test sets. We show
that dynamic early exiting achieves consistent and considerable inference
computation cost improvements (37.86% for 7B and 46.35% for 13B model) while
maintaining the generation quality of the responses. We further conduct a
thorough analysis of the results over several important aspects, such as
comparing the semantic similarity of the outputs and dissecting the efficiency
improvements by comparing the number of tokens generated in the output. In
summary, our work contributes to improving the efficiency of LLM inference
while maintaining the generation quality, a crucial step en route to enabling
their widespread adoption.
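
For illustration, the training-time idea (adding explicit next-token losses at selected intermediate layers, decoded through the same LM head as the final layer) can be sketched as below. This is a minimal sketch on a toy model, not the authors' released code: the model dimensions, the choice of exit layers, and the auxiliary-loss weight are assumptions made only for illustration.

```python
# Minimal sketch of LITE-style instruction tuning: the usual next-token loss at
# the final layer plus explicit losses at selected intermediate layers, all
# decoded through the same LM head. Toy model sizes, the exit layers, and
# aux_weight are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalLM(nn.Module):
    def __init__(self, vocab=1000, d_model=64, n_layers=8, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        ])
        self.lm_head = nn.Linear(d_model, vocab, bias=False)  # shared by all exits

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Additive causal mask: -inf above the diagonal blocks future positions.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = self.embed(tokens)
        hidden = []
        for layer in self.layers:
            h = layer(h, src_mask=mask)    # causal self-attention
            hidden.append(h)
        return hidden                      # hidden state after every layer

def lite_loss(model, tokens, exit_layers=(3, 5), aux_weight=1.0):
    """Final-layer LM loss plus weighted LM losses from intermediate layers."""
    hidden = model(tokens)
    targets = tokens[:, 1:]                # next-token prediction targets

    def lm_loss(h):
        logits = model.lm_head(h[:, :-1])  # positions 0..T-2 predict tokens 1..T-1
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))

    loss = lm_loss(hidden[-1])             # standard final-layer objective
    for idx in exit_layers:                # additional explicit LITE losses
        loss = loss + aux_weight * lm_loss(hidden[idx])
    return loss

if __name__ == "__main__":
    model = TinyCausalLM()
    batch = torch.randint(0, 1000, (2, 16))  # stand-in for tokenized instructions
    print(lite_loss(model, batch).item())
```

In the paper's actual setup the same idea is applied to LLaMA-2 during instruction tuning on Alpaca, so that the layers intended as exits learn to produce usable token distributions while the final layer keeps its ordinary objective.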
Related papers
- An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking [50.81324768683995]
FIRST is a novel approach that integrates a learning-to-rank objective and leverages the logits of only the first generated token.
We extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains.
Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality.
arXiv Detail & Related papers (2024-11-08T12:08:17Z)
- SLED: Self Logits Evolution Decoding for Improving Factuality in Large Language Models [34.3296459569307]
Large language models (LLMs) have demonstrated remarkable capabilities, but their outputs can sometimes be unreliable or factually incorrect.
We introduce Self Logits Evolution Decoding (SLED), a novel decoding framework that enhances the truthfulness of LLMs.
We show that SLED consistently improves factual accuracy by up to 20% compared to existing decoding methods.
arXiv Detail & Related papers (2024-11-01T17:33:34Z)
- Strategic Optimization and Challenges of Large Language Models in Object-Oriented Programming [0.0]
This research focuses on method-level code generation within the Object-Oriented Programming (OOP) framework.
We devised experiments that varied the extent of contextual information in the prompts.
Our findings indicate that prompts enriched with method invocation details yield the highest cost-effectiveness.
arXiv Detail & Related papers (2024-08-27T07:44:16Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, securing up to 85% of the model's performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding [11.470005425117371]
We propose a novel Reinforcement Learning framework enhanced with Label-sensitive Reward (RLLR).
Our method aims to adeptly capture nuanced label-sensitive semantic features during RL, thereby enhancing natural language understanding.
Experiments conducted on five diverse foundation models across eight tasks showcase promising results.
arXiv Detail & Related papers (2024-05-30T07:19:31Z)
- Prompt Perturbation Consistency Learning for Robust Language Models [47.021022978847036]
Large language models (LLMs) have demonstrated impressive performance on a number of natural language processing tasks.
We show that fine-tuning sufficiently large LLMs can produce intent classification and slot filling (IC-SF) performance comparable to discriminative models.
We propose an efficient mitigation approach, Prompt Perturbation Consistency Learning (PPCL), which works by regularizing the divergence between losses from clean and perturbed samples.
arXiv Detail & Related papers (2024-02-24T15:00:58Z)
- DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models [79.01926242857613]
Large language models (LLMs) are prone to hallucinations, generating content that deviates from facts seen during pretraining.
We propose a simple decoding strategy for reducing hallucinations with pretrained LLMs.
We find that this Decoding by Contrasting Layers (DoLa) approach is able to better surface factual knowledge and reduce the generation of incorrect facts.
arXiv Detail & Related papers (2023-09-07T17:45:31Z)
- Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
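
Like CALM above, the main paper's 'dynamic confidence-based early exiting' decides per generated token how many layers to run: once an intermediate layer trained with a LITE loss is confident enough, its prediction is emitted and the remaining layers are skipped. The sketch below is a hypothetical illustration of such a decoding loop; the softmax-confidence criterion, the 0.9 threshold, and the untrained toy modules are assumptions rather than the paper's exact procedure.

```python
# Minimal sketch of dynamic confidence-based early exiting at the token level:
# after each permitted exit layer, decode the last position with the shared LM
# head and stop the forward pass once the top-token probability clears a
# threshold. The confidence measure, the 0.9 threshold, and the untrained toy
# modules below are illustrative assumptions, not the paper's exact setup.
import torch
import torch.nn as nn
import torch.nn.functional as F

@torch.no_grad()
def generate_with_early_exit(embed, layers, lm_head, prompt,
                             max_new_tokens=20, exit_layers=(3, 5),
                             threshold=0.9):
    tokens = prompt.clone()                          # shape (1, T)
    layers_used = []                                 # layers run for each token
    for _ in range(max_new_tokens):
        seq_len = tokens.size(1)
        # Recomputes the full prefix each step; a KV cache is omitted for brevity.
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        h = embed(tokens)
        next_token, used = None, len(layers)
        for i, layer in enumerate(layers):
            h = layer(h, src_mask=mask)              # causal self-attention
            is_last = (i == len(layers) - 1)
            if i in exit_layers or is_last:          # candidate exit point
                probs = F.softmax(lm_head(h[:, -1]), dim=-1)
                conf, cand = probs.max(dim=-1)
                if conf.item() >= threshold or is_last:
                    next_token, used = cand, i + 1   # exit early if confident
                    break
        layers_used.append(used)
        tokens = torch.cat([tokens, next_token.view(1, 1)], dim=1)
    return tokens, layers_used

if __name__ == "__main__":
    # Untrained toy components standing in for a LITE-tuned model.
    embed = nn.Embedding(1000, 64)
    layers = nn.ModuleList([nn.TransformerEncoderLayer(64, 4, batch_first=True)
                            for _ in range(8)])
    lm_head = nn.Linear(64, 1000, bias=False)
    prompt = torch.randint(0, 1000, (1, 8))
    out, used = generate_with_early_exit(embed, layers, lm_head, prompt)
    print(out.shape, used)                           # fewer layers => less compute
```

In this sketch, raising the threshold trades away compute savings (fewer early exits) in exchange for outputs that track full-depth decoding more closely.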