Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via
Instruction Tuning with LITE
- URL: http://arxiv.org/abs/2310.18581v2
- Date: Tue, 7 Nov 2023 05:44:17 GMT
- Title: Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via
Instruction Tuning with LITE
- Authors: Neeraj Varshney, Agneet Chatterjee, Mihir Parmar, Chitta Baral
- Abstract summary: Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that instruction tuning with explicit losses from the intermediate layers (LITE) enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
- Score: 62.13435256279566
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable performance across a
wide variety of natural language tasks; however, their large size makes their
inference slow and computationally expensive. Focusing on this problem, we
propose to instruction tune LLMs with additional explicit losses from the
intermediate layers (LITE) and show that it enables these layers to acquire
'good' generation ability without affecting the generation ability of the final
layer. We perform 'dynamic confidence-based early exiting' at the token level
from the intermediate layers, which improves the efficiency of text generation
without compromising the quality of the generation. We conduct comprehensive
experiments by instruction tuning LLaMA-2 models on the Alpaca dataset and
holistically evaluate on four different human-instruction test sets. We show
that dynamic early exiting achieves consistent and considerable inference
computation cost improvements (37.86% for the 7B model and 46.35% for the 13B model) while
maintaining the generation quality of the responses. We further conduct a
thorough analysis of the results over several important aspects, such as
comparing the semantic similarity of the outputs and dissecting the efficiency
improvements by comparing the number of tokens generated in the output. In
summary, our work contributes to improving the efficiency of LLM inference
while maintaining the generation quality, a crucial step en route to enabling
their widespread adoption.
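To make the mechanism concrete, the sketch below illustrates the two ideas described in the abstract in PyTorch-style code: (i) an instruction-tuning loss that adds explicit cross-entropy terms from selected intermediate layers on top of the final-layer loss, and (ii) token-level greedy decoding that exits at the first designated layer whose prediction is sufficiently confident. This is a minimal illustration, not the authors' implementation; the exit-layer indices, the 0.9 confidence threshold, the reuse of a single LM head for every exit layer, and the omission of KV-cache handling are assumptions made here for brevity.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the two ideas in the abstract; NOT the authors' released code.
# Exit-layer indices, the confidence threshold, and the shared LM head are assumptions.

EXIT_LAYERS = (8, 16, 24)   # intermediate layers that receive explicit losses (assumed)
CONF_THRESHOLD = 0.9        # token-level early-exit confidence threshold (assumed)


def lite_loss(hidden_states, lm_head, labels):
    """Instruction-tuning loss: final-layer cross-entropy plus explicit
    auxiliary losses computed from selected intermediate layers (LITE-style).

    hidden_states: list of per-layer hidden states, each (batch, seq, dim),
                   with hidden_states[-1] being the final layer.
    lm_head:       the model's output projection, reused for every exit layer.
    labels:        (batch, seq) token ids, already shifted for causal LM.
    """
    total = 0.0
    for layer in list(EXIT_LAYERS) + [len(hidden_states) - 1]:
        logits = lm_head(hidden_states[layer])
        total = total + F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            labels.reshape(-1),
            ignore_index=-100,
        )
    return total


@torch.no_grad()
def early_exit_next_token(blocks, lm_head, x):
    """Greedy decoding of one token with confidence-based early exiting.

    blocks: the transformer layers, applied in order (KV-cache handling for
            skipped layers is omitted for brevity).
    x:      hidden state at the current position, shape (1, dim); batch of 1 assumed.
    Returns (token_id, index_of_exit_layer).
    """
    last = len(blocks) - 1
    for i, block in enumerate(blocks):
        x = block(x)
        if i in EXIT_LAYERS or i == last:
            probs = F.softmax(lm_head(x), dim=-1)
            conf, token = probs.max(dim=-1)
            if conf.item() >= CONF_THRESHOLD or i == last:
                return token.item(), i
```

In this sketch a token exits early only when an intermediate layer's confidence clears the threshold; otherwise computation continues to the final layer, which is consistent with the abstract's claim that final-layer generation quality is preserved.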
Related papers
- Leveraging the true depth of LLMs [46.81174316936993]
Large Language Models demonstrate remarkable capabilities at the cost of high compute requirements.
We investigate several potential ways to reduce the depth of pre-trained LLMs without significantly affecting performance.
We present a novel approach that exploits a decoupling between consecutive layers by grouping some of them into pairs that can be evaluated in parallel.
arXiv Detail & Related papers (2025-02-05T00:26:27Z)
- Clear Minds Think Alike: What Makes LLM Fine-tuning Robust? A Study of Token Perplexity [61.48338027901318]
We show that fine-tuning with LLM-generated data improves target task performance and reduces out-of-domain degradation.
This is the first mechanistic explanation for the superior OOD robustness conferred by LLM-generated training data.
arXiv Detail & Related papers (2025-01-24T08:18:56Z)
- Adaptive Pruning for Large Language Models with Structural Importance Awareness [66.2690963378878]
Large language models (LLMs) have significantly improved language understanding and generation capabilities.
LLMs are difficult to deploy on resource-constrained edge devices due to their high computational and storage resource demands.
We propose structurally-aware adaptive pruning (SAAP) to significantly reduce the computational and memory costs while maintaining model performance.
arXiv Detail & Related papers (2024-12-19T18:08:04Z)
- Strategic Optimization and Challenges of Large Language Models in Object-Oriented Programming [0.0]
This research focuses on method-level code generation within the Object-Oriented Programming (OOP) framework.
We devised experiments that varied the extent of contextual information in the prompts.
Our findings indicate that prompts enriched with method invocation details yield the highest cost-effectiveness.
arXiv Detail & Related papers (2024-08-27T07:44:16Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves comparable performance to the source model, securing up to 85% of its performance while obtaining over a 30% increase in inference speed.
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding [11.470005425117371]
We propose a novel Reinforcement Learning framework enhanced with Label-sensitive Reward (RLLR).
Our method aims to adeptly capture nuanced label-sensitive semantic features during RL, thereby enhancing natural language understanding.
Experiments conducted on five diverse foundation models across eight tasks showcase promising results.
arXiv Detail & Related papers (2024-05-30T07:19:31Z)
- Prompt Perturbation Consistency Learning for Robust Language Models [47.021022978847036]
Large language models (LLMs) have demonstrated impressive performance on a number of natural language processing tasks.
We show that fine-tuning sufficiently large LLMs can produce IC-SF (intent classification and slot filling) performance comparable to discriminative models.
We propose an efficient mitigation approach, Prompt Perturbation Consistency Learning (PPCL), which works by regularizing the divergence between losses from clean and perturbed samples.
arXiv Detail & Related papers (2024-02-24T15:00:58Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $\times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)