Dynamic Vocabulary Pruning in Early-Exit LLMs
- URL: http://arxiv.org/abs/2410.18952v2
- Date: Wed, 30 Oct 2024 15:28:02 GMT
- Title: Dynamic Vocabulary Pruning in Early-Exit LLMs
- Authors: Jort Vincenti, Karim Abdel Sadek, Joan Velja, Matteo Nulli, Metod Jazbec,
- Abstract summary: Increasing the size of large language models (LLMs) has been shown to lead to better performance.
This comes at the cost of slower and more expensive inference.
We propose dynamically pruning the vocabulary at test time for each token.
- Score: 0.11983702508388193
- License:
- Abstract: Increasing the size of large language models (LLMs) has been shown to lead to better performance. However, this comes at the cost of slower and more expensive inference. Early-exiting is a promising approach for improving the efficiency of LLM inference by enabling next token prediction at intermediate layers. Yet, the large vocabulary size in modern LLMs makes the confidence estimation required for exit decisions computationally expensive, diminishing the efficiency gains. To address this, we propose dynamically pruning the vocabulary at test time for each token. Specifically, the vocabulary is pruned at one of the initial layers, and the smaller vocabulary is then used throughout the rest of the forward pass. Our experiments demonstrate that such post-hoc dynamic vocabulary pruning improves the efficiency of confidence estimation in early-exit LLMs while maintaining competitive performance.
Related papers
- Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints [20.844061807562436]
We propose SENSE, a novel prompting approach that embeds semantic hints within the prompt.
Experiments show that SENSE consistently improves LLMs' performance across various tasks.
arXiv Detail & Related papers (2024-09-22T14:35:09Z) - An Efficient Inference Framework for Early-exit Large Language Models [5.048467183620882]
Early-exit models improve the inference efficiency of LLMs by skipping rest layers and directly generate output tokens when confident enough.
There is no work of LLM inference framework that takes early-exit models into consideration.
We solve two key challenges in building efficient inference framework for early-exit models: (1) batch inference at iteration-level granularity; and (2) KV cache management.
arXiv Detail & Related papers (2024-07-25T07:50:17Z) - Large Vocabulary Size Improves Large Language Models [28.83786065307658]
We investigate the relationship between subword vocabulary size and the performance of large language models (LLMs)
Experimental results show that larger vocabulary sizes lead to better performance in LLMs.
We introduce a simple method to use a new vocabulary instead of the pre-defined one.
arXiv Detail & Related papers (2024-06-24T10:27:07Z) - Exploring Design Choices for Building Language-Specific LLMs [36.32622880071991]
We study building language-specific language models by adapting monolingual and multilingual models.
We find that the initial performance of LLM does not always correlate with the final performance after the adaptation.
arXiv Detail & Related papers (2024-06-20T18:47:43Z) - The Strong Pull of Prior Knowledge in Large Language Models and Its Impact on Emotion Recognition [74.04775677110179]
In-context Learning (ICL) has emerged as a powerful paradigm for performing natural language tasks with Large Language Models (LLM)
We show that LLMs have strong yet inconsistent priors in emotion recognition that ossify their predictions.
Our results suggest that caution is needed when using ICL with larger LLMs for affect-centered tasks outside their pre-training domain.
arXiv Detail & Related papers (2024-03-25T19:07:32Z) - Accelerating LLaMA Inference by Enabling Intermediate Layer Decoding via
Instruction Tuning with LITE [62.13435256279566]
Large Language Models (LLMs) have achieved remarkable performance across a wide variety of natural language tasks.
However, their large size makes their inference slow and computationally expensive.
We show that it enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer.
arXiv Detail & Related papers (2023-10-28T04:07:58Z) - GrowLength: Accelerating LLMs Pretraining by Progressively Growing
Training Length [65.24730341801468]
This paper introduces a novel, simple, and effective method named growlength'' to accelerate the pretraining process of Large Language Models.
Our method progressively increases the training length throughout the pretraining phase, thereby mitigating computational costs and enhancing efficiency.
arXiv Detail & Related papers (2023-10-01T05:25:24Z) - Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools.
Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions.
Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z) - Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $times 3$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z) - Better Language Model with Hypernym Class Prediction [101.8517004687825]
Class-based language models (LMs) have been long devised to address context sparsity in $n$-gram LMs.
In this study, we revisit this approach in the context of neural LMs.
arXiv Detail & Related papers (2022-03-21T01:16:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.