HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM
Inference
- URL: http://arxiv.org/abs/2402.09360v1
- Date: Wed, 14 Feb 2024 18:04:36 GMT
- Title: HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM
Inference
- Authors: Yashas Samaga B L and Varun Yerram and Chong You and Srinadh
Bhojanapalli and Sanjiv Kumar and Prateek Jain and Praneeth Netrapalli
- Abstract summary: HiRE comprises of two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one billion parameter model, HiRE applied to both the softmax as well as feedforward layers, achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47times$ on a single TPUv5e device.
- Score: 68.59839755875252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive decoding with generative Large Language Models (LLMs) on
accelerators (GPUs/TPUs) is often memory-bound where most of the time is spent
on transferring model parameters from high bandwidth memory (HBM) to cache. On
the other hand, recent works show that LLMs can maintain quality with
significant sparsity/redundancy in the feedforward (FFN) layers by
appropriately training the model to operate on a top-$k$ fraction of
rows/columns (where $k \approx 0.05$), there by suggesting a way to reduce the
transfer of model parameters, and hence latency. However, exploiting this
sparsity for improving latency is hindered by the fact that identifying top
rows/columns is data-dependent and is usually performed using full matrix
operations, severely limiting potential gains. To address these issues, we
introduce HiRE (High Recall Approximate Top-k Estimation). HiRE comprises of
two novel components: (i) a compression scheme to cheaply predict top-$k$
rows/columns with high recall, followed by full computation restricted to the
predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate
top-$k$ operator. We demonstrate that on a one billion parameter model, HiRE
applied to both the softmax as well as feedforward layers, achieves almost
matching pretraining and downstream accuracy, and speeds up inference latency
by $1.47\times$ on a single TPUv5e device.
Related papers
- HELENE: Hessian Layer-wise Clipping and Gradient Annealing for Accelerating Fine-tuning LLM with Zeroth-order Optimization [18.00873866263434]
Fine-tuning large language models (LLMs) poses significant memory challenges.
Recent work, MeZO, addresses this issue using a zeroth-order (ZO) optimization method.
We introduce HELENE, a novel scalable and memory-efficient pre-conditioner.
arXiv Detail & Related papers (2024-11-16T04:27:22Z) - Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
arXiv Detail & Related papers (2024-11-04T04:58:20Z) - Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers.
Next frontier is LLM personalization, where a foundation model can be fine-tuned with user/task-specific data.
Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
arXiv Detail & Related papers (2023-06-03T05:01:51Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS, for matrix production with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - BERMo: What can BERT learn from ELMo? [6.417011237981518]
We use linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths.
Our approach has two-fold benefits: (1) improved gradient flow for the downstream task and (2) increased representative power.
arXiv Detail & Related papers (2021-10-18T17:35:41Z) - A Learning-Based Fast Uplink Grant for Massive IoT via Support Vector
Machines and Long Short-Term Memory [8.864453148536057]
3IoT introduced the need to use fast uplink grant (FUG) allocation in order to reduce latency and increase reliability for smart internet-of-things (mMTC) applications.
We propose a novel FUG allocation based on support machine scheduler (SVM)
Second, LSTM architecture is used for traffic prediction and correction techniques to overcome prediction errors.
arXiv Detail & Related papers (2021-08-02T11:33:02Z) - CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.