HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM
Inference
- URL: http://arxiv.org/abs/2402.09360v1
- Date: Wed, 14 Feb 2024 18:04:36 GMT
- Title: HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM
Inference
- Authors: Yashas Samaga B L and Varun Yerram and Chong You and Srinadh
Bhojanapalli and Sanjiv Kumar and Prateek Jain and Praneeth Netrapalli
- Abstract summary: HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves nearly matching pretraining and downstream accuracy, and reduces inference latency by $1.47\times$ on a single TPUv5e device.
- Score: 68.59839755875252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive decoding with generative Large Language Models (LLMs) on
accelerators (GPUs/TPUs) is often memory-bound, as most of the time is spent
on transferring model parameters from high bandwidth memory (HBM) to cache. On
the other hand, recent works show that LLMs can maintain quality with
significant sparsity/redundancy in the feedforward (FFN) layers by
appropriately training the model to operate on a top-$k$ fraction of
rows/columns (where $k \approx 0.05$), thereby suggesting a way to reduce the
transfer of model parameters, and hence latency. However, exploiting this
sparsity for improving latency is hindered by the fact that identifying top
rows/columns is data-dependent and is usually performed using full matrix
operations, severely limiting potential gains. To address these issues, we
introduce HiRE (High Recall Approximate Top-$k$ Estimation). HiRE comprises
two novel components: (i) a compression scheme to cheaply predict top-$k$
rows/columns with high recall, followed by full computation restricted to the
predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate
top-$k$ operator. We demonstrate that on a one-billion-parameter model, HiRE
applied to both the softmax and feedforward layers achieves nearly matching
pretraining and downstream accuracy, and reduces inference latency by
$1.47\times$ on a single TPUv5e device.
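The two-stage idea in component (i) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the layer sizes, the oversampling factor, and the int8-style quantizer standing in for the compression scheme are all assumptions, and the multi-device DA-TOP-$k$ operator is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, k = 512, 2048, 64  # hypothetical layer sizes
W = rng.standard_normal((d_in, d_out)).astype(np.float32)
x = rng.standard_normal(d_in).astype(np.float32)

# Cheap predictor: an int8-style quantized copy of W, used only for scoring.
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)

def approx_topk_matvec(x, W, W_q, scale, k, oversample=2):
    """Score columns with the compressed weights, keep an oversampled
    candidate set for high recall, then run the exact computation
    restricted to that subset."""
    approx = (x @ W_q.astype(np.float32)) * scale   # cheap approximate scores
    k_prime = min(oversample * k, W.shape[1])
    cand = np.argpartition(approx, -k_prime)[-k_prime:]  # high-recall superset
    exact = x @ W[:, cand]                          # full precision, small subset
    return cand[np.argpartition(exact, -k)[-k:]]

pred = approx_topk_matvec(x, W, W_q, scale, k)
true = np.argpartition(x @ W, -k)[-k:]
recall = len(np.intersect1d(pred, true)) / k
print(f"recall of the exact top-{k}: {recall:.2f}")
```

Because the quantization error of the scores is small relative to the spread of the true scores, a modest oversampling factor is typically enough to recover the exact top-$k$ set with high recall while touching only a fraction of the weight matrix at full precision.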
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - No Need to Look Back: An Efficient and Scalable Approach for Temporal
Network Representation Learning [9.218415145210715]
This paper introduces a novel efficient TGRL framework, No-Looking-Back (NLB).
NLB employs a "forward recent sampling" strategy, which bypasses the need for backtracking historical interactions.
Empirical evaluations demonstrate that NLB matches or surpasses state-of-the-art methods in accuracy for link prediction and node classification.
arXiv Detail & Related papers (2024-02-03T00:12:36Z) - Scaling Sparse Fine-Tuning to Large Language Models [67.59697720719672]
Large Language Models (LLMs) are difficult to fully fine-tune due to their sheer number of parameters.
We propose SpIEL, a novel sparse fine-tuning method that maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values.
We show that SpIEL is superior to popular parameter-efficient fine-tuning methods like LoRA in terms of performance and comparable in terms of run time.
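The index-plus-delta representation described above can be sketched generically. This is a hedged illustration of sparse-delta fine-tuning, not SpIEL itself: the parameter count, the fixed index set, and the plain SGD update are all assumptions (SpIEL additionally grows and prunes the index set during training).

```python
import numpy as np

rng = np.random.default_rng(1)
pretrained = rng.standard_normal(10_000).astype(np.float32)  # flat parameter vector

# Only a small set of parameter indices is tuned; the rest stay pretrained.
n_active = 100
indices = rng.choice(pretrained.size, size=n_active, replace=False)
deltas = np.zeros(n_active, dtype=np.float32)

def effective_params(pretrained, indices, deltas):
    """Materialize the tuned model: pretrained values plus sparse deltas."""
    out = pretrained.copy()
    out[indices] += deltas
    return out

def sgd_step(deltas, grad_full, indices, lr=0.1):
    """Update only the tracked deltas; gradients elsewhere are discarded."""
    return deltas - lr * grad_full[indices]

grad = rng.standard_normal(pretrained.size).astype(np.float32)
deltas = sgd_step(deltas, grad, indices)
tuned = effective_params(pretrained, indices, deltas)
changed = np.flatnonzero(tuned != pretrained)
print(f"{changed.size} of {pretrained.size} parameters changed")
```

The optimizer state and the stored checkpoint both scale with the number of active indices rather than the full parameter count, which is the source of the memory savings.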
arXiv Detail & Related papers (2024-01-29T18:43:49Z) - On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
arXiv Detail & Related papers (2023-06-03T05:01:51Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of
Language Model [92.55145016562867]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
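A generic column-row sampling (CRS) estimator of the kind WTA-CRS builds on can be sketched as follows. This is a textbook norm-proportional sampler for illustration, not the WTA-CRS estimator itself; the matrix sizes and sample counts are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
B = rng.standard_normal((256, 32))

def crs_matmul(A, B, n_samples, rng):
    """Unbiased column-row sampling estimate of A @ B: draw index i with
    probability proportional to ||A[:, i]|| * ||B[i, :]||, then rescale."""
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=n_samples, p=p)
    scale = 1.0 / (n_samples * p[idx])  # rescaling makes the estimator unbiased
    return (A[:, idx] * scale) @ B[idx, :]

exact = A @ B
# Averaging independent estimates converges to the exact product (unbiasedness).
mean_est = np.mean([crs_matmul(A, B, 128, rng) for _ in range(200)], axis=0)
rel_err = np.linalg.norm(mean_est - exact) / np.linalg.norm(exact)
print(f"relative error of the averaged estimate: {rel_err:.3f}")
```

Unbiasedness is what makes such estimators usable inside gradient computation; the contribution of WTA-CRS is to reduce the variance of the estimate in the transformer-tuning setting.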
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - QuaLA-MiniLM: a Quantized Length Adaptive MiniLM [5.36703735486629]
Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized.
A knowledge distillation approach addresses computational efficiency by self-distilling BERT into a smaller transformer representation with fewer layers and smaller internal embeddings.
Dynamic-TinyBERT tackles both limitations by partially implementing the Length Adaptive Transformer (LAT) technique onto TinyBERT, achieving a 3x speedup over BERT-base with minimal accuracy loss.
We use MiniLM distillation jointly with the LAT method, and we further enhance the efficiency by applying low-bit quantization.
arXiv Detail & Related papers (2022-10-31T07:42:52Z) - BERMo: What can BERT learn from ELMo? [6.417011237981518]
We use the linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths.
Our approach has two-fold benefits: (1) improved gradient flow for the downstream task and (2) increased representative power.
arXiv Detail & Related papers (2021-10-18T17:35:41Z) - A Learning-Based Fast Uplink Grant for Massive IoT via Support Vector
Machines and Long Short-Term Memory [8.864453148536057]
Massive machine-type communication (mMTC) applications for the smart internet of things introduced the need for fast uplink grant (FUG) allocation to reduce latency and increase reliability.
We propose a novel FUG allocation scheme based on a support vector machine (SVM) scheduler.
An LSTM architecture is then used for traffic prediction, with correction techniques to overcome prediction errors.
arXiv Detail & Related papers (2021-08-02T11:33:02Z) - CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.