HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM
Inference
- URL: http://arxiv.org/abs/2402.09360v1
- Date: Wed, 14 Feb 2024 18:04:36 GMT
- Title: HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM
Inference
- Authors: Yashas Samaga B L and Varun Yerram and Chong You and Srinadh
Bhojanapalli and Sanjiv Kumar and Prateek Jain and Praneeth Netrapalli
- Abstract summary: HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves nearly matching pretraining and downstream accuracy, and reduces inference latency by $1.47\times$ on a single TPUv5e device.
- Score: 68.59839755875252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Autoregressive decoding with generative Large Language Models (LLMs) on
accelerators (GPUs/TPUs) is often memory-bound, as most of the time is spent
on transferring model parameters from high bandwidth memory (HBM) to cache. On
the other hand, recent works show that LLMs can maintain quality with
significant sparsity/redundancy in the feedforward (FFN) layers by
appropriately training the model to operate on a top-$k$ fraction of
rows/columns (where $k \approx 0.05$), thereby suggesting a way to reduce the
transfer of model parameters, and hence latency. However, exploiting this
sparsity for improving latency is hindered by the fact that identifying top
rows/columns is data-dependent and is usually performed using full matrix
operations, severely limiting potential gains. To address these issues, we
introduce HiRE (High Recall Approximate Top-$k$ Estimation). HiRE comprises
two novel components: (i) a compression scheme to cheaply predict top-$k$
rows/columns with high recall, followed by full computation restricted to the
predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate
top-$k$ operator. We demonstrate that on a one-billion-parameter model, HiRE
applied to both the softmax and feedforward layers achieves nearly matching
pretraining and downstream accuracy, and reduces inference latency by
$1.47\times$ on a single TPUv5e device.
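The two-stage idea in component (i) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the layer sizes, the oversampling factor, and the int8-style quantizer standing in for the compression scheme are all assumptions, and the multi-device DA-TOP-$k$ operator is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, k = 512, 2048, 64  # hypothetical layer sizes
W = rng.standard_normal((d_in, d_out)).astype(np.float32)
x = rng.standard_normal(d_in).astype(np.float32)

# Cheap predictor: an int8-style quantized copy of W, used only for scoring.
scale = np.abs(W).max() / 127.0
W_q = np.round(W / scale).astype(np.int8)

def approx_topk_matvec(x, W, W_q, scale, k, oversample=2):
    """Score columns with the compressed weights, keep an oversampled
    candidate set for high recall, then run the exact computation
    restricted to that subset."""
    approx = (x @ W_q.astype(np.float32)) * scale   # cheap approximate scores
    k_prime = min(oversample * k, W.shape[1])
    cand = np.argpartition(approx, -k_prime)[-k_prime:]  # high-recall superset
    exact = x @ W[:, cand]                          # full precision, small subset
    return cand[np.argpartition(exact, -k)[-k:]]

pred = approx_topk_matvec(x, W, W_q, scale, k)
true = np.argpartition(x @ W, -k)[-k:]
recall = len(np.intersect1d(pred, true)) / k
print(f"recall of the exact top-{k}: {recall:.2f}")
```

Because the quantization error of the scores is small relative to the spread of the true scores, a modest oversampling factor is typically enough to recover the exact top-$k$ set with high recall while touching only a fraction of the weight matrix at full precision.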
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - No Need to Look Back: An Efficient and Scalable Approach for Temporal
Network Representation Learning [9.218415145210715]
This paper introduces a novel efficient TGRL framework, No-Looking-Back (NLB).
NLB employs a "forward recent sampling" strategy, which bypasses the need for backtracking historical interactions.
Empirical evaluations demonstrate that NLB matches or surpasses state-of-the-art methods in accuracy for link prediction and node classification.
arXiv Detail & Related papers (2024-02-03T00:12:36Z) - Scaling Sparse Fine-Tuning to Large Language Models [67.59697720719672]
Large Language Models (LLMs) are difficult to fully fine-tune due to their sheer number of parameters.
We propose SpIEL, a novel sparse fine-tuning method that maintains an array of parameter indices and the deltas of these parameters relative to their pretrained values.
We show that SpIEL is superior to popular parameter-efficient fine-tuning methods like LoRA in terms of performance and comparable in terms of run time.
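The index-plus-delta representation described above can be sketched generically. This is a hedged illustration of sparse-delta fine-tuning, not SpIEL itself: the parameter count, the fixed index set, and the plain SGD update are all assumptions (SpIEL additionally grows and prunes the index set during training).

```python
import numpy as np

rng = np.random.default_rng(1)
pretrained = rng.standard_normal(10_000).astype(np.float32)  # flat parameter vector

# Only a small set of parameter indices is tuned; the rest stay pretrained.
n_active = 100
indices = rng.choice(pretrained.size, size=n_active, replace=False)
deltas = np.zeros(n_active, dtype=np.float32)

def effective_params(pretrained, indices, deltas):
    """Materialize the tuned model: pretrained values plus sparse deltas."""
    out = pretrained.copy()
    out[indices] += deltas
    return out

def sgd_step(deltas, grad_full, indices, lr=0.1):
    """Update only the tracked deltas; gradients elsewhere are discarded."""
    return deltas - lr * grad_full[indices]

grad = rng.standard_normal(pretrained.size).astype(np.float32)
deltas = sgd_step(deltas, grad, indices)
tuned = effective_params(pretrained, indices, deltas)
changed = np.flatnonzero(tuned != pretrained)
print(f"{changed.size} of {pretrained.size} parameters changed")
```

The optimizer state and the stored checkpoint both scale with the number of active indices rather than the full parameter count, which is the source of the memory savings.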
arXiv Detail & Related papers (2024-01-29T18:43:49Z) - On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
arXiv Detail & Related papers (2023-06-03T05:01:51Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of
Language Model [92.55145016562867]
We propose a new family of unbiased estimators, called WTA-CRS, for matrix products with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
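A generic column-row sampling (CRS) estimator of the kind WTA-CRS builds on can be sketched as follows. This is a textbook norm-proportional sampler for illustration, not the WTA-CRS estimator itself; the matrix sizes and sample counts are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 256))
B = rng.standard_normal((256, 32))

def crs_matmul(A, B, n_samples, rng):
    """Unbiased column-row sampling estimate of A @ B: draw index i with
    probability proportional to ||A[:, i]|| * ||B[i, :]||, then rescale."""
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=n_samples, p=p)
    scale = 1.0 / (n_samples * p[idx])  # rescaling makes the estimator unbiased
    return (A[:, idx] * scale) @ B[idx, :]

exact = A @ B
# Averaging independent estimates converges to the exact product (unbiasedness).
mean_est = np.mean([crs_matmul(A, B, 128, rng) for _ in range(200)], axis=0)
rel_err = np.linalg.norm(mean_est - exact) / np.linalg.norm(exact)
print(f"relative error of the averaged estimate: {rel_err:.3f}")
```

Unbiasedness is what makes such estimators usable inside gradient computation; the contribution of WTA-CRS is to reduce the variance of the estimate in the transformer-tuning setting.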
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - QuaLA-MiniLM: a Quantized Length Adaptive MiniLM [5.36703735486629]
Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized.
A knowledge distillation approach addresses computational efficiency by self-distilling BERT into a smaller transformer representation with fewer layers and smaller internal embeddings.
Dynamic-TinyBERT tackles both limitations by partially implementing the Length Adaptive Transformer (LAT) technique onto TinyBERT, achieving a 3x speedup over BERT-base with minimal accuracy loss.
We use MiniLM distillation jointly with the LAT method, and we further enhance the efficiency by applying low-bit quantization.
arXiv Detail & Related papers (2022-10-31T07:42:52Z) - BERMo: What can BERT learn from ELMo? [6.417011237981518]
We use the linear combination scheme proposed in Embeddings from Language Models (ELMo) to combine the scaled internal representations from different network depths.
Our approach has two-fold benefits: (1) improved gradient flow for the downstream task and (2) increased representative power.
arXiv Detail & Related papers (2021-10-18T17:35:41Z) - A Learning-Based Fast Uplink Grant for Massive IoT via Support Vector
Machines and Long Short-Term Memory [8.864453148536057]
Massive machine-type communication (mMTC) applications for the smart internet of things introduced the need for fast uplink grant (FUG) allocation to reduce latency and increase reliability.
We propose a novel FUG allocation scheme based on a support vector machine (SVM) scheduler.
An LSTM architecture is then used for traffic prediction, with correction techniques to overcome prediction errors.
arXiv Detail & Related papers (2021-08-02T11:33:02Z) - CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.