Efficient NLP Inference at the Edge via Elastic Pipelining
- URL: http://arxiv.org/abs/2207.05022v2
- Date: Tue, 12 Jul 2022 03:17:06 GMT
- Title: Efficient NLP Inference at the Edge via Elastic Pipelining
- Authors: Liwei Guo, Wonkyo Choe, Felix Xiaozhu Lin
- Abstract summary: WRX reconciles the latency/memory tension via two novel techniques.
We build WRX and evaluate it against a range of NLP tasks, under a practical range of target latencies, and on both CPU and GPU.
- Score: 0.42970700836450487
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Natural Language Processing (NLP) inference is seeing increasing adoption by
mobile applications, where on-device inference is desirable for crucially
preserving user data privacy and avoiding network roundtrips. Yet, the
unprecedented size of an NLP model stresses both latency and memory, the two
key resources of a mobile device. To meet a target latency, holding the whole
model in memory launches execution as soon as possible but inflates the app's
memory footprint by several times, limiting the benefit to only a few
inferences before the app is recycled by mobile memory management. On the other
hand, loading the model from storage on demand incurs several seconds of IO,
far exceeding the delay range acceptable to a user; pipelining layerwise model
loading and execution does not hide the IO either, due to the large skew
between IO and computation delays.
To this end, we propose WRX. Built on the key idea of maximizing IO/compute
resource utilization on the most important parts of a model, WRX reconciles the
latency/memory tension via two novel techniques. First, model sharding. WRX
manages model parameters as independently tunable shards and profiles their
importance to accuracy. Second, elastic pipeline planning with a preload
buffer. WRX instantiates an IO/computation pipeline and uses a small buffer for
preload shards to bootstrap execution without stalling in early stages; it
judiciously selects, tunes, and assembles shards per their importance for
resource-elastic execution, which maximizes inference accuracy.
Atop two commodity SoCs, we build WRX and evaluate it against a wide range of
NLP tasks, under a practical range of target latencies, and on both CPU and
GPU. We demonstrate that WRX delivers high accuracy with 1--2 orders of
magnitude less memory, outperforming competitive baselines.
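As a rough illustration of the two techniques, the sketch below wires a per-shard bit-width plan and a preload buffer into a simple loader/compute pipeline. The shard names, importance scores, load costs, and helper functions are all hypothetical stand-ins for real storage IO and kernel execution; this is not WRX's actual code.

```python
# Illustrative sketch of elastic pipelining with a preload buffer (hypothetical data).
import queue
import threading
import time

# Hypothetical per-shard metadata: (name, importance, {bitwidth: load_seconds}).
SHARDS = [("layer0.attn", 0.9, {8: 0.04, 4: 0.02}),
          ("layer0.ffn",  0.7, {8: 0.08, 4: 0.04}),
          ("layer1.attn", 0.5, {8: 0.04, 4: 0.02}),
          ("layer1.ffn",  0.3, {8: 0.08, 4: 0.04})]

def plan(shards, io_budget_s, preload_capacity):
    """Spend the IO budget on the most important shards (higher bit-width),
    and mark the earliest shards for the preload buffer."""
    plan_bits, spent = {}, 0.0
    for name, importance, costs in sorted(shards, key=lambda s: -s[1]):
        hi, lo = max(costs), min(costs)
        bits = hi if spent + costs[hi] <= io_budget_s else lo
        spent += costs[bits]
        plan_bits[name] = bits
    preload = {s[0] for s in shards[:preload_capacity]}   # bootstrap shards
    return plan_bits, preload

def loader(shards, plan_bits, preload, ready):
    # IO stage: preloaded shards are handed over immediately, the rest are "read".
    for name, _, costs in shards:
        if name not in preload:
            time.sleep(costs[plan_bits[name]])   # stand-in for storage IO
        ready.put(name)                          # hand shard to the compute stage

def run(shards, io_budget_s=0.15, preload_capacity=1):
    plan_bits, preload = plan(shards, io_budget_s, preload_capacity)
    ready = queue.Queue()
    threading.Thread(target=loader, args=(shards, plan_bits, preload, ready),
                     daemon=True).start()
    for name, _, _ in shards:                    # compute stage, in model order
        assert ready.get() == name               # shards arrive in order
        time.sleep(0.03)                         # stand-in for executing the shard

run(SHARDS)
```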
Related papers
- TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices [36.714057078457195]
We present TPI-LLM, a compute- and memory-efficient tensor parallel inference system for 70B-scale models.
TPI-LLM keeps sensitive raw data local on users' devices and introduces a sliding window memory scheduler.
TPI-LLM demonstrates over 80% lower time-to-first-token and token latency than Accelerate.
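A sliding window memory scheduler can be pictured as keeping only a small window of layer weights resident, prefetching upcoming layers and evicting ones already executed. The sketch below is an assumption-laden illustration of that idea (the class name, window size, and load function are hypothetical), not TPI-LLM's implementation.

```python
# Rough sketch of a sliding-window weight scheduler (illustrative, not TPI-LLM code).
from collections import OrderedDict

class SlidingWindowScheduler:
    def __init__(self, num_layers, window=4, load_fn=None):
        self.num_layers = num_layers
        self.window = window            # max layers resident in memory at once
        self.load_fn = load_fn or (lambda i: f"weights_of_layer_{i}")
        self.resident = OrderedDict()   # layer index -> weights, in load order

    def weights_for(self, layer):
        # Prefetch the current layer plus the next (window - 1) layers.
        for i in range(layer, min(layer + self.window, self.num_layers)):
            if i not in self.resident:
                self.resident[i] = self.load_fn(i)   # stand-in for a disk read
        # Evict the oldest layers once the window overflows.
        while len(self.resident) > self.window:
            self.resident.popitem(last=False)
        return self.resident[layer]

# Usage: walk the layers as a forward pass would.
sched = SlidingWindowScheduler(num_layers=32, window=4)
for layer in range(32):
    w = sched.weights_for(layer)        # only ~4 layers held in memory at a time
```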
arXiv Detail & Related papers (2024-10-01T09:18:56Z)
- Enabling Efficient On-Device Fine-Tuning of LLMs Using Only Inference Engines [17.539008562641303]
Large Language Models (LLMs) are currently pre-trained and fine-tuned on large cloud servers.
The next frontier is LLM personalization, where a foundation model can be fine-tuned with user- or task-specific data.
Fine-tuning on resource-constrained edge devices presents significant challenges due to substantial memory and computational demands.
arXiv Detail & Related papers (2024-09-23T20:14:09Z)
- Combining Relevance and Magnitude for Resource-Aware DNN Pruning [16.976723041143956]
Pruning neural networks, removing some of their parameters whilst retaining their accuracy, is one of the main ways to reduce the latency of a machine learning pipeline.
In this paper, we propose a novel pruning approach, called FlexRel, predicated upon combining training-time and inference-time information.
Our performance evaluation shows that FlexRel is able to achieve higher pruning factors, saving over 35% bandwidth for typical accuracy targets.
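A minimal sketch of combining the two kinds of information, assuming a per-parameter relevance score collected at training time and plain weight magnitude at inference time; the normalization and blend factor below are hypothetical placeholders, not FlexRel's actual criterion.

```python
# Illustrative pruning criterion mixing relevance and magnitude (not FlexRel's exact rule).
import numpy as np

def prune_mask(weights, relevance, prune_fraction=0.5, alpha=0.5):
    """Keep the parameters with the highest combined score.

    weights   : parameter values (magnitude is the inference-time signal)
    relevance : per-parameter relevance from training time (assumed given)
    alpha     : hypothetical blend factor between the two signals
    """
    magnitude = np.abs(weights)
    # Normalise both signals so they can be blended on a common scale.
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-12)
    score = alpha * norm(relevance) + (1 - alpha) * norm(magnitude)
    threshold = np.quantile(score, prune_fraction)
    return score >= threshold            # True = keep, False = prune

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
r = rng.random((256, 256))               # stand-in relevance scores
mask = prune_mask(w, r, prune_fraction=0.35)
pruned = w * mask                        # ~35% of parameters zeroed out
```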
arXiv Detail & Related papers (2024-05-21T11:42:15Z)
- NeuraChip: Accelerating GNN Computations with a Hash-based Decoupled Spatial Accelerator [3.926150707772004]
We introduce NeuraChip, a novel GNN spatial accelerator based on Gustavson's algorithm.
NeuraChip decouples the multiplication and addition computations in sparse matrix multiplication.
We also present NeuraSim, an open-source, cycle-accurate, multi-threaded, modular simulator for comprehensive performance analysis.
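Gustavson's algorithm forms each output row by scaling and merging rows of the second operand, which is what makes the multiply and accumulate phases separable. Below is a plain-Python sketch of the row-wise formulation, using a dict-of-dicts sparse layout chosen only for readability; NeuraChip realizes the same dataflow in hardware with hash-based accumulation.

```python
# Row-wise sparse matrix multiply (Gustavson's algorithm) on dict-of-dicts data.
def gustavson_spgemm(A, B):
    """C = A @ B where A and B map row index -> {col index: value}.

    Each nonzero A[i][k] multiplies row k of B; the partial products are merged
    into row i of C via a hash-based accumulator. The multiply step and the
    accumulation step are logically separate, which is the decoupling a
    spatial accelerator can exploit.
    """
    C = {}
    for i, row_a in A.items():
        acc = {}                               # hash-based accumulator for row i
        for k, a_ik in row_a.items():
            for j, b_kj in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {1: 2.0}, 1: {0: 3.0, 2: 1.0}}
B = {0: {0: 1.0}, 1: {2: 4.0}, 2: {1: 5.0}}
print(gustavson_spgemm(A, B))   # {0: {2: 8.0}, 1: {0: 3.0, 1: 5.0}}
```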
arXiv Detail & Related papers (2024-04-23T20:51:09Z)
- Green AI: A Preliminary Empirical Study on Energy Consumption in DL Models Across Different Runtime Infrastructures [56.200335252600354]
It is common practice to deploy pre-trained models in environments distinct from their native development settings.
This has led to the introduction of interchange formats such as ONNX, along with their runtime infrastructures, which serve as standard formats for such cross-environment deployment.
arXiv Detail & Related papers (2024-02-21T09:18:44Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one-billion-parameter model, HiRE applied to both the softmax and feedforward layers achieves nearly matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
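The two-stage idea can be sketched as follows: a cheap compressed scoring pass proposes candidate rows with high recall, and the exact computation is restricted to that subset. The random-projection compressor, the sizes, and the 4x over-fetch below are illustrative stand-ins, not HiRE's actual scheme.

```python
# Illustrative two-stage approximate top-k (stand-in for HiRE's compression scheme).
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 512, 16384, 32                 # hidden dim, rows (e.g. vocab), top-k
W = rng.normal(size=(n, d)).astype(np.float32)
x = rng.normal(size=d).astype(np.float32)

# Stage 1: cheap approximate scores via a random projection (hypothetical compressor).
r = 64                                   # compressed dimension
P = rng.normal(size=(d, r)).astype(np.float32) / np.sqrt(r)
W_small, x_small = W @ P, x @ P          # W_small would be precomputed offline
approx = W_small @ x_small
candidates = np.argpartition(approx, -4 * k)[-4 * k:]   # over-fetch for recall

# Stage 2: exact scores restricted to the predicted candidates.
exact = W[candidates] @ x
topk = candidates[np.argpartition(exact, -k)[-k:]]

# Recall against the exact top-k on the full matrix (for illustration only).
true_topk = np.argpartition(W @ x, -k)[-k:]
print("recall:", len(set(topk) & set(true_topk)) / k)
```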
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
- FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference [57.119047493787185]
In practice, the proposed method reduces model size by 43.1% and brings a $1.25\sim1.56\times$ wall-clock time speedup on different hardware with a negligible accuracy drop.
arXiv Detail & Related papers (2024-01-08T17:29:16Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for single-batch generative inference with LLMs is memory bandwidth rather than compute.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
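The Dense-and-Sparse decomposition can be illustrated by pulling a small fraction of large-magnitude weights into an exact sparse part and quantizing the rest; the uniform 4-bit grid and the 0.5% outlier fraction below are simplifying assumptions standing in for SqueezeLLM's sensitivity-based non-uniform codebooks.

```python
# Sketch of dense-and-sparse weight decomposition (simplified; a uniform 4-bit grid
# stands in for SqueezeLLM's sensitivity-based non-uniform quantization).
import numpy as np

def dense_and_sparse(W, outlier_frac=0.005, bits=4):
    # Pull the largest-magnitude weights out into an exact sparse part.
    cutoff = np.quantile(np.abs(W), 1.0 - outlier_frac)
    rows, cols = np.nonzero(np.abs(W) >= cutoff)
    vals = W[rows, cols].copy()

    # Quantize the remaining dense part on a uniform grid (placeholder codebook).
    D = W.copy()
    D[rows, cols] = 0.0
    scale = np.abs(D).max() / (2 ** (bits - 1) - 1)
    Q = np.round(D / scale).astype(np.int8)
    return Q, scale, (rows, cols, vals)

def matvec(Q, scale, sparse_part, x):
    rows, cols, vals = sparse_part
    y = (Q.astype(np.float32) * scale) @ x      # low-precision dense part
    np.add.at(y, rows, vals * x[cols])          # exact outlier corrections
    return y

rng = np.random.default_rng(0)
W = rng.normal(size=(1024, 1024)).astype(np.float32)
x = rng.normal(size=1024).astype(np.float32)
Q, scale, sp = dense_and_sparse(W)
# Remaining error comes only from quantizing the non-outlier dense part.
print(np.max(np.abs(matvec(Q, scale, sp, x) - W @ x)))
```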
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
- Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
Partitioned edge learning (PARTEL) implements parameter-server training, a well-known distributed learning method, in a wireless network.
We consider the case of deep neural network (DNN) models, which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)