Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
- URL: http://arxiv.org/abs/2312.05385v2
- Date: Mon, 23 Sep 2024 19:59:42 GMT
- Title: Apparate: Rethinking Early Exits to Tame Latency-Throughput Tensions in ML Serving
- Authors: Yinwei Dai, Rui Pan, Anand Iyer, Kai Li, Ravi Netravali
- Abstract summary: We present Apparate, a system that automatically applies and manages early exits in machine learning models.
To cope with the time-varying overhead and accuracy challenges that EEs bring, Apparate repurposes exits to provide continual feedback.
Apparate lowers median response latencies by 40.5--91.5% and 10.0--24.2% for diverse CV and NLP classification workloads.
- Score: 10.926767319124547
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Machine learning (ML) inference platforms are tasked with balancing two competing goals: ensuring high throughput given many requests, and delivering low-latency responses to support interactive applications. Unfortunately, existing platform knobs (e.g., batch sizes) fail to ease this fundamental tension, and instead only enable users to harshly trade off one property for the other. This paper explores an alternate strategy to taming throughput-latency tradeoffs by changing the granularity at which inference is performed. We present Apparate, a system that automatically applies and manages early exits (EEs) in ML models, whereby certain inputs can exit with results at intermediate layers. To cope with the time-varying overhead and accuracy challenges that EEs bring, Apparate repurposes exits to provide continual feedback that powers several novel runtime monitoring and adaptation strategies. Apparate lowers median response latencies by 40.5--91.5% and 10.0--24.2% for diverse CV and NLP classification workloads, and median time-per-token latencies by 22.6--77.9% for generative scenarios, without affecting throughputs or violating tight accuracy constraints.
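To make the early-exit mechanism concrete, below is a minimal sketch of confidence-thresholded exits in PyTorch. The layer sizes, exit placement, and fixed threshold are illustrative assumptions; Apparate itself places exit ramps and adapts thresholds dynamically at runtime rather than fixing them.

```python
# Minimal sketch of confidence-thresholded early exits (illustrative only;
# layer sizes, exit placement, and the fixed threshold are assumptions, and
# Apparate adapts ramps and thresholds at runtime rather than fixing them).
import torch
import torch.nn as nn

class EarlyExitModel(nn.Module):
    def __init__(self, hidden=64, num_layers=6, num_classes=10, exit_at=(2, 4)):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
             for _ in range(num_layers)]
        )
        # A cheap classifier ("ramp") after selected intermediate layers.
        self.ramps = nn.ModuleDict(
            {str(i): nn.Linear(hidden, num_classes) for i in exit_at}
        )
        self.final_head = nn.Linear(hidden, num_classes)

    @torch.no_grad()
    def forward(self, x, threshold=0.9):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if str(i) in self.ramps:
                probs = self.ramps[str(i)](x).softmax(dim=-1)
                conf, pred = probs.max(dim=-1)
                # Confident enough: return early. The exit index doubles as
                # a feedback signal for runtime monitoring and adaptation.
                if conf.item() >= threshold:
                    return pred.item(), i
        return self.final_head(x).argmax(dim=-1).item(), len(self.layers) - 1

model = EarlyExitModel()
pred, exit_layer = model(torch.randn(1, 64))
print(f"prediction={pred}, exited after layer {exit_layer}")
```

The serving-side payoff is that confident inputs leave the pipeline early and cut response latency, while hard inputs still traverse the full model, which is how tight accuracy constraints can be respected without sacrificing throughput.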
Related papers
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, using a minimal number of late pre-trained layers reduces peak memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- FFN-SkipLLM: A Hidden Gem for Autoregressive Decoding with Adaptive Feed Forward Skipping [49.66872823080736]
Autoregressive Large Language Models (e.g., LLaMa, GPTs) are omnipresent, achieving remarkable success in language understanding and generation.
To mitigate the overhead incurred during generation, several early-exit and layer-dropping strategies have been proposed.
We propose FFN-SkipLLM, an input-adaptive feed-forward skipping strategy; a hedged sketch follows this entry.
arXiv Detail & Related papers (2024-04-05T02:35:43Z)
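In contrast to full early exits, FFN-SkipLLM skips only feed-forward sublayers. The sketch below uses a cosine-stability criterion and threshold that are assumptions for illustration, not FFN-SkipLLM's exact rule.

```python
# Hedged sketch of input-adaptive FFN skipping: when a block's post-attention
# state barely differs from its input, the FFN is assumed redundant and skipped.
# The cosine criterion and threshold are illustrative, not FFN-SkipLLM's rule.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    def __init__(self, d=32, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

@torch.no_grad()
def forward_skipping(blocks, x, skip_threshold=0.99):
    skipped = 0
    for block in blocks:
        attn_out, _ = block.attn(x, x, x)
        h = x + attn_out                                    # residual attention
        sim = F.cosine_similarity(h.flatten(), x.flatten(), dim=0)
        if sim > skip_threshold:                            # representation saturated
            x, skipped = h, skipped + 1                     # skip the FFN entirely
        else:
            x = h + block.ffn(h)                            # residual FFN
    return x, skipped

blocks = nn.ModuleList([Block() for _ in range(8)])
out, skipped = forward_skipping(blocks, torch.randn(1, 16, 32))
print(f"skipped {skipped} of {len(blocks)} FFN blocks")
```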
- Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter- and computation-efficient tuning method for Multi-modal Large Language Models (MLLMs).
The Efficient Attention Skipping (EAS) method evaluates the attention redundancy and skips the less important MHAs to speed up inference.
The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly accelerates inference.
arXiv Detail & Related papers (2024-03-22T14:20:34Z)
- ALTO: An Efficient Network Orchestrator for Compound AI Systems [20.880866765513066]
ALTO is a network orchestrator for efficiently serving compound AI systems such as pipelines of language models.
As language models produce outputs token by token, ALTO exposes opportunities to stream intermediate outputs between stages when possible; a toy sketch follows this entry.
We highlight two new challenges of correctness and load balancing which emerge when streaming intermediate data across distributed pipeline stage instances.
arXiv Detail & Related papers (2024-03-07T08:30:26Z)
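The toy illustration of the streaming idea promised above: the generator plumbing and function names are assumptions for exposition, not ALTO's API.

```python
# Toy sketch of streaming intermediate outputs between pipeline stages, in the
# spirit of ALTO; the generator plumbing is an assumption, not ALTO's API.
from typing import Iterator

def stage_one(prompt: str) -> Iterator[str]:
    """Upstream model: yields tokens as they are produced (a stand-in for a
    language model's token-by-token decode loop)."""
    for token in prompt.split():
        yield token

def stage_two(tokens: Iterator[str]) -> Iterator[str]:
    """Downstream stage: consumes tokens as they arrive rather than waiting
    for the complete upstream output, shrinking end-to-end latency."""
    for token in tokens:
        yield token.upper()

for out in stage_two(stage_one("tokens stream across stages")):
    print(out)
```

Fan-out across replicated stage instances is where the paper's correctness and load-balancing challenges arise; this sketch has a single instance per stage.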
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep Surrogate Model [12.335763358698564]
We propose DeepFT to proactively avoid system overloads and their adverse effects.
DeepFT uses a deep surrogate model to accurately predict and diagnose faults in the system.
It offers a highly scalable solution, as model size grows by only 3 and 1 percent per unit increase in the number of active tasks and hosts, respectively.
arXiv Detail & Related papers (2022-12-02T16:51:58Z)
- COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency with Slenderized Multi-exit Language Models [16.586312156966635]
Transformer-based pre-trained language models (PLMs) mostly suffer from excessive overhead despite their advanced capacity.
Existing statically compressed models are unaware of the varying complexity across input instances.
We propose a collaborative optimization for PLMs that integrates static model compression and dynamic inference acceleration.
arXiv Detail & Related papers (2022-10-27T15:06:40Z)
- Follow-the-Perturbed-Leader for Adversarial Markov Decision Processes with Bandit Feedback [35.687473978249535]
We consider regret minimization for Adversarial Markov Decision Processes (AMDPs), where the loss functions change over time and are adversarially chosen.
While there has been a surge of studies on this problem using Online-Mirror-Descent (OMD) methods, very little is known about Follow-the-Perturbed-Leader (FTPL) methods; the generic FTPL update is recalled below.
We develop the first no-regret algorithm for learning AMDPs in the infinite-horizon setting with bandit feedback and transitions.
arXiv Detail & Related papers (2022-05-26T15:55:50Z)
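As referenced above, the generic FTPL update in online learning takes the standard textbook form below; the paper's AMDP-specific loss estimators and perturbation distribution are not reproduced here.

```latex
% Play the policy optimal for the perturbed cumulative loss estimates;
% z_t is a fresh random perturbation and \eta > 0 a learning rate.
% Standard form, not the paper's exact algorithm.
\pi_t \in \arg\min_{\pi} \left( \sum_{s=1}^{t-1} \widehat{\ell}_s(\pi) \;-\; \frac{1}{\eta}\, z_t(\pi) \right)
```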
- Fast Continuous and Integer L-shaped Heuristics Through Supervised Learning [4.521119623956821]
We propose a methodology to accelerate the solution of mixed-integer linear two-stage programs.
We aim at solving problems where the second stage is highly demanding.
Our core idea is to gain large reductions in online solution time while incurring small reductions in first-stage solution accuracy.
arXiv Detail & Related papers (2022-05-02T13:15:32Z)
- Low-Latency Federated Learning over Wireless Channels with Differential Privacy [142.5983499872664]
In federated learning (FL), model training is distributed over clients and local models are aggregated by a central server.
In this paper, we aim to minimize FL training delay over wireless channels, constrained by overall training performance as well as each client's differential privacy (DP) requirement.
arXiv Detail & Related papers (2021-06-20T13:51:18Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries; a simplified sketch follows this entry.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
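As noted in the entry above, here is a simplified sketch of confidence-weighted transductive prototype refinement. The fixed softmax-over-distances weighting is an assumption; the paper meta-learns the confidence used to weight each unlabeled query.

```python
# Simplified sketch of confidence-weighted transductive prototype refinement.
# The fixed softmax-over-distances weighting is an assumption; the paper
# meta-learns the confidence used to weight each unlabeled query.
import torch

def refine_prototypes(protos: torch.Tensor, queries: torch.Tensor,
                      temperature: float = 10.0) -> torch.Tensor:
    """protos: (C, D) support-set prototypes; queries: (Q, D) unlabeled queries."""
    dists = torch.cdist(queries, protos)                  # (Q, C) distances
    weights = torch.softmax(-temperature * dists, dim=1)  # confidence-like soft assignments
    # Confidence-weighted mean of queries per class, blended with the originals.
    pulled = weights.T @ queries / weights.sum(dim=0, keepdim=True).T
    return 0.5 * protos + 0.5 * pulled

new_protos = refine_prototypes(torch.randn(5, 64), torch.randn(25, 64))
print(new_protos.shape)  # torch.Size([5, 64])
```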