Low-Latency ML Inference by Grouping Correlated Data Objects and
Computation
- URL: http://arxiv.org/abs/2312.11488v1
- Date: Thu, 30 Nov 2023 16:02:04 GMT
- Title: Low-Latency ML Inference by Grouping Correlated Data Objects and
Computation
- Authors: Thiago Garrett, Weijia Song, Roman Vitenberg, Ken Birman
- Abstract summary: We propose a novel correlation grouping mechanism that makes it easier for developers to express application-specific data access correlations.
Experiments based on a latency-sensitive ML-based application confirm the limitations of standard techniques.
The proposed mechanism maintains significantly lower and more consistent latency and achieves higher node utilization as workload and scale-out increase.
- Score: 0.20482269513546453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: ML inference workflows often require low latency and high throughput, yet we
lack good options for addressing this need. Techniques that reduce latency in
other streaming settings (such as caching and optimization-driven scheduling)
are of limited value because ML data dependencies are often very large and can
change dramatically depending on the triggering event. In this work, we propose
a novel correlation grouping mechanism that makes it easier for developers to
express application-specific data access correlations, enabling coordinated
management of data objects in server clusters hosting streaming inference
tasks. Experiments based on a latency-sensitive ML-based application confirm
the limitations of standard techniques while showing that our solution yields
dramatically better performance. The proposed mechanism is able to maintain
significantly lower and more consistent latency, achieves higher node
utilization as workload and scale-out increase, and yet requires only minor
changes to the code implementing the application.
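The abstract does not spell out the programming interface, so the following is only a minimal sketch of how an application might declare data-access correlations; the names (`CorrelationGroupedStore`, `affinity_key`, `put`, `infer_on`) are illustrative assumptions, not the authors' API. The point it illustrates is that objects tagged with the same key are co-located on one node, and the inference triggered by an event is dispatched to that node so its large data dependencies are already local.

```python
# Hypothetical sketch of a correlation-grouping API; names are illustrative,
# not the paper's actual interface.
import hashlib

class CorrelationGroupedStore:
    """Co-locates objects that share an application-supplied affinity key."""

    def __init__(self, nodes):
        self.nodes = nodes          # list of server identifiers
        self.placement = {}         # affinity_key -> node

    def _node_for(self, affinity_key):
        # Deterministic placement: all objects with the same key land on one node.
        if affinity_key not in self.placement:
            h = int(hashlib.sha256(affinity_key.encode()).hexdigest(), 16)
            self.placement[affinity_key] = self.nodes[h % len(self.nodes)]
        return self.placement[affinity_key]

    def put(self, affinity_key, obj_id, obj):
        node = self._node_for(affinity_key)
        print(f"store {obj_id} on {node}")          # stand-in for a real RPC

    def infer_on(self, affinity_key, model, event):
        # Dispatch the inference task to the node already holding the
        # correlated objects, avoiding cross-node fetches on the hot path.
        node = self._node_for(affinity_key)
        print(f"run {model} for {event} on {node}")

store = CorrelationGroupedStore(nodes=["node-0", "node-1", "node-2"])
store.put("camera-17", "frame-001", b"...")
store.put("camera-17", "calibration", b"...")
store.infer_on("camera-17", model="detector-v2", event="frame-001")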
Related papers
- ALISE: Accelerating Large Language Model Serving with Speculative Scheduling [7.367068885621016]
Large Language Models (LLMs) represent a revolutionary advancement in the contemporary landscape of artificial general intelligence (AGI).
In this paper, we propose a new efficient LLM inference serving framework, named ALISE.
We show that ALISE improves the throughput of inference serving by up to 1.8x and 2.1x under the same latency constraint on the Alpaca and ShareGPT datasets, respectively.
arXiv Detail & Related papers (2024-10-31T00:58:11Z)
- When Less is More: Achieving Faster Convergence in Distributed Edge Machine Learning [0.0]
Distributed Machine Learning (DML) on resource-constrained edge devices holds immense potential for real-world applications.
This paper proposes Hermes, a novel probabilistic framework for efficient DML on edge devices.
Our evaluations on a real-world heterogeneous resource-constrained environment demonstrate that Hermes achieves faster convergence compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-10-27T16:17:03Z)
- Fast Inference for Augmented Large Language Models [14.195265302357148]
Augmented Large Language Models (LLMs) enhance the capabilities of standalone LLMs by integrating external data sources through API calls.
Traditional size-based scheduling algorithms, such as Shortest Job First (SJF), become less effective at minimizing completion times.
We propose LAMPS, a novel LLM inference framework for augmented LLMs.
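For reference, below is a minimal sketch of the size-based SJF baseline named above; the request format and the size estimate (expected output tokens) are placeholder assumptions. With augmented LLMs, external API calls make such size estimates unreliable, which is the gap LAMPS targets.

```python
# Illustrative Shortest-Job-First baseline: serve pending requests in order of
# an estimated size (e.g., expected output tokens).
import heapq

def sjf_schedule(requests):
    """requests: iterable of (request_id, estimated_size). Returns service order."""
    heap = [(size, rid) for rid, size in requests]
    heapq.heapify(heap)
    order = []
    while heap:
        _, rid = heapq.heappop(heap)
        order.append(rid)
    return order

print(sjf_schedule([("r1", 120), ("r2", 35), ("r3", 480)]))  # ['r2', 'r1', 'r3']
```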
arXiv Detail & Related papers (2024-10-23T19:53:30Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and Parameter-Efficient Fine-Tuning (PEFT) are currently two mainstream methods for adapting LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, relying on a minimal number of late pre-trained layers can alleviate the peak memory demand.
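The summary gives only the high-level shape of the two routes, so the sketch below is a rough illustration under stated assumptions: the layer split, the consolidation rule (a learned softmax-weighted sum standing in for the anti-redundancy operation), and all module names are guesses, not SHERL's actual design.

```python
# Rough sketch of the early/late two-route idea; the consolidation operator
# and layer split are assumptions, not SHERL's actual architecture.
import torch
import torch.nn as nn

class TwoRouteAdapter(nn.Module):
    def __init__(self, backbone_layers, hidden_dim, num_late_layers=2):
        super().__init__()
        self.early = nn.ModuleList(backbone_layers[:-num_late_layers])  # frozen
        self.late = nn.ModuleList(backbone_layers[-num_late_layers:])   # tuned
        for layer in self.early:
            for p in layer.parameters():
                p.requires_grad_(False)
        # Learned weights that consolidate intermediate outputs (stand-in for
        # the anti-redundancy step), plus a small trainable projection.
        self.mix = nn.Parameter(torch.zeros(len(self.early)))
        self.proj = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        feats = []
        with torch.no_grad():                 # early route: no activation storage
            h = x
            for layer in self.early:
                h = layer(h)
                feats.append(h)
        w = torch.softmax(self.mix, dim=0)
        h = self.proj(sum(wi * fi for wi, fi in zip(w, feats)))
        for layer in self.late:               # late route: few trainable layers
            h = layer(h)
        return h

backbone = [nn.Linear(64, 64) for _ in range(6)]
model = TwoRouteAdapter(backbone, hidden_dim=64)
print(model(torch.randn(2, 64)).shape)        # torch.Size([2, 64])
```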
arXiv Detail & Related papers (2024-07-10T10:22:35Z)
- SpotServe: Serving Generative Large Language Models on Preemptible Instances [64.18638174004151]
SpotServe is the first distributed large language model serving system on preemptible instances.
We show that SpotServe can reduce the P99 tail latency by 2.4 - 9.1x compared with the best existing LLM serving systems.
We also show that SpotServe can leverage the price advantage of preemptive instances, saving 54% monetary cost compared with only using on-demand instances.
arXiv Detail & Related papers (2023-11-27T06:31:17Z)
- Federated Learning of Large Language Models with Parameter-Efficient Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally incurs the update of significant parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z)
- Fast Distributed Inference Serving for Large Language Models [12.703624317418237]
We present FastServe, a distributed inference serving system for large language models (LLMs).
FastServe exploits the autoregressive pattern of LLM inference to enable preemption at the granularity of each output token.
We build a system prototype of FastServe and experimental results show that compared to the state-of-the-art solution vLLM, FastServe improves the throughput by up to 31.4x and 17.9x under the same average and tail latency requirements, respectively.
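Below is a minimal simulation of the token-granularity preemption idea. The scheduling policy here is plain round-robin, whereas FastServe's actual scheduler is more sophisticated; the sketch only shows how a request can be suspended and resumed between output tokens.

```python
# Toy simulation of token-granularity preemption: each scheduled step emits
# exactly one token, then the request can be requeued behind others.
from collections import deque

def serve(requests):
    """requests: dict of request_id -> total output tokens. Yields (step, id)."""
    queue = deque(requests.items())
    step = 0
    while queue:
        rid, remaining = queue.popleft()
        step += 1
        yield step, rid                          # generate exactly one token
        if remaining > 1:
            queue.append((rid, remaining - 1))   # preempt and requeue

for step, rid in serve({"short": 2, "long": 5}):
    print(step, rid)
```

In this toy run the short request finishes at step 3 instead of queuing behind the long one, which is the latency effect that token-level preemption is after.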
arXiv Detail & Related papers (2023-05-10T06:17:50Z)
- FIRE: A Failure-Adaptive Reinforcement Learning Framework for Edge Computing Migrations [52.85536740465277]
FIRE is a framework that adapts to rare events by training an RL policy in an edge computing digital twin environment.
We propose ImRE, an importance sampling-based Q-learning algorithm, which samples rare events proportionally to their impact on the value function.
We show that FIRE reduces costs compared to vanilla RL and the greedy baseline in the event of failures.
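The summary does not give ImRE's exact estimator, so the snippet below only sketches the generic importance-sampling pattern it alludes to: sample transitions with probability proportional to a measure of impact (here, the absolute TD error, an assumption) and reweight the Q-update to compensate for the skewed sampling.

```python
# Generic importance-sampled Q-update pattern (a stand-in for ImRE; the impact
# measure and correction used by the paper may differ).
import random
from collections import defaultdict

ACTIONS = ("migrate", "stay")
Q = defaultdict(float)                 # (state, action) -> value
alpha, gamma = 0.1, 0.95

def impact(s, a, r, s2):
    # Stand-in "impact on the value function": the absolute TD error.
    return abs(r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)]) + 1e-6

def sample(buffer):
    weights = [impact(*t) for t in buffer]
    total = sum(weights)
    probs = [w / total for w in weights]
    idx = random.choices(range(len(buffer)), weights=probs, k=1)[0]
    # Importance weight corrects for sampling rare, high-impact events more often.
    return buffer[idx], (1.0 / len(buffer)) / probs[idx]

buffer = [("s0", "stay", -0.1, "s0"), ("s0", "migrate", -10.0, "s_fail")]
for _ in range(200):
    (s, a, r, s2), w = sample(buffer)
    td = r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)]
    Q[(s, a)] += alpha * w * td
print(dict(Q))
```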
arXiv Detail & Related papers (2022-09-28T19:49:39Z)
- OFedQIT: Communication-Efficient Online Federated Learning via Quantization and Intermittent Transmission [7.6058140480517356]
Online federated learning (OFL) is a promising framework to collaboratively learn a sequence of non-linear functions (or models) from distributed streaming data.
We propose a communication-efficient OFL algorithm (named OFedQIT) by means of a quantization and an intermittent transmission.
Our analysis reveals that OFedQIT successfully addresses the drawbacks of OFedAvg while maintaining superior learning accuracy.
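A small sketch of the two ingredients named above, under assumptions: an unbiased stochastic quantizer for the local update and a staggered every-T-rounds transmission gate. The concrete quantizer, schedule, and aggregation rule in OFedQIT may differ.

```python
# Quantization + intermittent transmission, sketched with placeholder choices.
import numpy as np

def stochastic_quantize(update, levels=16):
    """Map each coordinate to one of `levels` points in [-scale, scale],
    rounding up or down at random so the quantizer stays unbiased."""
    scale = np.max(np.abs(update)) or 1.0
    normalized = (update / scale + 1.0) / 2.0 * (levels - 1)
    lower = np.floor(normalized)
    prob_up = normalized - lower
    q = lower + (np.random.rand(*update.shape) < prob_up)
    return (q / (levels - 1) * 2.0 - 1.0) * scale

def maybe_transmit(client_id, round_idx, update, period=4):
    # Intermittent transmission: each client uploads only every `period` rounds,
    # staggered so the server hears from some clients each round.
    if round_idx % period == (client_id % period):
        return stochastic_quantize(update)   # compressed upload
    return None                              # skip this round

local_update = np.random.randn(8).astype(np.float32)
print(maybe_transmit(client_id=1, round_idx=5, update=local_update))
```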
arXiv Detail & Related papers (2022-05-13T07:46:43Z)
- Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for 5G and Beyond [70.81551587109833]
Nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity.
One of the main challenges comes from the real-time implementation of these algorithms.
This paper explores the acceleration of APSM-based algorithms through massive parallelization.
arXiv Detail & Related papers (2022-01-13T15:20:45Z)