AlpaServe: Statistical Multiplexing with Model Parallelism for Deep
Learning Serving
- URL: http://arxiv.org/abs/2302.11665v2
- Date: Wed, 19 Jul 2023 04:03:11 GMT
- Title: AlpaServe: Statistical Multiplexing with Model Parallelism for Deep
Learning Serving
- Authors: Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin
Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E. Gonzalez, Ion Stoica
- Abstract summary: We show that model parallelism can be used for the statistical multiplexing of multiple devices when serving multiple models.
We present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models.
- Score: 53.01646445659089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model parallelism is conventionally viewed as a method to scale a single
large deep learning model beyond the memory limits of a single device. In this
paper, we demonstrate that model parallelism can be additionally used for the
statistical multiplexing of multiple devices when serving multiple models, even
when a single model can fit into a single device. Our work reveals a
fundamental trade-off between the overhead introduced by model parallelism and
the opportunity to exploit statistical multiplexing to reduce serving latency
in the presence of bursty workloads. We explore the new trade-off space and
present a novel serving system, AlpaServe, that determines an efficient
strategy for placing and parallelizing collections of large deep learning
models across a distributed cluster. Evaluation results on production workloads
show that AlpaServe can process requests at up to 10x higher rates or 6x more
burstiness while staying within latency constraints for more than 99% of
requests.
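The trade-off described above can be illustrated with a toy simulation (not AlpaServe's actual placement algorithm): two models receive bursty request streams, and we compare dedicating one GPU to each model against pooling both GPUs behind a shared queue, where model parallelism adds a per-request overhead. The 10% overhead, execution time, and burst model below are illustrative assumptions.

```python
import random

random.seed(0)

EXEC = 1.0          # base execution time of one request on one GPU (assumed)
MP_OVERHEAD = 0.10  # assumed slowdown when a model is split across two GPUs

def bursty_arrivals(n, rate, max_burst):
    """Arrival times with occasional back-to-back bursts of requests."""
    t, times = 0.0, []
    while len(times) < n:
        t += random.expovariate(rate)
        burst = random.randint(1, max_burst)
        times.extend([t] * min(burst, n - len(times)))
    return times

def mean_latency(arrivals, n_servers, service_time):
    """First-come-first-served queue with identical servers; returns mean latency."""
    free_at = [0.0] * n_servers
    total = 0.0
    for a in sorted(arrivals):
        s = min(range(n_servers), key=lambda i: free_at[i])
        start = max(a, free_at[s])
        free_at[s] = start + service_time
        total += free_at[s] - a
    return total / len(arrivals)

# Two models, each with its own bursty request stream.
reqs_a = bursty_arrivals(400, rate=0.2, max_burst=6)
reqs_b = bursty_arrivals(400, rate=0.2, max_burst=6)

# (a) Dedicated placement: model A only on GPU 0, model B only on GPU 1.
dedicated = (mean_latency(reqs_a, 1, EXEC) + mean_latency(reqs_b, 1, EXEC)) / 2

# (b) Multiplexed placement: both models parallelized over both GPUs, modeled
# crudely as a single two-server queue that pays the parallelism overhead but
# lets either GPU slot absorb a burst from either model.
multiplexed = mean_latency(reqs_a + reqs_b, 2, EXEC * (1 + MP_OVERHEAD))

print(f"dedicated   mean latency: {dedicated:.2f}")
print(f"multiplexed mean latency: {multiplexed:.2f}")
```

Under bursty arrivals, the pooled placement typically keeps mean latency lower despite the parallelism overhead, which is the statistical multiplexing opportunity the abstract describes.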
Related papers
- FlexModel: A Framework for Interpretability of Distributed Large
Language Models [0.0]
We present FlexModel, a software package providing a streamlined interface for engaging with models distributed across multi-GPU and multi-node configurations.
The library is compatible with existing model distribution libraries and encapsulates PyTorch models.
It exposes user-registerable HookFunctions to facilitate straightforward interaction with distributed model internals.
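The abstract does not spell out the HookFunction API, so the sketch below uses plain PyTorch forward hooks on a single-process stand-in model to illustrate the general pattern of registering a function that observes an internal activation; it is not FlexModel's actual interface.

```python
import torch
import torch.nn as nn

# Stand-in model; in FlexModel's setting this would be a module wrapped for
# multi-GPU / multi-node execution (names here are illustrative only).
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Linear(32, 4),
)

captured = {}

def hook_fn(module, inputs, output):
    # Called after the module's forward pass; save a detached copy of the activation.
    captured["hidden"] = output.detach().clone()

# Register the hook on an internal layer, analogous to attaching a
# user-registered hook function to a layer of a distributed model.
handle = model[0].register_forward_hook(hook_fn)

with torch.no_grad():
    model(torch.randn(2, 16))

print(captured["hidden"].shape)  # torch.Size([2, 32])
handle.remove()
```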
arXiv Detail & Related papers (2023-12-05T21:19:33Z)
- Saturn: An Optimized Data System for Large Model Deep Learning Workloads [6.377812618046872]
We tackle SPASE: Select a Parallelism, Allocate resources, and SchedulE.
We propose a new information system architecture to tackle the SPASE problem holistically.
We find that direct use of an MILP-solver is significantly more effective than several baselines.
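As a rough illustration of what casting parallelism selection and resource allocation as an MILP can look like (this toy formulation and the use of the PuLP solver interface are assumptions, not Saturn's actual model), the sketch below picks one parallelism configuration per job under a GPU budget so as to minimize makespan.

```python
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpBinary, value, PULP_CBC_CMD

# Candidate (gpus, runtime_hours) configurations per job -- illustrative numbers only.
configs = {
    "bert": [(1, 10.0), (2, 5.5), (4, 3.0)],
    "gpt":  [(2, 12.0), (4, 6.5), (8, 3.6)],
    "vit":  [(1, 4.0),  (2, 2.2)],
}
GPU_BUDGET = 8

prob = LpProblem("toy_spase", LpMinimize)

# x[j][k] = 1 if job j runs with its k-th candidate parallelism configuration.
x = {j: [LpVariable(f"x_{j}_{k}", cat=LpBinary) for k in range(len(cfgs))]
     for j, cfgs in configs.items()}
makespan = LpVariable("makespan", lowBound=0)

prob += makespan  # objective: finish the whole batch of jobs as early as possible

for j, cfgs in configs.items():
    prob += lpSum(x[j]) == 1                                          # one config per job
    prob += makespan >= lpSum(x[j][k] * cfgs[k][1] for k in range(len(cfgs)))

# All jobs run concurrently here, so their GPU allocations must fit the cluster.
prob += lpSum(x[j][k] * configs[j][k][0]
              for j in configs for k in range(len(configs[j]))) <= GPU_BUDGET

prob.solve(PULP_CBC_CMD(msg=0))
for j in configs:
    k = next(i for i, v in enumerate(x[j]) if value(v) > 0.5)
    print(j, "->", configs[j][k])
print("makespan:", value(makespan))
```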
arXiv Detail & Related papers (2023-09-03T17:19:11Z)
- SWARM Parallelism: Training Large Models Can Be Surprisingly
Communication-Efficient [69.61083127540776]
Deep learning applications benefit from using large models with billions of parameters.
Training these models is notoriously expensive due to the need for specialized HPC clusters.
We consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions.
arXiv Detail & Related papers (2023-01-27T18:55:19Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
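As a minimal sketch of merging in parameter space, the snippet below uniformly averages matching parameters of checkpoints that share an architecture and loads the result into a fresh model; the paper's actual fusion method is more refined than plain averaging.

```python
import torch
import torch.nn as nn

def merge_state_dicts(state_dicts, weights=None):
    """Average matching parameters across models with identical architectures."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged

# Two hypothetical fine-tuned checkpoints of the same base model (toy architecture).
def make_model():
    return nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

model_a, model_b = make_model(), make_model()
fused = make_model()
fused.load_state_dict(merge_state_dicts([model_a.state_dict(), model_b.state_dict()]))
```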
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Hydra: A System for Large Multi-Model Deep Learning [3.571623412954477]
We present 'model spilling', a technique aimed at models such as Transformers and CNNs to move groups of layers between DRAM and GPU memory.
We then present a set of novel techniques leveraging spilling to raise efficiency for multi-model training workloads.
Experiments with real benchmark workloads show that HYDRA is over 7x faster than regular model parallelism and over 50% faster than state-of-the-art industrial tools for pipeline parallelism.
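A minimal sketch of the spilling idea (not Hydra's implementation): keep layer groups in host memory and page each group onto the GPU only for the moment it executes, so peak GPU memory holds roughly one group at a time.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Layer groups of a toy sequential model, kept in host DRAM by default.
groups = [
    nn.Sequential(nn.Linear(256, 256), nn.ReLU()),
    nn.Sequential(nn.Linear(256, 256), nn.ReLU()),
    nn.Sequential(nn.Linear(256, 10)),
]

def spilled_forward(x):
    """Run groups one at a time, spilling each back to CPU after use so only
    one group occupies GPU memory at any moment (a rough sketch of spilling)."""
    x = x.to(device)
    for g in groups:
        g.to(device)   # page this group into GPU memory
        x = g(x)
        g.to("cpu")    # spill it back to DRAM to free GPU memory
    return x

with torch.no_grad():
    out = spilled_forward(torch.randn(4, 256))
print(out.shape)  # torch.Size([4, 10])
```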
arXiv Detail & Related papers (2021-10-16T18:13:57Z)
- Model-Parallel Model Selection for Deep Learning Systems [0.0]
Inefficiencies in machine learning (ML) training prevent practical usage of state-of-the-art models for most users.
Many ML practitioners have turned to model parallelism as a method of distributing the computational requirements across several devices.
We propose a new form of "shard parallelism" combining task and model parallelism, then package it into a framework we name Hydra.
arXiv Detail & Related papers (2021-07-14T03:20:37Z)
- Ensemble Distillation for Robust Model Fusion in Federated Learning [72.61259487233214]
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model.
In most of the current training schemes the central model is refined by averaging the parameters of the server model and the updated parameters from the client side.
We propose ensemble distillation for model fusion, i.e. training the central classifier through unlabeled data on the outputs of the models from the clients.
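A minimal sketch of the fusion step described here, assuming a simple KL-based distillation loop (the function name and hyperparameters are illustrative, not the paper's exact recipe): the central model is trained on unlabeled inputs to match the averaged softmax outputs of the client models, rather than averaging their parameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill_fusion(server_model, client_models, unlabeled_loader, epochs=1, lr=1e-3):
    """Refine the server model by distilling the averaged client predictions
    on unlabeled data (a sketch of the ensemble-distillation fusion step)."""
    opt = torch.optim.Adam(server_model.parameters(), lr=lr)
    for _ in range(epochs):
        for x in unlabeled_loader:
            with torch.no_grad():
                # Teacher signal: average of the client models' softmax outputs.
                teacher = torch.stack([F.softmax(m(x), dim=-1) for m in client_models]).mean(0)
            student_logp = F.log_softmax(server_model(x), dim=-1)
            loss = F.kl_div(student_logp, teacher, reduction="batchmean")
            opt.zero_grad()
            loss.backward()
            opt.step()
    return server_model

# Toy usage: three clients, one server, random unlabeled batches.
make = lambda: nn.Linear(16, 4)
clients = [make() for _ in range(3)]
server = make()
unlabeled = [torch.randn(8, 16) for _ in range(5)]
distill_fusion(server, clients, unlabeled)
```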
arXiv Detail & Related papers (2020-06-12T14:49:47Z)
- When Ensembling Smaller Models is More Efficient than Single Large
Models [52.38997176317532]
We show that ensembles can outperform single models, achieving higher accuracy while requiring fewer total FLOPs to compute.
This presents an interesting observation that output diversity in ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.