DeltaZip: Multi-Tenant Language Model Serving via Delta Compression
- URL: http://arxiv.org/abs/2312.05215v1
- Date: Fri, 8 Dec 2023 18:07:05 GMT
- Title: DeltaZip: Multi-Tenant Language Model Serving via Delta Compression
- Authors: Xiaozhe Yao, Ana Klimovic
- Abstract summary: We propose DeltaZip, an LLM serving system that efficiently serves multiple fine-tuned models concurrently.
DeltaZip increases serving throughput by $1.5\times$ to $3\times$ and improves SLO attainment compared to a vanilla HuggingFace serving system.
- Score: 0.479814360045118
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Fine-tuning large language models (LLMs) for downstream tasks can greatly
improve model quality, however serving many different fine-tuned LLMs
concurrently for users in multi-tenant environments is challenging. Dedicating
GPU memory for each model is prohibitively expensive and naively swapping large
model weights in and out of GPU memory is slow. Our key insight is that
fine-tuned models can be quickly swapped in and out of GPU memory by extracting
and compressing the delta between each model and its pre-trained base model. We
propose DeltaZip, an LLM serving system that efficiently serves multiple
full-parameter fine-tuned models concurrently by aggressively compressing model
deltas by a factor of $6\times$ to $8\times$ while maintaining high model
quality. DeltaZip increases serving throughput by $1.5\times$ to $3\times$ and
improves SLO attainment compared to a vanilla HuggingFace serving system.
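As a rough illustration of the key insight, here is a minimal PyTorch sketch of extracting, quantizing, and reconstructing a model delta. The helper names and the toy symmetric quantizer are assumptions for illustration only; DeltaZip's actual pipeline compresses deltas far more aggressively (the $6\times$ to $8\times$ above) while preserving model quality.

```python
import torch

def extract_delta(base: dict, finetuned: dict) -> dict:
    """Per-tensor difference between a fine-tuned model and its base."""
    return {name: finetuned[name] - base[name] for name in base}

def compress_delta(delta: dict, bits: int = 4) -> dict:
    """Toy symmetric quantizer: integer codes plus one scale per tensor."""
    qmax = 2 ** (bits - 1) - 1
    out = {}
    for name, d in delta.items():
        scale = d.abs().max().clamp(min=1e-8) / qmax
        codes = (d / scale).round().clamp(-qmax - 1, qmax).to(torch.int8)
        out[name] = (codes, scale)
    return out

def reconstruct(base: dict, compressed: dict) -> dict:
    """Serving path: add the dequantized delta back onto the shared base."""
    return {name: base[name] + codes.float() * scale
            for name, (codes, scale) in compressed.items()}
```

Only the compressed deltas need to be swapped in and out per fine-tuned model; the base model's weights stay resident in GPU memory.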
Related papers
- Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models [62.4691912312317]
Mixture-of-Experts (MoE) language models can reduce computational costs by $2\times$ to $4\times$ compared to dense models without sacrificing performance.
We propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency.
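A minimal sketch of the hybrid idea, assuming a toy PyTorch MoE layer (the class below is illustrative, not the paper's DS-MoE architecture): all experts run during training, while inference routes each input to only the top-$k$ experts.

```python
import torch
import torch.nn as nn

class DenseTrainSparseInferMoE(nn.Module):
    """Toy hybrid MoE: dense over all experts in training, sparse top-k at inference."""
    def __init__(self, dim: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gates = self.router(x).softmax(dim=-1)              # (batch, num_experts)
        if self.training:                                    # dense: every expert runs
            outs = torch.stack([e(x) for e in self.experts], dim=-1)
            return (outs * gates.unsqueeze(1)).sum(dim=-1)
        topv, topi = gates.topk(self.k, dim=-1)              # sparse: only top-k run
        out = torch.zeros_like(x)
        for j in range(self.k):
            for i, expert in enumerate(self.experts):
                mask = topi[:, j] == i
                if mask.any():
                    out[mask] += topv[mask, j, None] * expert(x[mask])
        return out
```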
arXiv Detail & Related papers (2024-04-08T14:39:49Z)
- When Do We Not Need Larger Vision Models? [55.957626371697785]
Scaling up the size of vision models has been the de facto standard to obtain more powerful visual representations.
We demonstrate the power of Scaling on Scales (S$^2$), whereby a pre-trained and frozen smaller vision model can outperform larger models.
We release a Python package that can apply S$^2$ to any vision model with one line of code.
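A minimal sketch of the multi-scale idea, assuming a generic frozen backbone that maps a batch of images to feature vectors; the helper name is hypothetical, and the released S$^2$ package operates on sub-images of large inputs rather than naively resizing as done here.

```python
import torch
import torch.nn.functional as F

def s2_features(model, image: torch.Tensor, scales=(1, 2)) -> torch.Tensor:
    """Run a frozen backbone at several input scales and concatenate features.
    `model` maps (B, C, H, W) -> (B, D); a stand-in for any vision model."""
    feats = []
    _, _, h, w = image.shape
    with torch.no_grad():                      # the backbone stays frozen
        for s in scales:
            scaled = F.interpolate(image, size=(h * s, w * s),
                                   mode="bilinear", align_corners=False)
            feats.append(model(scaled))
    return torch.cat(feats, dim=-1)            # (B, D * len(scales))
```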
arXiv Detail & Related papers (2024-03-19T17:58:39Z)
- BitDelta: Your Fine-Tune May Only Be Worth One Bit [60.44468282930883]
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks.
We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance.
By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than $10\times$.
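A minimal sketch of the 1-bit idea: keep only the sign of each delta entry plus a single per-tensor scale. Using the mean absolute delta as the scale is an assumption here (one natural initialization); a real implementation would also pack the signs into a bitmask.

```python
import torch

def bitdelta_compress(delta: torch.Tensor):
    """1 bit per parameter: the sign of each delta entry plus one scale."""
    scale = delta.abs().mean()
    signs = delta >= 0          # boolean tensor, packable into a bitmask
    return signs, scale

def bitdelta_decompress(signs: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Rebuild the (approximate) delta to add back onto the base weights."""
    return torch.where(signs, scale, -scale)
```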
arXiv Detail & Related papers (2024-02-15T18:50:06Z)
- Herd: Using multiple, smaller LLMs to match the performances of proprietary, large LLMs via an intelligent composer [1.0878040851637998]
We show that a herd of open source models can match or exceed the performance of proprietary models via an intelligent router.
When GPT is unable to answer a query, Herd can identify a model that can at least 40% of the time.
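A minimal sketch of such routing, where `scorer` is a hypothetical function predicting how well each model in the herd will handle a query; the interfaces are assumptions, not Herd's actual composer.

```python
def route(query: str, models: list, scorer) -> str:
    """Toy router: rank models by a predicted ability to answer the query,
    then fall back down the ranking if a model declines or fails."""
    ranked = sorted(models, key=lambda m: scorer(query, m), reverse=True)
    for model in ranked:
        answer = model.generate(query)   # hypothetical model interface
        if answer is not None:
            return answer
    raise RuntimeError("no model in the herd could answer the query")
```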
arXiv Detail & Related papers (2023-10-30T18:11:02Z)
- Computron: Serving Distributed Deep Learning Models with Model Parallel Swapping [5.429059120074075]
Many of the most performant deep learning models today in fields like language and image understanding contain billions of parameters.
We develop Computron, a system that uses memory swapping to serve multiple distributed models on a shared GPU cluster.
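A single-GPU toy version of the swapping idea (Computron itself swaps model-parallel shards across a shared GPU cluster; the LRU policy and class below are illustrative assumptions):

```python
import torch

class ModelSwapper:
    """Toy swap manager: keep at most `capacity` models resident on the GPU,
    evicting the least-recently-used model back to host memory (DRAM)."""
    def __init__(self, capacity: int = 2, device: str = "cuda"):
        self.capacity, self.device = capacity, device
        self.resident = []      # LRU order, most recently used last
        self.models = {}

    def register(self, name: str, model: torch.nn.Module):
        self.models[name] = model.to("cpu")

    def acquire(self, name: str) -> torch.nn.Module:
        if name in self.resident:
            self.resident.remove(name)
        elif len(self.resident) >= self.capacity:        # evict LRU to DRAM
            self.models[self.resident.pop(0)].to("cpu")
        self.resident.append(name)
        return self.models[name].to(self.device)
```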
arXiv Detail & Related papers (2023-06-24T01:38:23Z)
- ZipLM: Inference-Aware Structured Pruning of Language Models [56.52030193434863]
We propose a novel structured compression approach for large language models (LLMs) called ZipLM.
ZipLM achieves state-of-the-art accuracy-vs-speedup trade-offs while matching a set of desired target runtime speedups.
ZipLM produces state-of-the-art compressed models across all settings.
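A minimal sketch of structured pruning on a single linear layer, scoring whole output neurons by their weight norm. The norm-based criterion is an assumption for illustration; ZipLM instead selects which structures to remove using loss-aware saliency together with measured runtime targets.

```python
import torch
import torch.nn as nn

def prune_neurons(linear: nn.Linear, keep_ratio: float) -> nn.Linear:
    """Toy structured pruning: drop whole output neurons (rows of the
    weight matrix) with the smallest L2 norm, keeping `keep_ratio` of them."""
    norms = linear.weight.norm(dim=1)                  # one score per output row
    k = max(1, int(keep_ratio * linear.out_features))
    keep = norms.topk(k).indices.sort().values
    pruned = nn.Linear(linear.in_features, k, bias=linear.bias is not None)
    with torch.no_grad():
        pruned.weight.copy_(linear.weight[keep])
        if linear.bias is not None:
            pruned.bias.copy_(linear.bias[keep])
    return pruned
```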
arXiv Detail & Related papers (2023-02-07T18:55:28Z)
- Petals: Collaborative Inference and Fine-tuning of Large Models [78.37798144357977]
Many NLP tasks benefit from using large language models (LLMs) that often have more than 100 billion parameters.
With the release of BLOOM-176B and OPT-175B, everyone can download pretrained models of this scale.
We propose Petals, a system for collaborative inference and fine-tuning of large models that joins the resources of multiple parties.
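A local toy of the collaborative pipeline, where each party hosts a contiguous slice of the model's layers and a client chains activations through them in order. In Petals the hand-off crosses the network between volunteers, with routing and fault tolerance not shown; these classes are illustrative assumptions.

```python
import torch.nn as nn

class Party:
    """Toy stand-in for one volunteer server hosting a slice of the layers."""
    def __init__(self, layers: nn.ModuleList):
        self.layers = layers

    def forward(self, hidden):
        for layer in self.layers:      # run only this party's layer slice
            hidden = layer(hidden)
        return hidden

def collaborative_forward(parties, hidden):
    """Client-side driver: pass activations through each party in sequence."""
    for party in parties:
        hidden = party.forward(hidden)
    return hidden
```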
arXiv Detail & Related papers (2022-09-02T17:38:03Z)
- Hydra: A System for Large Multi-Model Deep Learning [3.571623412954477]
We present 'model spilling', a technique aimed at models such as Transformers and CNNs that moves groups of layers between DRAM and GPU memory.
We then present a set of novel techniques leveraging spilling to raise efficiency for multi-model training workloads.
Experiments with real benchmark workloads show that Hydra is over $7\times$ faster than regular model parallelism and over 50% faster than state-of-the-art industrial tools for pipeline parallelism.
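A minimal sketch of the spilling idea: stream one group of layers at a time from DRAM to GPU memory, so the whole model never has to be resident at once. The function is an illustrative assumption, not Hydra's actual scheduler.

```python
def spilled_forward(layer_groups, x, device="cuda"):
    """Toy model spilling: move each group of layers DRAM -> GPU, run it,
    then spill it back to DRAM to free GPU memory for the next group."""
    for group in layer_groups:          # each group: an nn.Sequential on CPU
        group.to(device)
        x = group(x.to(device))
        group.to("cpu")                 # spill back before the next group
    return x
```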
arXiv Detail & Related papers (2021-10-16T18:13:57Z)
- Exploring Sparse Expert Models and Beyond [51.90860155810848]
Mixture-of-Experts (MoE) models can achieve promising results with an outrageously large number of parameters but constant computation cost.
We propose a simple method called expert prototyping that splits experts into different prototypes and applies $k$ top-$1$ routing.
This strategy improves model quality while maintaining constant computational cost, and our further exploration of extremely large-scale models shows that it is more effective for training larger models.
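A minimal sketch of expert prototyping as a toy PyTorch layer: experts are split into $k$ groups (prototypes), each group performs its own top-$1$ routing, and the group outputs are summed. The details below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class PrototypedMoE(nn.Module):
    """Toy expert prototyping: k groups of experts, top-1 routing per group."""
    def __init__(self, dim: int, experts_per_group: int = 4, k_groups: int = 2):
        super().__init__()
        self.groups = nn.ModuleList(
            nn.ModuleList(nn.Linear(dim, dim) for _ in range(experts_per_group))
            for _ in range(k_groups))
        self.routers = nn.ModuleList(nn.Linear(dim, experts_per_group)
                                     for _ in range(k_groups))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.zeros_like(x)
        for experts, router in zip(self.groups, self.routers):
            gates = router(x).softmax(dim=-1)
            top1 = gates.argmax(dim=-1)          # top-1 routing inside the group
            for i, expert in enumerate(experts):
                mask = top1 == i
                if mask.any():
                    out[mask] += gates[mask, i, None] * expert(x[mask])
        return out                                # sum of the k group outputs
```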
arXiv Detail & Related papers (2021-05-31T16:12:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.