DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs
- URL: http://arxiv.org/abs/2312.05215v2
- Date: Fri, 01 Nov 2024 21:56:48 GMT
- Title: DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs
- Authors: Xiaozhe Yao, Qinghao Hu, Ana Klimovic
- Abstract summary: Fine-tuning large language models (LLMs) greatly improves model quality for downstream tasks.
Serving many fine-tuned LLMs concurrently, however, is challenging due to their sporadic, bursty, and varying request patterns.
We present DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently.
- Score: 7.1597349516197655
- License:
- Abstract: Fine-tuning large language models (LLMs) greatly improves model quality for downstream tasks. However, serving many fine-tuned LLMs concurrently is challenging due to the sporadic, bursty, and varying request patterns of different LLMs. To bridge this gap, we present DeltaZip, an LLM serving system that efficiently serves multiple full-parameter fine-tuned models concurrently by aggressively compressing model deltas by up to 10x while maintaining high model quality. The key insight behind this design is that fine-tuning results in small-magnitude changes to the pre-trained model. By co-designing the serving system with the compression algorithm, DeltaZip achieves 2x to 12x improvement in throughput compared to the state-of-the-art systems.
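The key insight lends itself to a compact illustration. Below is a minimal sketch (not the DeltaZip implementation) of how a fine-tuning delta can be extracted, sparsified, and quantized, then re-applied to the base weights at serving time; the sparsity level, bit width, and tensor shapes are illustrative assumptions.

```python
# Illustrative sketch of delta compression (not the DeltaZip codebase):
# fine-tuned weights differ from the base by a small-magnitude delta, which can
# be sparsified and quantized far more aggressively than the full weights.
import torch

def compress_delta(w_base: torch.Tensor, w_ft: torch.Tensor,
                   sparsity: float = 0.9, num_bits: int = 4):
    """Keep the largest-magnitude fraction of the delta and quantize it."""
    delta = w_ft - w_base
    # Sparsify: zero out all but the top-(1 - sparsity) entries by magnitude.
    k = max(1, int(delta.numel() * (1.0 - sparsity)))
    threshold = delta.abs().flatten().topk(k).values[-1]
    mask = delta.abs() >= threshold
    sparse_delta = delta * mask
    # Symmetric uniform quantization of the surviving entries.
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = sparse_delta.abs().max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = torch.clamp(torch.round(sparse_delta / scale), -qmax, qmax).to(torch.int8)
    return q, scale, mask

def decompress_delta(q: torch.Tensor, scale, mask: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale * mask

# Toy usage: the reconstructed fine-tuned weights stay close to the originals.
torch.manual_seed(0)
w_base = torch.randn(256, 256)
w_ft = w_base + 0.01 * torch.randn(256, 256)   # fine-tuning makes small changes
q, scale, mask = compress_delta(w_base, w_ft)
w_rec = w_base + decompress_delta(q, scale, mask)
print("max reconstruction error:", (w_rec - w_ft).abs().max().item())
```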
Related papers
- BARE: Combining Base and Instruction-Tuned Language Models for Better Synthetic Data Generation [71.46236155101032]
We propose Base-Refine, a synthetic data generation method that combines the diversity of base models with the quality of instruct-tuned models.
We show that fine-tuning with BARE-generated data achieves a 101% improvement over instruct-only data on GSM8K and an 18.4% improvement over SOTA methods on RAFT.
arXiv Detail & Related papers (2025-02-03T00:12:40Z)
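The base-then-refine idea can be sketched as a two-stage generation pipeline with the Hugging Face transformers API; the model names, prompts, and sampling settings below are placeholders, not BARE's actual configuration.

```python
# Illustrative two-stage generation in the spirit of Base-Refine (BARE):
# a base model drafts diverse samples, an instruction-tuned model refines them.
# Model names and prompt wording are placeholders, not BARE's actual setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

def generate(model, tokenizer, prompt, max_new_tokens=128, temperature=1.0):
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         do_sample=True, temperature=temperature)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

base_name = "meta-llama/Llama-3.1-8B"                # placeholder base model
instruct_name = "meta-llama/Llama-3.1-8B-Instruct"   # placeholder instruct model
base_tok = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(base_name)
inst_tok = AutoTokenizer.from_pretrained(instruct_name)
inst_model = AutoModelForCausalLM.from_pretrained(instruct_name)

# Stage 1: the base model produces a diverse (possibly noisy) draft example.
draft = generate(base_model, base_tok,
                 "Q: A train travels 60 km in 1.5 hours. What is its speed?\nA:",
                 temperature=1.0)
# Stage 2: the instruct model cleans the draft into a well-formed training example.
refined = generate(inst_model, inst_tok,
                   f"Rewrite the following draft as a clear question-answer pair:\n{draft}",
                   temperature=0.3)
print(refined)
```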
- Mamba-Shedder: Post-Transformer Compression for Efficient Selective Structured State Space Models [1.8434042562191815]
This paper explores the compression of SSM-based models, particularly Mamba and its hybrids.
We study the sensitivity of these models to the removal of selected components at different granularities to reduce the model size and computational overhead, thus improving their efficiency while maintaining accuracy.
The proposed solutions, collectively referred to as Mamba-Shedder, achieve a speedup of up to 1.4x during inference, demonstrating that model efficiency can be improved by eliminating several redundancies with minimal impact on the overall model performance.
arXiv Detail & Related papers (2025-01-28T17:22:01Z)
- DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization [17.501956455837707]
Large language models achieve exceptional performance on various downstream tasks through supervised fine-tuning.
Current methods that compress the delta weight struggle to achieve ultra-high compression.
We propose DeltaDQ, a novel distribution-driven delta compression framework that achieves ultra-high compression of the delta weight.
arXiv Detail & Related papers (2024-10-11T09:44:16Z)
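One way to picture group-wise dropout with separate quantization is sketched below; this is an illustration of the general idea from the DeltaDQ entry above, not the DeltaDQ algorithm itself, and the group size, drop ratio, and bit width are assumed values.

```python
# Illustrative group-wise dropout + per-group quantization of a delta weight.
# A sketch of the general idea, not the DeltaDQ algorithm itself.
import math
import torch

def groupwise_compress(delta: torch.Tensor, group_size: int = 64,
                       drop_ratio: float = 0.5, num_bits: int = 4):
    flat = delta.flatten()
    pad = (-flat.numel()) % group_size
    flat = torch.cat([flat, flat.new_zeros(pad)])      # pad to a multiple of group_size
    groups = flat.view(-1, group_size)                  # (num_groups, group_size)

    # "Group-wise dropout": discard whole groups with the smallest norm.
    norms = groups.norm(dim=1)
    num_keep = max(1, int(groups.shape[0] * (1.0 - drop_ratio)))
    keep_idx = norms.topk(num_keep).indices

    # "Separate quantization": each kept group gets its own scale.
    qmax = 2 ** (num_bits - 1) - 1
    kept = groups[keep_idx]
    scales = kept.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(kept / scales), -qmax, qmax).to(torch.int8)
    return q, scales, keep_idx, delta.shape, pad, group_size

def groupwise_decompress(q, scales, keep_idx, shape, pad, group_size):
    num_groups = (math.prod(shape) + pad) // group_size
    groups = q.new_zeros((num_groups, group_size), dtype=torch.float32)
    groups[keep_idx] = q.to(torch.float32) * scales
    flat = groups.flatten()
    flat = flat[:flat.numel() - pad] if pad else flat
    return flat.view(shape)

delta = 0.01 * torch.randn(128, 128)
packed = groupwise_compress(delta)
delta_rec = groupwise_decompress(*packed)
print("mean abs error:", (delta_rec - delta).abs().mean().item())
```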
- Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models [79.46938238953916]
Fine-tuning large language models (LLMs) for diverse applications is crucial to meet complex demands.
Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs.
In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs.
arXiv Detail & Related papers (2024-06-13T07:57:27Z)
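A hedged sketch of low-rank, mixed-precision delta compression in the spirit of the Delta-CoMe entry above: decompose the delta with an SVD and spend more bits on the dominant singular directions. The rank split and bit widths are illustrative assumptions, not Delta-CoMe's actual scheme.

```python
# Illustrative mixed-precision, low-rank compression of a delta matrix:
# singular vectors with large singular values get more bits than the rest.
# A sketch of the general idea, not Delta-CoMe's actual scheme.
import torch

def fake_quantize(x: torch.Tensor, num_bits: int):
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp(torch.round(x / scale), -qmax, qmax) * scale

def mixed_precision_lowrank(delta: torch.Tensor, rank_hi=8, rank_lo=56,
                            bits_hi=8, bits_lo=3):
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    def band(lo, hi, bits):
        u = fake_quantize(U[:, lo:hi], bits)
        s = S[lo:hi]                     # singular values kept in full precision
        v = fake_quantize(Vh[lo:hi, :], bits)
        return u @ torch.diag(s) @ v
    # High-precision band for the dominant singular directions,
    # low-precision band for the tail that still carries useful signal.
    return band(0, rank_hi, bits_hi) + band(rank_hi, rank_hi + rank_lo, bits_lo)

# Toy delta with roughly low-rank structure plus a small noise floor.
delta = 0.01 * torch.randn(512, 16) @ torch.randn(16, 512) + 0.001 * torch.randn(512, 512)
approx = mixed_precision_lowrank(delta)
err = (approx - delta).norm() / delta.norm()
print(f"relative reconstruction error: {err.item():.3f}")
```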
- EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, requiring no data or additional training, while showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
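The Elect, Mask & Rescale steps can be pictured with a simplified sketch over per-task delta (task) vectors from a shared base model; this follows the summary above rather than the paper's exact procedure, and the toy shapes are assumptions.

```python
# Simplified sketch of an Elect, Mask & Rescale style merge of task vectors
# (deltas from a shared base model). Not the exact EMR-Merging procedure.
import torch

def emr_style_merge(task_vectors):
    stacked = torch.stack(task_vectors)                 # (num_tasks, ...)
    # Elect: choose a sign per element by summing across tasks,
    # and take the largest agreeing magnitude as the unified value.
    elected_sign = torch.sign(stacked.sum(dim=0))
    agree = torch.sign(stacked) == elected_sign
    unified = (stacked * agree).abs().amax(dim=0) * elected_sign

    masks, scales = [], []
    for tv in task_vectors:
        # Mask: keep only positions where this task agrees with the elected sign.
        mask = (torch.sign(tv) == elected_sign) & (unified != 0)
        # Rescale: match the average magnitude of this task's own delta.
        masked = unified * mask
        scale = tv.abs().sum() / masked.abs().sum().clamp(min=1e-8)
        masks.append(mask)
        scales.append(scale)
    return unified, masks, scales

def apply_for_task(base, unified, mask, scale):
    # Reconstruct task-specific weights from the shared unified vector
    # plus the task's lightweight mask and rescaler.
    return base + scale * (unified * mask)

base = torch.randn(1000)
tvs = [0.02 * torch.randn(1000) for _ in range(3)]      # toy task vectors
unified, masks, scales = emr_style_merge(tvs)
w_task0 = apply_for_task(base, unified, masks[0], scales[0])
print("cosine similarity to original task-0 weights:",
      torch.nn.functional.cosine_similarity(w_task0, base + tvs[0], dim=0).item())
```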
- Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
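Soft prompting a compressed model amounts to prepending a small set of trainable embedding vectors to the frozen model's input embeddings. The sketch below uses gpt2 as a stand-in for a compressed LLM and is a minimal illustration, not the paper's training recipe; prompt length and learning rate are assumed values.

```python
# Minimal sketch of soft prompt learning for a frozen (e.g. compressed) causal LM.
# The model name is a placeholder; this is not the paper's training recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                      # stand-in for a compressed LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
for p in model.parameters():             # the compressed model stays frozen
    p.requires_grad_(False)

num_prompt_tokens = 16
hidden = model.get_input_embeddings().embedding_dim
soft_prompt = torch.nn.Parameter(0.02 * torch.randn(num_prompt_tokens, hidden))
optimizer = torch.optim.AdamW([soft_prompt], lr=1e-3)

def step(text: str):
    ids = tok(text, return_tensors="pt")["input_ids"]
    tok_emb = model.get_input_embeddings()(ids)                   # (1, T, H)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), tok_emb], dim=1)
    # Ignore the soft prompt positions in the language-modeling loss.
    labels = torch.cat([torch.full((1, num_prompt_tokens), -100), ids], dim=1)
    out = model(inputs_embeds=inputs_embeds, labels=labels)
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

print(step("The capital of France is Paris."))
```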
- MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers [78.85346970193518]
Megabyte is a multi-scale decoder architecture that enables end-to-end differentiable modeling of sequences of over one million bytes.
Experiments show that Megabyte allows byte-level models to perform competitively with subword models on long context language modeling.
Results establish the viability of tokenization-free autoregressive sequence modeling at scale.
arXiv Detail & Related papers (2023-05-12T00:55:41Z)
- ZipLM: Inference-Aware Structured Pruning of Language Models [56.52030193434863]
We propose a novel structured compression approach for large language models (LLMs) called ZipLM.
ZipLM achieves a state-of-the-art accuracy-vs-speedup trade-off while matching a set of desired target runtime speedups.
ZipLM produces state-of-the-art compressed models across all settings.
arXiv Detail & Related papers (2023-02-07T18:55:28Z)
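The "inference-aware" aspect of the ZipLM entry above can be illustrated with a toy greedy search that removes the structures costing the least accuracy per unit of latency until a target speedup is met; this is not ZipLM's actual algorithm, and all names and numbers below are made up for illustration.

```python
# Toy sketch of inference-aware structured pruning: remove whole structures
# (heads, layers, FFN blocks) in order of loss-increase per latency-saved until
# a target runtime speedup is reached. Not ZipLM's actual algorithm; the
# latency and loss numbers below are made up for illustration.
from dataclasses import dataclass

@dataclass
class Structure:
    name: str
    latency_ms: float      # measured cost of this structure at inference time
    loss_increase: float   # estimated accuracy cost of removing it

def prune_to_speedup(structures, target_speedup):
    total = sum(s.latency_ms for s in structures)
    budget = total / target_speedup                  # latency we may keep
    kept = sorted(structures, key=lambda s: s.loss_increase / s.latency_ms,
                  reverse=True)                      # most "worth keeping" first
    removed, latency = [], total
    # Greedily drop the cheapest-to-remove structures until we fit the budget.
    for s in reversed(kept):
        if latency <= budget:
            break
        removed.append(s.name)
        latency -= s.latency_ms
    return removed, total / latency

structures = [
    Structure("layer3.head5", 0.8, 0.001),
    Structure("layer7.ffn", 2.5, 0.004),
    Structure("layer1.head2", 0.8, 0.010),
    Structure("layer9.head1", 0.8, 0.002),
    Structure("layer4.ffn", 2.5, 0.020),
]
removed, speedup = prune_to_speedup(structures, target_speedup=1.5)
print("removed:", removed, "achieved speedup: %.2fx" % speedup)
```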
- Multi-stage Progressive Compression of Conformer Transducer for On-device Speech Recognition [7.450574974954803]
Limited memory bandwidth in smart devices prompts the development of smaller Automatic Speech Recognition (ASR) models.
Knowledge distillation (KD) is a popular model compression approach that has been shown to achieve smaller model sizes.
We propose a multi-stage progressive approach to compress the conformer transducer model using KD.
arXiv Detail & Related papers (2022-10-01T02:23:00Z)
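At its core, KD trains a small student to match a larger teacher's output distribution. The sketch below shows a generic distillation loss, not the paper's multi-stage recipe for conformer transducers; the temperature and mixing weight are assumed values.

```python
# Generic knowledge distillation loss: the student is trained on a mix of the
# hard-label loss and the KL divergence to the teacher's softened outputs.
# A textbook sketch, not the paper's multi-stage conformer-transducer recipe.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: KL between temperature-softened teacher and student outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy usage with random logits for a 10-class problem.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
print("distillation loss:", loss.item())
```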