ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models
- URL: http://arxiv.org/abs/2406.09041v2
- Date: Sat, 26 Oct 2024 15:55:49 GMT
- Title: ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models
- Authors: Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang
- Abstract summary: LLM development involves pre-training a foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts.
Previous approaches decompose the expert weights into the pre-trained weights plus delta weights, then quantize the delta weights to reduce the model size.
We introduce ME-Switch, a memory-efficient expert switching framework tailored for serving multiple LLMs.
- Score: 43.29533894162248
- License:
- Abstract: LLM development involves pre-training a foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts. Serving these experts can pose significant memory challenges, as loading all experts onto devices is impractical, and frequent switching between experts in response to user requests can incur substantial I/O costs. Previous approaches decompose the expert weights into the pre-trained weights plus delta weights, then quantize the delta weights using output channel-wise step sizes to reduce the model size. However, these methods overlook the fact that certain input channels of the delta weights can cause significant quantization errors at extremely low bitwidths. Additionally, existing methods assume that the appropriate model for a user request is known in advance, which is not the case in practice. To this end, we introduce ME-Switch, a memory-efficient expert switching framework tailored for serving multiple LLMs. To reduce the number of bits required to describe the delta weights, we propose a salient-aware delta compression method that identifies salient input channels based on reconstruction error and applies mixed-precision quantization, reducing non-salient channels to low bits while keeping salient ones intact, cutting storage demand without compromising performance. Moreover, we develop a model-level routing method that efficiently directs user queries to the most suitable expert by performing domain classification. Extensive experiments show the promising memory efficiency and routing performance of ME-Switch. For example, when serving three models from the Mistral-7B family, ME-Switch reduces the model size by $1.74\times$ and maintains nearly lossless performance on instruction, mathematical reasoning, and code generation tasks. Notably, our method can efficiently serve 16 Mistral-7B models on a single NVIDIA A100 GPU.
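To make the salient-aware delta compression concrete, below is a minimal, self-contained sketch (our own illustration, not the authors' released code). It scores each input channel of the delta by the reconstruction error that low-bit quantization would introduce, keeps the highest-error (salient) channels intact, and quantizes the remaining channels to low bits; the keep_ratio, the 2-bit setting, and the per-channel scoring are illustrative assumptions.

```python
# Illustrative sketch of salient-aware delta compression (assumptions noted above).
import torch

def quantize_channel(delta_col: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric uniform quantization of one input channel of the delta."""
    qmax = 2 ** (bits - 1) - 1
    scale = delta_col.abs().max() / qmax + 1e-12
    return torch.clamp(torch.round(delta_col / scale), -qmax, qmax) * scale

def salient_aware_compress(w_base, w_expert, keep_ratio=0.02, bits=2):
    delta = w_expert - w_base                          # [out_features, in_features]
    # Score each input channel by the error its low-bit quantization would cause.
    errors = torch.stack([
        (quantize_channel(delta[:, c], bits) - delta[:, c]).pow(2).sum()
        for c in range(delta.shape[1])
    ])
    k = max(1, int(keep_ratio * delta.shape[1]))
    salient = set(torch.topk(errors, k).indices.tolist())   # channels kept intact
    compressed = delta.clone()
    for c in range(delta.shape[1]):
        if c not in salient:
            compressed[:, c] = quantize_channel(delta[:, c], bits)
    return compressed, salient

# Usage: keep one full-precision base model and reconstruct an expert on demand.
w_base = torch.randn(64, 128)
w_expert = w_base + 0.01 * torch.randn(64, 128)
delta_q, salient_idx = salient_aware_compress(w_base, w_expert)
w_restored = w_base + delta_q
print((w_restored - w_expert).abs().max())   # small reconstruction error
```

At serving time, only the shared base weights plus each expert's compressed delta need to stay resident; the requested expert is rebuilt as the base weights plus its delta when the router selects it.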
Related papers
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
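A minimal sketch of that insight is below (our illustration, not HOBBIT's implementation; the loader callbacks, the criticality threshold, and the FIFO eviction are hypothetical placeholders).

```python
# Toy expert cache: on a miss, fetch a cheaper low-bit copy for non-critical experts.
class ExpertCache:
    def __init__(self, capacity, load_fp16, load_int4, critical_threshold=0.2):
        self.capacity = capacity
        self.cache = {}                  # expert_id -> weights resident on the device
        self.load_fp16 = load_fp16       # slow path: full-precision load from storage
        self.load_int4 = load_int4       # fast path: low-precision load from storage
        self.threshold = critical_threshold

    def get(self, expert_id, gate_score):
        if expert_id in self.cache:
            return self.cache[expert_id]
        # Cache miss: pay the full-precision I/O cost only for critical experts
        # (high gating score); otherwise fetch the smaller low-bit copy.
        weights = (self.load_fp16(expert_id) if gate_score >= self.threshold
                   else self.load_int4(expert_id))
        if len(self.cache) >= self.capacity:
            self.cache.pop(next(iter(self.cache)))   # naive FIFO eviction
        self.cache[expert_id] = weights
        return weights

# Usage with stand-in loaders:
cache = ExpertCache(capacity=2,
                    load_fp16=lambda eid: f"fp16-weights-{eid}",
                    load_int4=lambda eid: f"int4-weights-{eid}")
print(cache.get(expert_id=3, gate_score=0.05))   # low gate score -> int4 copy
print(cache.get(expert_id=7, gate_score=0.90))   # critical expert -> fp16 copy
```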
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- PAT: Pruning-Aware Tuning for Large Language Models [19.622152991641045]
Large language models excel in language tasks, especially with supervised fine-tuning after pre-training.
Traditional post-hoc pruning often leads to significant performance loss.
We propose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy.
arXiv Detail & Related papers (2024-08-27T01:04:14Z)
- A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in finetuned MoE models.
We theoretically prove that prioritizing the pruning of experts with a smaller change in the router's l2 norm from the pre-trained model guarantees the preservation of test accuracy.
Although our theoretical analysis is centered on binary classification tasks with a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
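A short sketch of that criterion as we read it (illustrative only; the real method is applied to fine-tuned MoE checkpoints, and random tensors stand in for router weights here):

```python
# Rank experts by the l2 change of their router weights after fine-tuning;
# prune the ones whose routers changed least.
import torch

def experts_to_prune(router_pre: torch.Tensor,   # [num_experts, hidden_dim]
                     router_ft: torch.Tensor,    # [num_experts, hidden_dim]
                     num_prune: int) -> torch.Tensor:
    change = (router_ft - router_pre).norm(p=2, dim=1)   # per-expert l2 change
    return torch.argsort(change)[:num_prune]             # smallest change first

router_pre = torch.randn(8, 16)
router_ft = router_pre + 0.1 * torch.randn(8, 16)
print(experts_to_prune(router_pre, router_ft, num_prune=2))
```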
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
- Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer active parameters per token, but they are still hard to deploy due to their immense total parameter sizes.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z)
- BitDelta: Your Fine-Tune May Only Be Worth One Bit [57.558376557639555]
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks.
We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance.
By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x.
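The core idea can be sketched in a few lines (our illustration; BitDelta further calibrates the scales, which this sketch omits):

```python
# 1-bit delta: store only the sign of the fine-tuning delta plus one scale per matrix.
import torch

def one_bit_delta(w_base: torch.Tensor, w_finetuned: torch.Tensor):
    delta = w_finetuned - w_base
    scale = delta.abs().mean()            # a single scalar scale for this matrix
    sign = torch.sign(delta)              # the 1-bit payload (+1 / -1)
    return scale, sign

def restore(w_base, scale, sign):
    return w_base + scale * sign          # approximate fine-tuned weights

w_base = torch.randn(64, 64)
w_ft = w_base + 0.01 * torch.randn(64, 64)
scale, sign = one_bit_delta(w_base, w_ft)
print((restore(w_base, scale, sign) - w_ft).abs().mean())
```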
arXiv Detail & Related papers (2024-02-15T18:50:06Z)
- Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness [10.196942053244468]
Large Mixture of Experts (MoE) models can achieve state-of-the-art quality on various language tasks.
MoQE is a simple weight-only quantization method that applies ultra-low-bit quantization, down to 2 bits, only to the expert weights.
We show that low-bit quantization together with the MoE architecture delivers a reliable model performance.
arXiv Detail & Related papers (2023-10-03T20:11:23Z)
- EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models [3.597163516372061]
EdgeMoE is an on-device inference engine tailored for mixture-of-expert (MoE) LLMs.
It achieves both memory and computational efficiency by strategically partitioning the model across the storage hierarchy.
It demonstrates substantial memory savings and performance improvements when compared to competitive baseline solutions.
arXiv Detail & Related papers (2023-08-28T06:56:08Z)
- FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
- BASE Layers: Simplifying Training of Large, Sparse Models [53.98145464002843]
We introduce a new balanced assignment of experts (BASE) layer for large language models.
Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules.
We formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.
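That allocation can be illustrated with an off-the-shelf assignment solver (a toy sketch under the assumption that each expert's capacity is the token count divided by the expert count; BASE layers solve this at a much larger scale):

```python
# Balanced token-to-expert allocation as a linear assignment problem.
import numpy as np
from scipy.optimize import linear_sum_assignment

num_tokens, num_experts = 8, 4
capacity = num_tokens // num_experts              # tokens per expert

scores = np.random.rand(num_tokens, num_experts)  # token-to-expert affinity scores
# Give each expert 'capacity' slots so the problem becomes one-to-one.
cost = -np.repeat(scores, capacity, axis=1)       # maximize affinity = minimize negative
token_idx, slot_idx = linear_sum_assignment(cost)
expert_of_token = slot_idx // capacity            # map slots back to expert indices
print(expert_of_token)                            # each expert appears exactly 'capacity' times
```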
arXiv Detail & Related papers (2021-03-30T23:08:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.