Related papers: ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models

URL: http://arxiv.org/abs/2406.09041v1
Date: Thu, 13 Jun 2024 12:27:55 GMT
Title: ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models
Authors: Jing Liu, Ruihao Gong, Mingyang Zhang, Yefei He, Jianfei Cai, Bohan Zhuang
Abstract summary: We introduce ME-Switch, a memory-efficient expert switching framework for LLM serving. ME-Switch uses mixed-precision quantization, selectively quantizing non-salient input channels of delta weights to extremely low bits. ME-Switch can efficiently serve 16 models from the Mistral-7B family on a single NVIDIA A100 GPU.
Score: 43.29533894162248
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The typical process for developing LLMs involves pre-training a general foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts. Serving these experts poses challenges, as loading all experts onto devices is impractical, and frequent switching between experts in response to user requests incurs substantial I/O costs, increasing latency and expenses. Previous approaches decompose expert weights into pre-trained model weights and residual delta weights, then quantize the delta weights to reduce model size. However, these methods often lead to significant quantization errors at extremely low bitwidths and assume the appropriate model for a user request is known in advance, which is not practical. To address these issues, we introduce ME-Switch, a memory-efficient expert switching framework for LLM serving. ME-Switch uses mixed-precision quantization, selectively quantizing non-salient input channels of delta weights to extremely low bits while keeping salient ones intact, significantly reducing storage demands while maintaining performance. Additionally, we develop a routing method that efficiently directs user queries to the most suitable expert by transforming the model selection problem into a domain classification problem. Extensive experiments show ME-Switch's promising memory efficiency and routing performance. For example, when serving three models from the Mistral-7B family, ME-Switch reduces model size by 1.74x while maintaining nearly lossless performance on instruction, mathematical reasoning, and code generation tasks. Furthermore, ME-Switch can efficiently serve 16 models from the Mistral-7B family on a single NVIDIA A100 GPU.
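A minimal sketch of the delta-weight idea described in the abstract may help make it concrete. It is not the authors' implementation: the salience score (per-channel error under quantization), the sign-based 1-bit quantizer, and the names `sign_quantize`, `compress_delta`, and `reconstruct` are all assumptions chosen for illustration, and plain Python lists stand in for weight tensors.

```python
# Illustrative sketch only. The salience metric, the 1-bit quantizer and all
# function names are assumptions, not the method from the paper.

def sign_quantize(column):
    """1-bit quantize one input channel: keep signs, share one scale (mean |x|)."""
    scale = sum(abs(x) for x in column) / len(column)
    return [scale if x >= 0 else -scale for x in column]


def compress_delta(w_base, w_expert, num_salient):
    """Build a mixed-precision delta with the same [out][in] layout as the inputs."""
    rows, cols = len(w_base), len(w_base[0])
    # Delta weights: what fine-tuning changed relative to the base model.
    delta = [[w_expert[r][c] - w_base[r][c] for c in range(cols)]
             for r in range(rows)]
    # Treat each input channel as one column of the delta matrix.
    columns = [[delta[r][c] for r in range(rows)] for c in range(cols)]
    quantized = [sign_quantize(col) for col in columns]
    # Assumed salience score: squared error the channel suffers when quantized.
    errors = [sum((a - b) ** 2 for a, b in zip(col, q))
              for col, q in zip(columns, quantized)]
    ranked = sorted(range(cols), key=lambda c: errors[c], reverse=True)
    salient = set(ranked[:num_salient])
    # Salient channels stay intact; all others use the low-bit version.
    mixed = [columns[c] if c in salient else quantized[c] for c in range(cols)]
    compressed = [[mixed[c][r] for c in range(cols)] for r in range(rows)]
    return compressed, sorted(salient)


def reconstruct(w_base, compressed_delta):
    """Approximate expert weights as base weights plus the compressed delta."""
    return [[b + d for b, d in zip(row_b, row_d)]
            for row_b, row_d in zip(w_base, compressed_delta)]


if __name__ == "__main__":
    base = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    expert = [[0.1, 1.0, -0.2], [0.1, -3.0, -0.2]]
    delta_hat, kept = compress_delta(base, expert, num_salient=1)
    print("salient input channels:", kept)
    print("reconstructed expert:", reconstruct(base, delta_hat))
```

In this toy setting only the middle input channel loses information under sign quantization, so keeping that single channel intact is enough to recover the expert weights. A real serving system would store one shared base model plus one such compressed delta per expert.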

Related papers

MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models [36.730689832979365]
MoTE is a scalable and memory-efficient approach to train Mixture-of-Ternary-Experts models from a dense checkpoint. MoTE achieves comparable performance to the full-precision baseline MoE-LLaVA while offering a lower memory footprint.
arXiv Detail & Related papers (2025-06-17T11:53:49Z)
MoLEx: Mixture of Layer Experts for Finetuning with Sparse Upcycling [2.1605931466490795]
Large-scale pre-training of deep models, followed by fine-tuning them, has become the cornerstone of natural language processing (NLP). In this paper, we study layers as extractors of different types of linguistic information that are valuable when used in conjunction. We propose the Mixture of Layer Experts (MoLEx), a novel sparse mixture of experts whose experts are layers in the pre-trained model.
arXiv Detail & Related papers (2025-03-14T07:22:07Z)
FTP: A Fine-grained Token-wise Pruner for Large Language Models via Token Routing [17.01412432658081]
Large language models (LLMs) have demonstrated superior performance across various tasks by adhering to scaling laws. We propose a fine-grained token-wise pruning approach for the LLMs, which presents a learnable router to adaptively identify the less important tokens. Our approach achieves state-of-the-art (SOTA) pruning results, surpassing other existing pruning methods.
arXiv Detail & Related papers (2024-12-16T07:09:46Z)
HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference. Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency. HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
PAT: Pruning-Aware Tuning for Large Language Models [19.622152991641045]
Large language models excel in language tasks, especially with supervised fine-tuning after pre-training. Traditional post-hoc pruning often leads to significant performance loss. We propose the Pruning-Aware Tuning (PAT) paradigm to eliminate model redundancy.
arXiv Detail & Related papers (2024-08-27T01:04:14Z)
A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts [49.394145046409044]
This paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models. We theoretically prove that prioritizing the pruning of the experts with a smaller change of the router's l2 norm from the pretrained model guarantees the preservation of test accuracy. Although our theoretical analysis is centered on binary classification tasks on a simplified MoE architecture, our expert pruning method is verified on large vision MoE models.
arXiv Detail & Related papers (2024-05-26T17:52:58Z)
Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes. This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z)
BitDelta: Your Fine-Tune May Only Be Worth One Bit [57.558376557639555]
Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x.
arXiv Detail & Related papers (2024-02-15T18:50:06Z)
Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness [10.196942053244468]
Large Mixture of Experts (MoE) models could achieve state-of-the-art quality on various language tasks. MoQE is a simple weight-only quantization method applying ultra low-bit down to 2-bit quantizations only to expert weights. We show that low-bit quantization together with the MoE architecture delivers a reliable model performance.
arXiv Detail & Related papers (2023-10-03T20:11:23Z)
EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models [3.597163516372061]
EdgeMoE is an on-device inference engine tailored for mixture-of-expert (MoE) LLMs. It achieves both memory and computational efficiency by strategically partitioning the model across the storage hierarchy. It demonstrates substantial memory savings and performance improvements when compared to competitive baseline solutions.
arXiv Detail & Related papers (2023-08-28T06:56:08Z)
FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment. We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs. We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z)
BASE Layers: Simplifying Training of Large, Sparse Models [53.98145464002843]
We introduce a new balanced assignment of experts (BASE) layer for large language models. Sparse layers can dramatically improve the efficiency of training and inference by routing each token to specialized expert modules. We formulate token-to-expert allocation as a linear assignment problem, allowing an optimal assignment in which each expert receives an equal number of tokens.
arXiv Detail & Related papers (2021-03-30T23:08:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.