Related papers: Mixture of Experts with Mixture of Precisions for Tuning Quality of Service

Mixture of Experts with Mixture of Precisions for Tuning Quality of Service

URL: http://arxiv.org/abs/2407.14417v1
Date: Fri, 19 Jul 2024 15:42:49 GMT
Title: Mixture of Experts with Mixture of Precisions for Tuning Quality of Service
Authors: HamidReza Imani, Abdolah Amirany, Tarek El-Ghazawi,
Abstract summary: This paper presents an adaptive serving approach for the efficient deployment of MoE models. By dynamically determining the number of quantized experts, we offer a fine-grained range of configurations for tuning throughput and model quality. Results highlight the practical applicability of our approach in dynamic and accuracy-sensitive applications.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The increasing demand for deploying large Mixture-of-Experts (MoE) models in resource-constrained environments necessitates efficient approaches to address their high memory and computational requirements challenges. Moreover, given that tasks come in different user-defined constraints and the available resources change over time in multi-tenant environments, it is necessary to design an approach which provides a flexible configuration space. This paper presents an adaptive serving approach for the efficient deployment of MoE models, capitalizing on partial quantization of the experts. By dynamically determining the number of quantized experts and their distribution across CPU and GPU, our approach explores the Pareto frontier and offers a fine-grained range of configurations for tuning throughput and model quality. Our evaluation on an NVIDIA A100 GPU using a Mixtral 8x7B MoE model for three language modelling benchmarks demonstrates that the throughput of token generation can be adjusted from 0.63 to 13.00 token per second. This enhancement comes with a marginal perplexity increase of 2.62 to 2.80, 6.48 to 7.24, and 3.24 to 3.53 for WikiText2, PTB, and C4 datasets respectively under maximum quantization. These results highlight the practical applicability of our approach in dynamic and accuracy-sensitive applications where both memory usage and output quality are important.

Related papers

EfficientLLM: Efficiency in Large Language Models [64.3537131208038]
Large Language Models (LLMs) have driven significant progress, yet their growing counts and context windows incur prohibitive compute, energy, and monetary costs.<n>We introduce EfficientLLM, a novel benchmark and the first comprehensive empirical study evaluating efficiency techniques for LLMs at scale.
arXiv Detail & Related papers (2025-05-20T02:27:08Z)
QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration [0.0]
A server with a single NVIDIA A100 GPU (80GB) using Mixtral-8x7B models demonstrate an 85% average reduction in turnaround time compared to NVIDIA's multi-instance GPU (MIG)<n> experiments on Google's Switch Transformer Base-8 model with up to four variants demonstrate the scalability and resilience of our approach in maintaining output quality compared to other model merging baselines, highlighting its effectiveness.
arXiv Detail & Related papers (2025-05-10T00:46:04Z)
MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design [41.7649957078564]
MxMoE is a mixed-precision optimization framework for Mixture-of-Experts (MoE) models.<n>MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations.
arXiv Detail & Related papers (2025-05-09T05:32:21Z)
DiffMoE: Dynamic Token Selection for Scalable Diffusion Transformers [86.5541501589166]
DiffMoE is a batch-level global token pool that enables experts to access global token distributions during training. It achieves state-of-the-art performance among diffusion models on ImageNet benchmark. The effectiveness of our approach extends beyond class-conditional generation to more challenging tasks such as text-to-image generation.
arXiv Detail & Related papers (2025-03-18T17:57:07Z)
Matryoshka Quantization [19.46665026740268]
We propose Matryoshka Quantization (MatQuant), a novel multi-scale quantization technique. MatQuant allows us to train and maintain a single quantized model but serve it with the precision demanded by the deployment.
arXiv Detail & Related papers (2025-02-10T18:59:10Z)
"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization [67.3213104337679]
We evaluate popular quantization formats across academic benchmarks and real-world tasks. We find that W4A16 offers the best costefficiency for synchronous deployments, and for asynchronous deployment on mid-tier architectures.
arXiv Detail & Related papers (2024-11-04T18:21:59Z)
MC-MoE: Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose MC-MoE, a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs) MC-MoE leverages the significance of both experts and tokens to achieve an extreme compression. For instance, at 2.54 bits, MC-MoE compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z)
GRIN: GRadient-INformed MoE [132.87651078514122]
Mixture-of-Experts (MoE) models scale more effectively than dense models due to sparse computation through expert routing. We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing. Our model, with only 6.6B activated parameters, outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
arXiv Detail & Related papers (2024-09-18T17:00:20Z)
QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts [47.01697456105496]
Mixture-of-Experts (MoE) is a promising way to scale up the learning capacity of large language models. MoE suffers from significant memory overheads due to the vast parameter size. Post-training quantization offers a powerful approach for model compression.
arXiv Detail & Related papers (2024-06-12T12:44:48Z)
Advancing LLM Reasoning Generalists with Preference Trees [119.57169648859707]
We introduce Eurus, a suite of large language models (LLMs) optimized for reasoning. Eurus models achieve state-of-the-art results among open-source models on a diverse set of benchmarks.
arXiv Detail & Related papers (2024-04-02T16:25:30Z)
Efficient Post-training Quantization with FP8 Formats [14.543387418837154]
We study the advantages of FP8 data formats for post-training quantization across 75 unique network architectures. E4M3 is better suited for NLP models, whereas E3M4 performs marginally better than E4M3 on computer vision tasks.
arXiv Detail & Related papers (2023-09-26T00:58:36Z)
SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference. We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit. Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z)
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training [13.346719319555943]
Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model. Current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. We present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism.
arXiv Detail & Related papers (2023-03-11T05:38:15Z)
ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers [29.566132632781848]
We present an efficient and affordable post-training quantization approach to compress large Transformer-based models, termed as ZeroQuant. ZeroQuant is an end-to-end quantization and inference pipeline with three main components.
arXiv Detail & Related papers (2022-06-04T00:28:21Z)
Generative Design of Hardware-aware DNNs [6.144349819246314]
We propose a new way for autonomous quantization and HW-aware tuning. A generative model, AQGAN, takes a target accuracy as the condition and generates a suite of quantization configurations. We evaluate our model on five of the widely-used efficient models on the ImageNet dataset.
arXiv Detail & Related papers (2020-06-06T20:39:25Z)
Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.