Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit
Quantization and Robustness
- URL: http://arxiv.org/abs/2310.02410v1
- Date: Tue, 3 Oct 2023 20:11:23 GMT
- Title: Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit
Quantization and Robustness
- Authors: Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla
- Abstract summary: Large Mixture of Experts (MoE) models could achieve state-of-the-art quality on various language tasks.
MoQE is a simple weight-only quantization method applying ultra low-bit down to 2-bit quantizations only to expert weights.
We show that low-bit quantization together with the MoE architecture delivers a reliable model performance.
- Score: 10.196942053244468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Mixture of Experts (MoE) models could achieve state-of-the-art quality
on various language tasks, including machine translation task, thanks to the
efficient model scaling capability with expert parallelism. However, it has
brought a fundamental issue of larger memory consumption and increased memory
bandwidth bottleneck at deployment time. In this paper, we propose Mixture of
Quantized Experts (MoQE) which is a simple weight-only quantization method
applying ultra low-bit down to 2-bit quantizations only to expert weights for
mitigating the increased memory and latency issues of MoE models. We show that
low-bit quantization together with the MoE architecture delivers a reliable
model performance while reducing the memory size significantly even without any
additional training in most cases. In particular, expert layers in MoE models
are much more robust to the quantization than conventional feedforward networks
(FFN) layers. In our comprehensive analysis, we show that MoE models with 2-bit
expert weights can deliver better model performance than the dense model
trained on the same dataset. As a result of low-bit quantization, we show the
model size can be reduced by 79.6% of the original half precision floating
point (fp16) MoE model. Combined with an optimized GPU runtime implementation,
it also achieves 1.24X speed-up on A100 GPUs.
Related papers
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB)
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - Examining Post-Training Quantization for Mixture-of-Experts: A Benchmark [46.72960840801211]
Mixture-of-Experts(MoE) approach offers a promising way to scale Large Language Models(LLMs)
MoE suffers from significant memory overheads, necessitating model compression techniques.
This paper explores several MoE structure-aware quantizations, ranging from coarse to fine granularity, from MoE block to individual linear weight.
arXiv Detail & Related papers (2024-06-12T12:44:48Z) - decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points [10.238677144792279]
decoupleQ abandons the traditional quantization paradigm and decouples the model parameters into integer and floating-point parts.
Our method has achieved well on-line accuracy near fp16/bf16 on the 2-bit quantization of large speech models in ByteDance.
arXiv Detail & Related papers (2024-04-19T10:02:53Z) - Multilinear Mixture of Experts: Scalable Expert Specialization through Factorization [51.98792406392873]
Mixture of Experts (MoE) provides a powerful way to decompose dense layers into smaller, modular computations.
A major challenge lies in the computational cost of scaling the number of experts high enough to achieve fine-grained specialization.
We propose the Multilinear Mixture of Experts ($mu$MoE) layer to address this, focusing on vision models.
arXiv Detail & Related papers (2024-02-19T21:20:22Z) - FineQuant: Unlocking Efficiency with Fine-Grained Weight-Only
Quantization for LLMs [9.072821427818557]
Large Language Models (LLMs) have achieved state-of-the-art performance across various language tasks but pose challenges for practical deployment.
We propose an efficient weight-only quantization method that reduces memory consumption and accelerates inference for LLMs.
We evaluate our approach on large-scale open source models such as OPT-175B and internal MoE models, showcasing minimal accuracy loss while achieving up to 3.65 times higher throughput.
arXiv Detail & Related papers (2023-08-16T23:57:41Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
Main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Vertical Layering of Quantized Neural Networks for Heterogeneous
Inference [57.42762335081385]
We study a new vertical-layered representation of neural network weights for encapsulating all quantized models into a single one.
We can theoretically achieve any precision network for on-demand service while only needing to train and maintain one model.
arXiv Detail & Related papers (2022-12-10T15:57:38Z) - Who Says Elephants Can't Run: Bringing Large Scale MoE Models into Cloud
Scale Production [7.056223012587321]
We introduce a highly efficient inference framework with several optimization approaches to accelerate the computation of sparse models.
We are able to deploy 136x larger models with 27% less cost and significantly better quality compared to the existing solutions.
arXiv Detail & Related papers (2022-11-18T03:43:52Z) - Once Quantization-Aware Training: High Performance Extremely Low-bit
Architecture Search [112.05977301976613]
We propose to combine Network Architecture Search methods with quantization to enjoy the merits of the two sides.
We first propose the joint training of architecture and quantization with a shared step size to acquire a large number of quantized models.
Then a bit-inheritance scheme is introduced to transfer the quantized models to the lower bit, which further reduces the time cost and improves the quantization accuracy.
arXiv Detail & Related papers (2020-10-09T03:52:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.