Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference
- URL: http://arxiv.org/abs/2511.15015v2
- Date: Mon, 24 Nov 2025 00:36:49 GMT
- Title: Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference
- Authors: Kexin Chu, Dawei Xiang, Zixu Shen, Yiwei Yang, Zecheng Liu, Wei Zhang,
- Abstract summary: We present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. We show that DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines.
- Score: 2.649774320778185
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. We therefore present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets. Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.
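To make component (1) concrete, the sketch below shows one way a hotness-aware precision controller could work: an exponential moving average of router decisions drives promotion and demotion between two bit-widths. All names, thresholds, and the EMA policy are illustrative assumptions, not DynaExq's actual implementation.

```python
# Minimal sketch of a hotness-aware precision controller in the spirit of
# DynaExq. Thresholds, bit levels, and the EMA policy are assumptions.
from dataclasses import dataclass, field

@dataclass
class PrecisionController:
    num_experts: int
    decay: float = 0.99        # EMA decay over long-term activation statistics
    promote_at: float = 0.05   # hotness above this -> promote to high precision
    demote_at: float = 0.01    # hotness below this -> demote to low precision
    hotness: list = field(default_factory=list)
    bits: dict = field(default_factory=dict)

    def __post_init__(self):
        self.hotness = [0.0] * self.num_experts
        self.bits = {e: 4 for e in range(self.num_experts)}  # everyone starts low

    def observe(self, routed_experts):
        """Fold one batch of router decisions into the hotness EMA."""
        counts = [0.0] * self.num_experts
        for e in routed_experts:
            counts[e] += 1.0
        total = max(len(routed_experts), 1)
        for e in range(self.num_experts):
            self.hotness[e] = (self.decay * self.hotness[e]
                               + (1.0 - self.decay) * counts[e] / total)

    def plan_transitions(self):
        """Return (expert, new_bits) pairs; in a real system the weight
        conversion would run asynchronously, overlapped with MoE compute."""
        plan = []
        for e in range(self.num_experts):
            if self.bits[e] == 4 and self.hotness[e] > self.promote_at:
                plan.append((e, 16))
            elif self.bits[e] == 16 and self.hotness[e] < self.demote_at:
                plan.append((e, 4))
        for e, b in plan:
            self.bits[e] = b
        return plan
```

A serving loop would call `observe` on each batch and hand the `plan_transitions` output to the asynchronous switching pipeline, so decode steps never block on a precision change.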
Related papers
- Advancing Model Refinement: Muon-Optimized Distillation and Quantization for LLM Deployment [1.0858059444801136]
Large Language Models (LLMs) enable advanced natural language processing but face deployment challenges on resource-constrained edge devices. We propose an integrated framework combining GPTQ-based quantization, low-rank adaptation (LoRA), and a specialized data distillation process.
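One step such a pipeline needs is folding the trained LoRA adapter back into the base weights before quantization, so the quantizer sees the matrix that will actually be deployed. A minimal sketch, with illustrative shapes and the common alpha/rank scaling; names here are assumptions, not this paper's API.

```python
import numpy as np

# Merge a LoRA adapter into a base weight matrix prior to post-training
# quantization (e.g., GPTQ). Shapes and scaling follow the usual LoRA
# convention; all names are illustrative.
def merge_lora(w_base, lora_a, lora_b, alpha=16.0, rank=8):
    # w_base: (out, in), lora_a: (rank, in), lora_b: (out, rank)
    return w_base + (alpha / rank) * (lora_b @ lora_a)
```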
arXiv Detail & Related papers (2026-01-14T20:50:30Z)
- MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning.
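The dynamic expert-pruning half of such a framework can be pictured as a router-side filter: experts whose routing probability falls below a floor are skipped for the current token. The threshold, top-k, and renormalization below are assumptions for illustration, not MC#'s exact rule.

```python
import numpy as np

# Sketch of dynamic expert pruning at the router: keep at most top_k experts,
# drop those below a probability floor, and renormalize over the survivors.
def prune_and_route(router_logits, top_k=4, min_prob=0.02):
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[::-1][:top_k]
    kept = [int(e) for e in top if probs[e] >= min_prob] or [int(top[0])]
    weights = probs[kept] / probs[kept].sum()   # renormalize over survivors
    return list(zip(kept, weights.tolist()))    # (expert_id, gate_weight)
```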
arXiv Detail & Related papers (2025-10-13T03:12:46Z)
- The Curious Case of In-Training Compression of State Space Models [49.819321766705514]
State Space Models (SSMs) tackle long sequence modeling tasks efficiently, offering both parallelizable training and fast inference. A key design challenge is striking the right balance between maximizing expressivity and limiting the computational burden. Our approach, CompreSSM, applies to Linear Time-Invariant SSMs such as Linear Recurrent Units, but is also extendable to selective models.
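For a diagonal LTI SSM, in-training compression can be sketched as mode truncation: score each state dimension by a rough energy proxy and drop the weakest. The proxy below is an assumption for illustration; the paper's actual criterion may differ.

```python
import numpy as np

# Truncate a diagonal LTI SSM x_{t+1} = diag(lam) x_t + B u_t, y_t = C x_t
# by keeping the modes with the largest (heuristic) energy contribution.
def truncate_modes(lam, B, C, keep):
    score = np.abs(B) * np.abs(C) / (1.0 - np.abs(lam) + 1e-8)  # energy proxy
    idx = np.argsort(score)[::-1][:keep]
    return lam[idx], B[idx], C[idx]
```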
arXiv Detail & Related papers (2025-10-03T09:02:33Z)
- PT$^2$-LLM: Post-Training Ternarization for Large Language Models [52.4629647715623]
Large Language Models (LLMs) have shown impressive capabilities across diverse tasks, but their large memory and compute demands hinder deployment. We propose PT$^2$-LLM, a post-training ternarization framework tailored for LLMs. At its core is an Asymmetric Ternary Quantizer equipped with a two-stage refinement pipeline.
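An asymmetric ternary quantizer can be sketched as mapping each weight to {-s_neg, 0, +s_pos}, with separate scales for the positive and negative sides. The magnitude threshold and scale rule below are assumptions, not PT$^2$-LLM's exact two-stage refinement.

```python
import numpy as np

# Asymmetric ternarization sketch: small weights go to zero; the remaining
# positive and negative weights each get their own mean-magnitude scale.
def ternarize(w, sparsity=0.5):
    thresh = np.quantile(np.abs(w), sparsity)     # zero out the smallest half
    mask_pos, mask_neg = w > thresh, w < -thresh
    s_pos = w[mask_pos].mean() if mask_pos.any() else 0.0
    s_neg = (-w[mask_neg]).mean() if mask_neg.any() else 0.0
    w_q = np.zeros_like(w)
    w_q[mask_pos] = s_pos
    w_q[mask_neg] = -s_neg
    return w_q    # values in {-s_neg, 0, +s_pos}
```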
arXiv Detail & Related papers (2025-09-27T03:01:48Z)
- FlexQuant: A Flexible and Efficient Dynamic Precision Switching Framework for LLM Quantization [19.12288373558071]
We propose FlexQuant, a dynamic precision-switching framework to optimize the trade-off between inference speed and accuracy. We show that FlexQuant achieves a 1.3x end-to-end speedup across diverse language tasks with negligible accuracy loss.
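A precision-switching policy of this kind can be reduced to a small decision rule: keep several pre-quantized variants and pick the highest precision whose estimated latency fits the budget. The variant set and cost model below are illustrative assumptions, not FlexQuant's actual policy.

```python
# Sketch of a latency-aware precision selector. Relative decode costs are
# made-up numbers; a real system would measure them per hardware target.
VARIANTS = {4: 1.0, 8: 1.6, 16: 2.9}   # bits -> relative decode cost

def choose_bits(latency_budget_ms, tokens, per_token_ms_fp16=30.0):
    for bits in sorted(VARIANTS, reverse=True):      # prefer higher precision
        est = tokens * per_token_ms_fp16 * VARIANTS[bits] / VARIANTS[16]
        if est <= latency_budget_ms:
            return bits
    return min(VARIANTS)    # budget too tight: fall back to lowest precision
```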
arXiv Detail & Related papers (2025-05-21T07:42:53Z)
- Quaff: Quantized Parameter-Efficient Fine-Tuning under Outlier Spatial Stability Hypothesis [9.884521812433661]
Quaff is a quantized parameter-efficient fine-tuning framework for large language models. It suppresses outliers exclusively in invariant channels using lightweight operations. It achieves a 1.73x latency reduction and 30% memory savings over full-precision fine-tuning.
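The "spatial stability" idea is that the same channels stay outliers across inputs, so they can be identified once offline and handled with a cheap per-channel operation at runtime. A minimal sketch, with an assumed top-k rule and rescaling standing in for the lightweight suppression step:

```python
import numpy as np

# Identify stable outlier channels from calibration data, then shrink them
# before low-bit quantization. k and the scale factor are illustrative.
def find_outlier_channels(calib_acts, k=8):
    mags = np.abs(calib_acts).max(axis=0)     # calib_acts: (tokens, channels)
    return np.argsort(mags)[::-1][:k]

def suppress(x, outlier_idx, scale=0.125):
    x = x.copy()
    x[:, outlier_idx] *= scale                # cheap per-channel rescale
    return x
```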
arXiv Detail & Related papers (2025-05-20T07:19:36Z)
- D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving [14.607254882119507]
The Mixture-of-Experts (MoE) model is a sparse variant of large language models (LLMs). Despite its benefits, MoE is still too expensive to deploy on resource-constrained edge devices. We propose D$^{2}$MoE, an algorithm-system co-design framework that matches diverse task requirements by dynamically allocating the most appropriate bit-width to each expert.
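Per-expert bit-width allocation under a memory budget can be sketched greedily: start every expert at the lowest precision and upgrade the most frequently routed ones while the budget lasts. The levels and accounting below are illustrative assumptions, not D$^{2}$MoE's actual allocator.

```python
# Greedy bit-width allocation sketch: hotter experts get upgraded first.
def allocate_bits(expert_hotness, params_per_expert, budget_bytes,
                  levels=(2, 4, 8)):
    n = len(expert_hotness)
    bits = {e: levels[0] for e in range(n)}
    used = n * params_per_expert * levels[0] / 8          # bytes at baseline
    order = sorted(bits, key=lambda e: expert_hotness[e], reverse=True)
    for level in levels[1:]:
        for e in order:
            extra = params_per_expert * (level - bits[e]) / 8
            if bits[e] < level and used + extra <= budget_bytes:
                bits[e] = level
                used += extra
    return bits    # expert id -> assigned bit-width
```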
arXiv Detail & Related papers (2025-04-17T05:37:35Z)
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed-precision expert offloading system that enables flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
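The cache-miss insight reduces to a two-path lookup: serve resident full-precision weights when available, and fall back to a small low-precision copy instead of stalling on a full transfer from host memory. The cache and store interfaces below are hypothetical.

```python
# Sketch of mixed-precision expert fetch: hit -> full precision from HBM,
# miss -> cheap low-bit copy (kept resident or much faster to load).
def get_expert_weights(expert_id, hbm_cache, low_bit_store):
    w = hbm_cache.get(expert_id)          # fast path: already on the GPU
    if w is not None:
        return w, 16
    return low_bit_store[expert_id], 4    # miss: avoid blocking on full load
```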
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
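Group-wise salience-driven allocation can be sketched as scoring fixed-size weight groups and giving the most salient ones more bits under a fixed average-bit budget. The magnitude proxy and two-level scheme below are assumptions; the paper's salience measure is more involved.

```python
import numpy as np

# Assign 2- or 4-bit widths to groups of 128 weights so the average hits
# avg_bits. Assumes w.size is a multiple of group_size.
def assign_group_bits(w, group_size=128, avg_bits=3, low=2, high=4):
    groups = w.reshape(-1, group_size)
    salience = np.abs(groups).sum(axis=1)            # crude salience proxy
    n_high = int(len(groups) * (avg_bits - low) / (high - low))
    bits = np.full(len(groups), low)
    bits[np.argsort(salience)[::-1][:n_high]] = high
    return bits    # one bit-width per contiguous group
```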
arXiv Detail & Related papers (2024-05-23T16:21:48Z)
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
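Idea (ii) is easy to picture in code: peel the largest-magnitude weights into a sparse full-precision matrix and quantize only the dense remainder. The sketch below uses a uniform grid for brevity where the paper uses sensitivity-based non-uniform levels; the outlier fraction is also an assumption.

```python
import numpy as np
from scipy import sparse

# Dense-and-Sparse decomposition sketch: w ~= quantize(dense) + sparse outliers.
def dense_and_sparse(w, outlier_frac=0.005, bits=3):
    cutoff = np.quantile(np.abs(w), 1.0 - outlier_frac)
    outliers = np.where(np.abs(w) > cutoff, w, 0.0)   # few large weights
    dense = w - outliers                              # small-magnitude rest
    levels = 2 ** bits
    lo, hi = dense.min(), dense.max()
    step = (hi - lo) / (levels - 1) or 1.0            # guard degenerate case
    dense_q = np.round((dense - lo) / step) * step + lo
    return dense_q, sparse.csr_matrix(outliers)       # uniform grid + CSR outliers
```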
arXiv Detail & Related papers (2023-06-13T08:57:54Z)