Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation
- URL: http://arxiv.org/abs/2512.17073v1
- Date: Thu, 18 Dec 2025 21:15:54 GMT
- Title: Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation
- Authors: Zhenyu Liu, Yunzhen Liu, Zehao Fan, Garrett Gagnon, Yayue Hou, Nan Wu, Yangwook Kang, Liu Liu,
- Abstract summary: We present Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation. Our method delivers a superior bandwidth-accuracy trade-off and improved throughput.
- Score: 11.078693613992556
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) models scale capacity via sparse activation but stress memory and bandwidth. Offloading alleviates GPU memory by fetching experts on demand, yet token-level routing causes irregular transfers that make inference I/O-bound. Static uniform quantization reduces traffic but degrades accuracy under aggressive compression by ignoring expert heterogeneity. We present Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation, which performs router-guided precision restoration using precomputed low-rank compensators. At inference time, our method transfers compact low-rank factors with Top-n (n<k) experts per token and applies compensation to them, keeping others low-bit. Integrated with offloading on GPU and GPU-NDP systems, our method delivers a superior bandwidth-accuracy trade-off and improved throughput.
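The idea of router-guided precision restoration described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the compensators are computed, so everything below is an assumption made for illustration: symmetric per-tensor uniform quantization, compensators obtained by truncated SVD of the quantization residual, a single weight matrix per expert, and the hypothetical names `quantize_uniform`, `low_rank_compensator`, and `moe_forward`. It is not the authors' implementation.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric per-tensor uniform quantization (assumed scheme); returns dequantized weights."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale).clip(-levels, levels) * scale

def low_rank_compensator(w, w_q, rank):
    """Precompute factors (a, b) with a @ b approximating the residual w - w_q.

    Truncated SVD is an assumption here; the source only says "low-rank compensators".
    """
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # shape (out_dim, rank)
    b = vt[:rank]                # shape (rank, in_dim)
    return a, b

def moe_forward(x, router_logits, experts_q, compensators, k, n):
    """Single-token forward pass: route to top-k experts, compensate only the top-n (n <= k)."""
    top_k = np.argsort(router_logits)[::-1][:k]
    gate = np.exp(router_logits[top_k] - router_logits[top_k].max())
    gate /= gate.sum()           # softmax over the selected experts
    out = np.zeros(experts_q[0].shape[0])
    for pos, e in enumerate(top_k):
        y = experts_q[e] @ x     # low-bit expert output
        if pos < n:              # highest-scoring experts get precision restored
            a, b = compensators[e]
            y = y + a @ (b @ x)  # apply factors without forming a @ b
        out += gate[pos] * y
    return out
```

In this sketch, only the factors `a` and `b` would need to be transferred alongside the low-bit weights for the top-n experts, which is where the bandwidth saving would come from if `rank` is small relative to the matrix dimensions.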
Related papers
- Squeezing More from the Stream : Learning Representation Online for Streaming Reinforcement Learning [14.799267729619428]
In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. We show that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer.
arXiv Detail & Related papers (2026-02-10T04:06:32Z) - Generalizing GNNs with Tokenized Mixture of Experts [75.8310720413187]
We show that improving stability requires reducing reliance on shift-sensitive features, leaving an irreducible worst-case generalization floor. We propose STEM-GNN, a pretrain-then-finetune framework with a mixture-of-experts encoder for diverse computation paths. Across nine node, link, and graph benchmarks, STEM-GNN achieves a stronger three-way balance, improving robustness to degree/homophily shifts and to feature/edge corruptions while remaining competitive on clean graphs.
arXiv Detail & Related papers (2026-02-09T22:48:30Z) - Bandwidth-constrained Variational Message Encoding for Cooperative Multi-agent Reinforcement Learning [23.866517021196724]
Graph-based multi-agent reinforcement learning (MARL) enables coordinated behavior under partial observability by modeling agents as nodes and communication links as edges. We study this bandwidth-limited regime and show that naive dimensionality reduction consistently degrades coordination performance. We introduce Bandwidth-constrained Variational Message Encoding (BVME), a lightweight module that treats messages as samples from Gaussians regularized via KL divergence to an uninformative prior. BVME achieves comparable or superior performance while using 67-83% fewer message dimensions, with gains most pronounced on sparse graphs where message quality critically impacts coordination.
arXiv Detail & Related papers (2025-12-11T23:56:43Z) - Prediction-Powered Communication with Distortion Guarantees [65.37485275954224]
We study a prediction-powered communication setting, in which devices communicate under zero-delay constraints with strict distortion guarantees. We propose two zero-delay compression algorithms leveraging online conformal prediction to provide per-sequence guarantees on the distortion of reconstructed sequences. Experiments on semantic text compression validate the approach, showing significant bit rate reductions.
arXiv Detail & Related papers (2025-09-29T07:19:39Z) - QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification [67.15451442018258]
Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. We propose QuantSparse, a unified framework that integrates model quantization with attention sparsification.
arXiv Detail & Related papers (2025-09-28T06:49:44Z) - The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks [56.37880529653111]
The demand for large AI model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications. In this paper, we investigate the LAIM-inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment.
arXiv Detail & Related papers (2025-05-14T08:18:55Z) - Diffusion-Driven Semantic Communication for Generative Models with Bandwidth Constraints [66.63250537475973]
This paper introduces a diffusion-driven semantic communication framework with advanced VAE-based compression for bandwidth-constrained generative models. Our experimental results demonstrate significant improvements in pixel-level metrics like peak signal-to-noise ratio (PSNR) and semantic metrics like learned perceptual image patch similarity (LPIPS).
arXiv Detail & Related papers (2024-07-26T02:34:25Z) - Communication-Efficient Distributed Learning with Local Immediate Error Compensation [95.6828475028581]
We propose the Local Immediate Error Compensated SGD (LIEC-SGD) optimization algorithm.
LIEC-SGD is superior to previous works in either the convergence rate or the communication cost.
arXiv Detail & Related papers (2024-02-19T05:59:09Z) - Retraining-free Model Quantization via One-Shot Weight-Coupling Learning [41.299675080384]
Mixed-precision quantization (MPQ) is advocated to compress the model effectively by allocating heterogeneous bit-width for layers.
MPQ is typically organized into a searching-retraining two-stage process.
In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression.
arXiv Detail & Related papers (2024-01-03T05:26:57Z) - Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks [53.23803932357899]
Quantization leads to accuracy loss in image super-resolution (SR) networks.
Existing works address this distribution mismatch problem by dynamically adapting quantization ranges during test time.
We propose a new quantization-aware training scheme that effectively Overcomes the Distribution Mismatch problem in SR networks.
arXiv Detail & Related papers (2023-07-25T08:50:01Z) - M22: A Communication-Efficient Algorithm for Federated Learning Inspired by Rate-Distortion [19.862336286338564]
In federated learning, model updates must be compressed so as to minimize the loss in accuracy resulting from a communication constraint.
This paper proposes the "M-magnitude weighted $L_2$ distortion + 2 degrees of freedom" (M22) algorithm, a rate-distortion inspired approach to gradient compression.
arXiv Detail & Related papers (2023-01-23T04:40:01Z) - Compression-aware Projection with Greedy Dimension Reduction for Convolutional Neural Network Activations [3.6188659868203388]
We propose a compression-aware projection system to improve the trade-off between classification accuracy and compression ratio.
Our test results show that the proposed methods reduce memory access by 2.91x to 5.97x with negligible accuracy drop on MobileNetV2/ResNet18/VGG16.
arXiv Detail & Related papers (2021-10-17T14:02:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.