Related papers: Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation

URL: http://arxiv.org/abs/2512.17073v1
Date: Thu, 18 Dec 2025 21:15:54 GMT
Title: Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation
Authors: Zhenyu Liu, Yunzhen Liu, Zehao Fan, Garrett Gagnon, Yayue Hou, Nan Wu, Yangwook Kang, Liu Liu,
Abstract summary: We present Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation. Our method delivers a superior bandwidth-accuracy trade-off and improved throughput.
Score: 11.078693613992556
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Mixture-of-Experts (MoE) models scale capacity via sparse activation but stress memory and bandwidth. Offloading alleviates GPU memory by fetching experts on demand, yet token-level routing causes irregular transfers that make inference I/O-bound. Static uniform quantization reduces traffic but degrades accuracy under aggressive compression by ignoring expert heterogeneity. We present Bandwidth-Efficient Adaptive Mixture-of-Experts via Low-Rank Compensation, which performs router-guided precision restoration using precomputed low-rank compensators. At inference time, our method transfers compact low-rank factors with Top-n (n<k) experts per token and applies compensation to them, keeping others low-bit. Integrated with offloading on GPU and GPU-NDP systems, our method delivers a superior bandwidth-accuracy trade-off and improved throughput.
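
The compensation idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how the compensators are computed, so this sketch assumes a truncated SVD of the quantization residual, a simple symmetric uniform quantizer, and experts modeled as single weight matrices. All function names (`quantize_uniform`, `low_rank_compensator`, `moe_forward`) are hypothetical.

```python
import numpy as np

def quantize_uniform(w, bits):
    """Symmetric per-tensor uniform quantization (assumed scheme), returned dequantized."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / levels
    return np.round(w / scale).clip(-levels, levels) * scale

def low_rank_compensator(w, w_q, rank):
    """Rank-`rank` factors (a, b) with a @ b approximating the residual w - w_q.

    Assumption: a truncated SVD stands in for the paper's precomputation step.
    """
    u, s, vt = np.linalg.svd(w - w_q, full_matrices=False)
    a = u[:, :rank] * s[:rank]   # shape (out_dim, rank)
    b = vt[:rank, :]             # shape (rank, in_dim)
    return a, b

def moe_forward(x, router_logits, experts_q, compensators, k, n):
    """Route one token to its top-k experts; compensate only the top-n (n <= k)."""
    top = np.argsort(router_logits)[::-1][:k]          # expert ids, best first
    gates = np.exp(router_logits[top] - router_logits[top].max())
    gates /= gates.sum()                               # softmax over the selected experts
    y = np.zeros(experts_q[0].shape[0])
    for pos, (e, g) in enumerate(zip(top, gates)):
        out = experts_q[e] @ x                         # low-bit expert output
        if pos < n:                                    # restore precision for top-n only
            a, b = compensators[e]
            out = out + a @ (b @ x)
        y += g * out
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    experts = [rng.standard_normal((4, 6)) for _ in range(4)]
    experts_q = [quantize_uniform(w, bits=3) for w in experts]
    comps = [low_rank_compensator(w, wq, rank=1) for w, wq in zip(experts, experts_q)]
    x = rng.standard_normal(6)
    logits = rng.standard_normal(4)
    print(moe_forward(x, logits, experts_q, comps, k=2, n=1))
```

In this sketch the two factors of an expert with weight shape (out_dim, in_dim) hold rank * (out_dim + in_dim) values, which is far fewer than the out_dim * in_dim values of the full matrix when the rank is small. That is the sense in which transferring the factors for only the top-n experts can save bandwidth.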

Related papers

Squeezing More from the Stream : Learning Representation Online for Streaming Reinforcement Learning [14.799267729619428]
In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. We show that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer.
arXiv Detail & Related papers (2026-02-10T04:06:32Z)
Generalizing GNNs with Tokenized Mixture of Experts [75.8310720413187]
We show that improving stability requires reducing reliance on shift-sensitive features, leaving an irreducible worst-case generalization floor. We propose STEM-GNN, a pretrain-then-finetune framework with a mixture-of-experts encoder for diverse computation paths. Across nine node, link, and graph benchmarks, STEM-GNN achieves a stronger three-way balance, improving robustness to degree/homophily shifts and to feature/edge corruptions while remaining competitive on clean graphs.
arXiv Detail & Related papers (2026-02-09T22:48:30Z)
Bandwidth-constrained Variational Message Encoding for Cooperative Multi-agent Reinforcement Learning [23.866517021196724]
Graph-based multi-agent reinforcement learning (MARL) enables coordinated behavior under partial observability by modeling agents as nodes and communication links as edges. We study this bandwidth-limited regime and show that naive dimensionality reduction consistently degrades coordination performance. We introduce Bandwidth-constrained Variational Message Encoding (BVME), a lightweight module that treats messages as samples from Gaussians regularized via KL divergence to an uninformative prior. BVME achieves comparable or superior performance while using 67-83% fewer message dimensions, with gains most pronounced on sparse graphs where message quality critically impacts coordination.
arXiv Detail & Related papers (2025-12-11T23:56:43Z)
Prediction-Powered Communication with Distortion Guarantees [65.37485275954224]
We study a prediction-powered communication setting, in which devices communicate under zero-delay constraints with strict distortion guarantees. We propose two zero-delay compression algorithms leveraging online conformal prediction to provide per-sequence guarantees on the distortion of reconstructed sequences. Experiments on semantic text compression validate the approach, showing significant bit rate reductions.
arXiv Detail & Related papers (2025-09-29T07:19:39Z)
QuantSparse: Comprehensively Compressing Video Diffusion Transformer with Model Quantization and Attention Sparsification [67.15451442018258]
Diffusion transformers exhibit remarkable video generation capability, yet their prohibitive computational and memory costs hinder practical deployment. Model quantization and attention sparsification are two promising directions for compression, but each alone suffers severe performance degradation under aggressive compression. We propose QuantSparse, a unified framework that integrates model quantization with attention sparsification.
arXiv Detail & Related papers (2025-09-28T06:49:44Z)
The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks [56.37880529653111]
The demand for large AI model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications. In this paper, we investigate the LAIM-inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment.
arXiv Detail & Related papers (2025-05-14T08:18:55Z)
Diffusion-Driven Semantic Communication for Generative Models with Bandwidth Constraints [66.63250537475973]
This paper introduces a diffusion-driven semantic communication framework with advanced VAE-based compression for bandwidth-constrained generative models. Our experimental results demonstrate significant improvements in pixel-level metrics like peak signal-to-noise ratio (PSNR) and semantic metrics like learned perceptual image patch similarity (LPIPS).
arXiv Detail & Related papers (2024-07-26T02:34:25Z)
Communication-Efficient Distributed Learning with Local Immediate Error Compensation [95.6828475028581]
We propose the Local Immediate Error Compensated SGD (LIEC-SGD) optimization algorithm. LIEC-SGD is superior to previous works in either the convergence rate or the communication cost.
arXiv Detail & Related papers (2024-02-19T05:59:09Z)
Retraining-free Model Quantization via One-Shot Weight-Coupling Learning [41.299675080384]
Mixed-precision quantization (MPQ) is advocated to compress the model effectively by allocating heterogeneous bit-width for layers. MPQ is typically organized into a searching-retraining two-stage process. In this paper, we devise a one-shot training-searching paradigm for mixed-precision model compression.
arXiv Detail & Related papers (2024-01-03T05:26:57Z)
Overcoming Distribution Mismatch in Quantizing Image Super-Resolution Networks [53.23803932357899]
Quantization leads to accuracy loss in image super-resolution (SR) networks. Existing works address this distribution mismatch problem by dynamically adapting quantization ranges during test time. We propose a new quantization-aware training scheme that effectively overcomes the distribution mismatch problem in SR networks.
arXiv Detail & Related papers (2023-07-25T08:50:01Z)
M22: A Communication-Efficient Algorithm for Federated Learning Inspired by Rate-Distortion [19.862336286338564]
In federated learning, model updates must be compressed so as to minimize the loss in accuracy resulting from a communication constraint. This paper proposes the "M-magnitude weighted L_2 distortion + 2 degrees of freedom" (M22) algorithm, a rate-distortion inspired approach to gradient compression.
arXiv Detail & Related papers (2023-01-23T04:40:01Z)
Compression-aware Projection with Greedy Dimension Reduction for Convolutional Neural Network Activations [3.6188659868203388]
We propose a compression-aware projection system to improve the trade-off between classification accuracy and compression ratio. Our test results show that the proposed methods reduce memory access by 2.91x to 5.97x with negligible accuracy drop on MobileNetV2/ResNet18/VGG16.
arXiv Detail & Related papers (2021-10-17T14:02:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.