Related papers: CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge

CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge

URL: http://arxiv.org/abs/2508.09208v1
Date: Sun, 10 Aug 2025 14:05:36 GMT
Title: CoMoE: Collaborative Optimization of Expert Aggregation and Offloading for MoE-based LLMs at Edge
Authors: Muqing Li, Ning Li, Xin Yuan, Wenchao Xu, Quan Chen, Song Guo, Haijun Zhang,
Abstract summary: Large language models (LLMs) have driven the adoption of Mixture-of-Experts (MoE) architectures as a promising solution to scale model capacity while controlling computational costs.<n> deploying MoE models in resource-constrained mobile edge computing environments presents significant challenges due to their large memory footprint and dynamic expert activation patterns.<n>We propose a novel dynamic resource-aware collaborative optimization framework that jointly optimize expert aggregation and offloading strategies.
Score: 28.961186929739082
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The proliferation of large language models (LLMs) has driven the adoption of Mixture-of-Experts (MoE) architectures as a promising solution to scale model capacity while controlling computational costs. However, deploying MoE models in resource-constrained mobile edge computing environments presents significant challenges due to their large memory footprint and dynamic expert activation patterns. To address these challenges, we propose a novel dynamic resource-aware collaborative optimization framework that jointly optimizes expert aggregation granularity and offloading strategies based on real-time device resource states, network conditions, and input characteristics in mobile edge environments, denoted as CoMoE. In CoMoE, we first systematically analyze existing expert aggregation techniques, including expert parameter merging,knowledge distillation,and parameter sharing decomposition, identifying their limitations in dynamic mobile environments.We then investigate expert offloading strategies encompassing expert prediction and prefetching, expert caching and scheduling, and multi-tier storage architectures, revealing the interdependencies between routing decisions and offloading performance.The CoMoE incorporates adaptive scheduling mechanisms that respond to user mobility and varying network conditions, enabling efficient MoE deployment across heterogeneous edge devices. Extensive experiments on real mobile edge testbeds demonstrate that CoMoE achieves approximately 70% reduction in memory usage compared to baseline methods, 10.5% lower inference latency than existing expert offloading techniques, while maintaining model performance stability. For large-scale MoE models (e.g,7.4B-parameter Switch-Base-128), the CoMoE reduces memory requirements from 15.6GB to 4.7GB, enabling deployment on resource-constrained mobile edge devices that previously could only support much smaller models.

Related papers

MoE-Prism: Disentangling Monolithic Experts for Elastic MoE Services via Model-System Co-Designs [17.827406818899536]
MoE-Prism is a model-system co-design that transforms rigid MoE models into elastic services.<n>Our evaluation shows that MoE-Prismprovides over 4 times more distinct, stable operating points than the baseline.<n>This allows an AI service to dynamically improve throughput by up to 19.9% under a strict budget or reduce latency by up to 10.36% under limited resources.
arXiv Detail & Related papers (2025-10-22T08:40:01Z)
MOBIUS: Big-to-Mobile Universal Instance Segmentation via Multi-modal Bottleneck Fusion and Calibrated Decoder Pruning [91.90342432541138]
Scaling up model size and training data has advanced foundation models for instance-level perception.<n>High computational cost limits adoption on resource-constrained platforms.<n>We introduce a new benchmark for efficient segmentation on both high-performance computing platforms and mobile devices.
arXiv Detail & Related papers (2025-10-16T18:00:00Z)
CollaPipe: Adaptive Segment-Optimized Pipeline Parallelism for Collaborative LLM Training in Heterogeneous Edge Networks [57.95170323315603]
We introduce CollaPipe, a distributed learning framework that integrates collaborative pipeline parallelism with federated aggregation to support self-evolving networks.<n>In CollaPipe, the encoder part is adaptively partitioned into variable-sized segments and deployed across mobile devices for pipeline-parallel training, while the decoder is deployed on edge servers to handle generative tasks.<n>To enhance training efficiency, we formulate a joint optimization problem that adaptively allocates model segments, micro-batches, bandwidth, and transmission power.
arXiv Detail & Related papers (2025-09-24T07:54:01Z)
Deploying Large AI Models on Resource-Limited Devices with Split Federated Learning [39.73152182572741]
This paper proposes a novel framework, named Quantized Split Federated Fine-Tuning Large AI Model (SFLAM)<n>By partitioning the training load between edge devices and servers, SFLAM can facilitate the operation of large models on devices.<n>SFLAM incorporates quantization management, power control, and bandwidth allocation strategies to enhance training efficiency.
arXiv Detail & Related papers (2025-04-12T07:55:11Z)
DSMoE: Matrix-Partitioned Experts with Dynamic Routing for Computation-Efficient Dense LLMs [70.91804882618243]
This paper proposes DSMoE, a novel approach that achieves sparsification by partitioning pre-trained FFN layers into computational blocks.<n>We implement adaptive expert routing using sigmoid activation and straight-through estimators, enabling tokens to flexibly access different aspects of model knowledge.<n>Experiments on LLaMA models demonstrate that under equivalent computational constraints, DSMoE achieves superior performance compared to existing pruning and MoE approaches.
arXiv Detail & Related papers (2025-02-18T02:37:26Z)
A Survey on Inference Optimization Techniques for Mixture of Experts Models [50.40325411764262]
Large-scale Mixture of Experts (MoE) models offer enhanced model capacity and computational efficiency through conditional computation.<n> deploying and running inference on these models presents significant challenges in computational resources, latency, and energy efficiency.<n>This survey analyzes optimization techniques for MoE models across the entire system stack.
arXiv Detail & Related papers (2024-12-18T14:11:15Z)
HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference. Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency. HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer. We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging. We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z)
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models. Our approach employs activation sparsity to extract experts. Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference [13.263938935671646]
AdapMoE is an algorithm-system co-design framework for efficient MoE inference. AdapMoE features adaptive expert gating and management to reduce the on-demand loading overheads. We show AdapMoE consistently outperforms existing techniques, reducing the average number of activated experts by 25% and achieving a 1.35x speedup without degradation accuracy.
arXiv Detail & Related papers (2024-08-19T03:27:15Z)
Resource Management for Low-latency Cooperative Fine-tuning of Foundation Models at the Network Edge [35.40849522296486]
Large-scale foundation models (FoMos) can perform human-like intelligence. FoMos need to be adapted to specialized downstream tasks through fine-tuning techniques. We advocate multi-device cooperation within the device-edge cooperative fine-tuning paradigm.
arXiv Detail & Related papers (2024-07-13T12:47:14Z)
SwapMoE: Serving Off-the-shelf MoE-based Large Language Models with Tunable Memory Budget [20.33693233516486]
We introduce SwapMoE, a framework for efficient serving of MoE-based large language models with tunable memory budgets. experiments have shown that SwapMoE can reduce the memory footprint while maintaining reasonable accuracy.
arXiv Detail & Related papers (2023-08-29T05:25:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.