SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference
- URL: http://arxiv.org/abs/2507.06567v1
- Date: Wed, 09 Jul 2025 05:43:43 GMT
- Title: SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference
- Authors: Qian Chen, Xianhao Chen, Kaibin Huang
- Abstract summary: Mixture-of-Experts (MoE) models activate only a small subset of relevant experts per input. The sheer number of expert networks in an MoE model introduces a significant storage burden for an edge device. We propose a greedy decomposition method to decompose the original problem into a series of subproblems.
- Score: 29.49615352723995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model introduces a significant storage burden for an edge device. To address this challenge, we consider a scenario where experts are dispersed within an edge network for distributed inference. Based on the popular Top-$K$ expert selection strategy, we formulate a latency minimization problem by optimizing expert caching on edge servers under storage constraints. When $K=1$, the problem reduces to a monotone submodular maximization problem with knapsack constraints, for which we design a greedy-based algorithm with a $(1 - 1/e)$-approximation guarantee. For the general case where $K\geq1$, expert co-activation within the same MoE layer introduces non-submodularity, causing greedy methods to be ineffective. To tackle this issue, we propose a successive greedy decomposition method to decompose the original problem into a series of subproblems, with each being solved by a dynamic programming approach. Furthermore, we design an accelerated algorithm based on the max-convolution technique to obtain the approximate solution with a provable guarantee in polynomial time. Simulation results on various MoE models demonstrate that our method significantly reduces inference latency compared to existing baselines.
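To make the $K=1$ case concrete, the sketch below illustrates cost-benefit greedy caching of experts on a single edge server under a storage (knapsack) budget. This is a minimal, hypothetical illustration, not the paper's implementation: the expert ids, `size` map, and `gain` function are placeholder assumptions, and the partial-enumeration details behind the $(1-1/e)$ guarantee, as well as the dynamic-programming and max-convolution machinery for $K \geq 1$, are not reproduced here.

```python
# Illustrative sketch only (assumed interfaces, not the paper's code):
# greedily cache experts by marginal latency reduction per unit storage,
# treating gain(S) as a monotone submodular set function (the K=1 setting).

def greedy_expert_caching(experts, size, gain, budget):
    """experts: iterable of expert ids
    size[e]: storage cost of expert e
    gain(S): latency reduction from caching the set S (assumed monotone submodular)
    budget:  storage capacity of the edge server
    """
    cached, used = set(), 0
    remaining = set(experts)
    while remaining:
        best, best_ratio = None, 0.0
        for e in remaining:
            if used + size[e] > budget:
                continue  # expert does not fit in the remaining storage
            marginal = gain(cached | {e}) - gain(cached)
            ratio = marginal / size[e]
            if ratio > best_ratio:
                best, best_ratio = e, ratio
        if best is None:  # nothing fits or no positive marginal gain
            break
        cached.add(best)
        used += size[best]
        remaining.remove(best)
    return cached


# Toy usage with a hypothetical additive gain, just to show the call shape.
if __name__ == "__main__":
    experts = ["e1", "e2", "e3", "e4"]
    size = {"e1": 2, "e2": 3, "e3": 1, "e4": 4}
    popularity = {"e1": 5.0, "e2": 4.0, "e3": 3.5, "e4": 6.0}
    gain = lambda S: sum(popularity[e] for e in S)
    print(greedy_expert_caching(experts, size, gain, budget=5))
```

For $K \geq 1$, per the abstract, the authors instead decompose the (non-submodular) problem into subproblems solved by dynamic programming, accelerated with max-convolution; the greedy sketch above only conveys the $K=1$ intuition.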
Related papers
- Near-Optimal Online Learning for Multi-Agent Submodular Coordination: Tight Approximation and Communication Efficiency [52.60557300927007]
We present an $\textbf{MA-OSMA}$ algorithm that transfers the discrete submodular problem into a continuous optimization problem. We also introduce a projection-free $\textbf{MA-OSEA}$ algorithm, which effectively utilizes the KL divergence by mixing in a uniform distribution. Our algorithms significantly improve on the $(\frac{1}{1+c})$-approximation provided by the state-of-the-art OSG algorithm.
arXiv Detail & Related papers (2025-02-07T15:57:56Z) - A single-loop SPIDER-type stochastic subgradient method for expectation-constrained nonconvex nonsmooth optimization [17.25924791071807]
We present a novel type of subgradient algorithm for expectation-constrained problems. We show that our method is significantly faster than two state-of-the-art algorithms.
arXiv Detail & Related papers (2025-01-31T15:18:52Z) - Fast Semisupervised Unmixing Using Nonconvex Optimization [80.11512905623417]
We introduce a novel model for semi-supervised/library-based unmixing.
We demonstrate the efficacy of the proposed alternating optimization methods for sparse unmixing.
arXiv Detail & Related papers (2024-01-23T10:07:41Z) - Multi-Agent Bayesian Optimization with Coupled Black-Box and Affine
Constraints [21.38692458445459]
We study the problem of distributed multi-agent Bayesian optimization with both coupled black-box constraints and known affine constraints.
A primal-dual distributed algorithm is proposed that achieves similar regret/violation bounds as those in the single-agent case.
arXiv Detail & Related papers (2023-10-02T08:07:36Z) - Distributed Extra-gradient with Optimal Complexity and Communication
Guarantees [60.571030754252824]
We consider monotone variational inequality (VI) problems in multi-GPU settings where multiple processors/workers/clients have access to local dual vectors.
Extra-gradient, which is a de facto algorithm for monotone VI problems, has not been designed to be communication-efficient.
We propose a quantized generalized extra-gradient (Q-GenX), which is an unbiased and adaptive compression method tailored to solve VIs.
arXiv Detail & Related papers (2023-08-17T21:15:04Z) - On Optimal Caching and Model Multiplexing for Large Model Inference [66.50550915522551]
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges.
We study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing.
arXiv Detail & Related papers (2023-06-03T05:01:51Z) - Diffusion models as plug-and-play priors [98.16404662526101]
We consider the problem of inferring high-dimensional data $\mathbf{x}$ in a model that consists of a prior $p(\mathbf{x})$ and an auxiliary constraint $c(\mathbf{x},\mathbf{y})$.
The structure of diffusion models allows us to perform approximate inference by iterating differentiation through the fixed denoising network enriched with different amounts of noise.
arXiv Detail & Related papers (2022-06-17T21:11:36Z) - Fast Adaptive Non-Monotone Submodular Maximization Subject to a Knapsack
Constraint [13.357957711519134]
Constrained submodular problems encompass a variety of applications, including personalized recommendation, team formation, and revenue maximization via viral marketing.
We present a simple greedy algorithm that achieves a randomized $5.83$-approximation and runs in $O(n \log n)$ time, i.e., at least a factor $n$ faster than other state-of-the-art algorithms.
We also obtain a $9$-approximation, which is the first constant approximation for non-monotone objectives. Experimental evaluation of our algorithms showcases their improved performance on real and synthetic data.
arXiv Detail & Related papers (2020-07-09T18:15:01Z) - Submodular Bandit Problem Under Multiple Constraints [8.100450025624443]
We introduce a submodular bandit problem under the intersection of $l$ knapsacks and a $k$-system constraint.
To solve this problem, we propose a non-greedy algorithm that adaptively focuses on a standard or modified upper-confidence bound.
We provide a high-probability upper bound of an approximation regret, where the approximation ratio matches that of a fast algorithm.
arXiv Detail & Related papers (2020-06-01T01:28:44Z) - Artificial Intelligence Assisted Collaborative Edge Caching in Small
Cell Networks [19.605382256630538]
This paper considers heterogeneous content preferences of users together with heterogeneous caching models at the edge nodes.
We propose a modified particle swarm optimization (M-PSO) algorithm that efficiently solves the complex constraint problem in a reasonable time.
arXiv Detail & Related papers (2020-05-16T10:39:46Z) - GACEM: Generalized Autoregressive Cross Entropy Method for Multi-Modal
Black Box Constraint Satisfaction [69.94831587339539]
We present a modified Cross-Entropy Method (CEM) that uses a masked auto-regressive neural network for modeling uniform distributions over the solution space.
Our algorithm is able to express complicated solution spaces, thus allowing it to track a variety of different solution regions.
arXiv Detail & Related papers (2020-02-17T20:21:20Z)