Cache Management for Mixture-of-Experts LLMs -- extended version
- URL: http://arxiv.org/abs/2509.02408v1
- Date: Tue, 02 Sep 2025 15:19:06 GMT
- Title: Cache Management for Mixture-of-Experts LLMs -- extended version
- Authors: Spyros Angelopoulos, Loris Marchal, Adrien Obrecht, Bertrand Simon
- Abstract summary: Large language models (LLMs) have demonstrated remarkable capabilities across a variety of tasks. One of the main challenges towards the successful deployment of LLMs is memory management. We introduce and study a new paging problem that models expert management optimization.
- Score: 29.858964433575906
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across a variety of tasks. One of the main challenges towards the successful deployment of LLMs is memory management, since they typically involve billions of parameters. To this end, architectures based on Mixture-of-Experts have been proposed, which aim to reduce the size of the parameters that are activated when producing a token. This raises the equally critical issue of efficiently managing the limited cache of the system, in that frequently used experts should be stored in the fast cache rather than in the slower secondary memory. In this work, we introduce and study a new paging problem that models expert management optimization. Our formulation captures both the layered architecture of LLMs and the requirement that experts are cached efficiently. We first present lower bounds on the competitive ratio of both deterministic and randomized algorithms, which show that under mild assumptions, LRU-like policies have good theoretical competitive performance. We then propose a layer-based extension of LRU that is tailored to the problem at hand. Extensive simulations on both synthetic datasets and actual traces of MoE usage show that our algorithm outperforms policies for the classic paging problem, such as the standard LRU.
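The abstract describes the proposed policy only at a high level: a layer-based extension of LRU that exploits the fact that an MoE model requests experts layer by layer within each token's forward pass. The Python sketch below illustrates one plausible layer-aware eviction rule under that assumption; the class `LayerAwareLRUCache`, its heuristic of preferentially evicting experts from layers the current pass has already traversed, and all names are illustrative and not taken from the paper.

```python
from collections import OrderedDict


class LayerAwareLRUCache:
    """Hypothetical layer-aware LRU cache for MoE experts (illustration only).

    Eviction heuristic (an assumption, not the paper's algorithm): prefer to
    evict the least recently used expert from a layer the current token's
    forward pass has already traversed, since it cannot be requested again
    before the next token; otherwise fall back to plain LRU eviction.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        # (layer, expert_id) keys, ordered from least to most recently used.
        self.cache = OrderedDict()

    def access(self, layer, expert_id):
        """Request expert `expert_id` at `layer`; return True on a cache hit."""
        key = (layer, expert_id)
        if key in self.cache:
            self.cache.move_to_end(key)   # refresh recency on a hit
            return True
        if len(self.cache) >= self.capacity:
            self._evict(current_layer=layer)
        self.cache[key] = None            # fetch the expert into the fast cache
        return False

    def _evict(self, current_layer):
        # Scan from LRU to MRU and evict the first expert from an already
        # processed layer (index strictly below the current layer).
        for key in self.cache:
            if key[0] < current_layer:
                del self.cache[key]
                return
        # No such expert is cached: evict the overall least recently used one.
        self.cache.popitem(last=False)


if __name__ == "__main__":
    # A toy trace of (layer, expert_id) requests for two consecutive tokens.
    cache = LayerAwareLRUCache(capacity=3)
    trace = [(0, 4), (1, 2), (2, 7),   # token 1
             (0, 4), (1, 5), (2, 7)]   # token 2
    misses = sum(not cache.access(layer, e) for layer, e in trace)
    print(f"misses: {misses} / {len(trace)}")
```

Dropping the layer check in `_evict` recovers standard LRU, the baseline that the simulations described in the abstract compare against.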
Related papers
- URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding [55.45331924836242]
We present URaG, a framework that Unifies Retrieval and Generation within a single MLLM. We show that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%.
arXiv Detail & Related papers (2025-11-13T17:54:09Z) - Rethinking On-policy Optimization for Query Augmentation [49.87723664806526]
We present the first systematic comparison of prompting-based and RL-based query augmentation across diverse benchmarks. We introduce a novel hybrid method, On-policy Pseudo-document Query Expansion (OPQE), which learns to generate a pseudo-document that maximizes retrieval performance.
arXiv Detail & Related papers (2025-10-20T04:16:28Z) - Cluster Topology-Driven Placement of Experts Reduces Network Traffic in MoE Inference [49.141930185079325]
We propose an integer linear program (ILP) that determines the optimal placement of experts, minimizing the expected number of transmissions. We demonstrate that the ILP-based placement strategy yields lower network traffic than competitors for small-scale (DeepSeekMoE 16B) and large-scale (DeepSeek-R1 671B) models.
arXiv Detail & Related papers (2025-08-12T07:08:48Z) - SmartLLMs Scheduler: A Framework for Cost-Effective LLMs Utilization [9.615876932810126]
Large Language Models (LLMs) have shown remarkable capabilities in a variety of software engineering tasks. Existing optimization strategies for deploying LLMs for diverse tasks focus on static scheduling. We propose the SmartLLMs Scheduler (SLS), a dynamic and cost-effective scheduling solution.
arXiv Detail & Related papers (2025-08-05T09:35:52Z) - LLM4Hint: Leveraging Large Language Models for Hint Recommendation in Offline Query Optimization [7.00597706249493]
This paper explores how Large Language Models (LLMs) can be incorporated to enhance the generalization of learned query optimizers. We propose LLM4Hint, which leverages moderate-sized backbone LLMs to recommend query optimization hints.
arXiv Detail & Related papers (2025-07-04T08:32:17Z) - Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints [14.341123057506827]
Large Language Models (LLMs) are indispensable in today's applications, but their inference procedure demands significant computational resources. This paper formulates LLM inference optimization as a multi-stage online scheduling problem. We develop a fluid dynamics approximation to provide a tractable benchmark that guides algorithm design.
arXiv Detail & Related papers (2025-04-15T16:00:21Z) - Faster LLM Inference using DBMS-Inspired Preemption and Cache Replacement Policies [9.92327835631428]
This paper first analyzes LLM inference performance and focuses on a data management issue inside LLM inference. We find that inference systems lack an adequate resource cost model and optimization strategy to schedule requests. We adapt classic database techniques by building cost models for concurrent inference requests and a new cache replacement policy tailored for LLM inference.
arXiv Detail & Related papers (2024-11-12T00:10:34Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z) - SelectLLM: Query-Aware Efficient Selection Algorithm for Large Language Models [8.558834738072363]
Large language models (LLMs) have been widely adopted due to their remarkable performance across various applications. These individual LLMs show limitations in generalization and performance on complex tasks due to inherent training biases, model size constraints, and the quality or diversity of pre-training datasets. We introduce SelectLLM, which efficiently directs input queries to the most suitable subset of LLMs from a large pool.
arXiv Detail & Related papers (2024-08-16T06:11:21Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks. Experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models demonstrate the promising performance of our method in efficiency and effectiveness.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models [90.14693869269519]
MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes.
This paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques.
arXiv Detail & Related papers (2024-02-22T18:56:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.