ProMoE: Fast MoE-based LLM Serving using Proactive Caching
- URL: http://arxiv.org/abs/2410.22134v1
- Date: Tue, 29 Oct 2024 15:31:27 GMT
- Title: ProMoE: Fast MoE-based LLM Serving using Proactive Caching
- Authors: Xiaoniu Song, Zihang Zhong, Rong Chen,
- Abstract summary: Mixture-of-Experts (MoE) models help mitigate this issue by activating only a subset of the model's parameters during computation.
We propose ProMoE, a novel proactive caching system that leverages intermediate model results to predict subsequent parameter usage.
Our evaluations demonstrate that ProMoE achieves an average speedup of 2.13x and 2.84x in the prefill and decode stages respectively.
- Abstract: The promising applications of large language models are often constrained by the limited GPU memory capacity available on edge devices. Mixture-of-Experts (MoE) models help mitigate this issue by activating only a subset of the model's parameters during computation, allowing the unused parameters to be offloaded to host memory and reducing overall GPU memory demand. However, existing cache-based offloading solutions handle cache misses reactively and significantly impact system performance. In this paper, we propose ProMoE, a novel proactive caching system that leverages intermediate model results to predict subsequent parameter usage. By proactively fetching experts in advance, ProMoE removes the loading time from the critical path and diminishes the performance overhead of offloading. Our evaluations demonstrate that ProMoE achieves an average speedup of 2.13x and 2.84x in the prefill and decode stages respectively, compared to existing offloading solutions.
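Below is a minimal sketch of the proactive-caching idea described in the abstract, not ProMoE's actual implementation: a predictor reads intermediate gate scores to guess which experts the next layer will route to, and a background worker copies those experts from host memory into a small GPU-side cache so the later lookup hits instead of stalling on a host-to-GPU transfer. All names (ExpertCache, predict_next_experts, HOST_EXPERTS) are illustrative assumptions.
```python
# Illustrative sketch of proactive expert prefetching (not ProMoE's real code).
# Assumption: gate scores from the current layer correlate with the experts
# the next layer will select, so they can drive an early host->GPU copy.
import threading
from collections import OrderedDict

import numpy as np

HOST_EXPERTS = {(l, e): np.random.rand(64, 64).astype(np.float32)
                for l in range(4) for e in range(8)}  # offloaded expert weights

class ExpertCache:
    """Tiny LRU cache standing in for GPU-resident expert slots."""
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.slots = OrderedDict()
        self.lock = threading.Lock()

    def prefetch(self, layer, expert):
        """Copy an expert into the cache ahead of time (async in practice)."""
        with self.lock:
            key = (layer, expert)
            if key in self.slots:
                self.slots.move_to_end(key)
                return
            if len(self.slots) >= self.capacity:
                self.slots.popitem(last=False)       # evict LRU entry
            self.slots[key] = HOST_EXPERTS[key]      # stands in for an H2D copy

    def get(self, layer, expert):
        """Return the expert, loading it on demand only on a cache miss."""
        with self.lock:
            key = (layer, expert)
            if key not in self.slots:                # reactive miss path
                self.slots[key] = HOST_EXPERTS[key]
            self.slots.move_to_end(key)
            return self.slots[key]

def predict_next_experts(gate_scores, top_k=2):
    """Hypothetical predictor: reuse current gate scores as a proxy for the
    next layer's routing and prefetch its likely top-k experts."""
    return np.argsort(gate_scores)[-top_k:].tolist()

cache = ExpertCache()
gate_scores = np.random.rand(8)                      # intermediate model result
likely = predict_next_experts(gate_scores)
worker = threading.Thread(target=lambda: [cache.prefetch(1, e) for e in likely])
worker.start()                                       # overlaps with current-layer compute
worker.join()
weights = [cache.get(1, e) for e in likely]          # hits: no load on the critical path
```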
Related papers
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed-precision expert offloading system that enables flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
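A small sketch of the stated insight, with hypothetical names: when a less critical expert misses the GPU cache, serve a pre-quantized low-precision copy that is cheaper to transfer instead of the full-precision weights.
```python
# Illustrative sketch of falling back to a low-precision expert on a cache miss.
# The criticality threshold and data layout are assumptions, not HOBBIT's design.
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0
    return (w / scale).round().astype(np.int8), scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
fp16_experts = {e: rng.standard_normal((64, 64)).astype(np.float16) for e in range(8)}
int8_experts = {e: quantize_int8(w.astype(np.float32)) for e, w in fp16_experts.items()}
gpu_cache = {0, 1}  # experts already resident on the GPU

def fetch_expert(expert_id, criticality, threshold=0.5):
    """On a miss, critical experts pay the full fp16 load; less critical ones
    fall back to the cheaper-to-transfer int8 copy."""
    if expert_id in gpu_cache:
        return fp16_experts[expert_id].astype(np.float32)
    if criticality >= threshold:
        return fp16_experts[expert_id].astype(np.float32)   # slow, exact load
    q, scale = int8_experts[expert_id]
    return dequantize(q, scale)                              # fast, approximate load

w = fetch_expert(expert_id=5, criticality=0.2)
```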
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes of up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques such as asynchronous memory access, complex task scheduling, and pipelining.
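MARLIN itself is a GPU kernel; the numpy sketch below only shows the group-wise 4-bit weight-only dequantize-then-multiply computation such kernels accelerate, not the asynchronous memory access, scheduling, or pipelining.
```python
# Sketch of the math a weight-only 4-bit linear kernel evaluates (numpy;
# no kernel-level tricks such as async copies or pipelining are modeled).
import numpy as np

rng = np.random.default_rng(0)
in_f, out_f, group = 128, 64, 32
w = rng.standard_normal((in_f, out_f)).astype(np.float32)

# Per-group symmetric 4-bit quantization along the input dimension.
w_g = w.reshape(in_f // group, group, out_f)
scales = np.abs(w_g).max(axis=1, keepdims=True) / 7.0
q = np.clip(np.round(w_g / scales), -8, 7).astype(np.int8)

def quantized_linear(x):
    """Dequantize group-wise and multiply; real kernels fuse these steps."""
    w_deq = (q.astype(np.float32) * scales).reshape(in_f, out_f)
    return x @ w_deq

x = rng.standard_normal((16, in_f)).astype(np.float32)   # batch of 16
y = quantized_linear(x)
```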
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - ThinK: Thinner Key Cache by Query-Driven Pruning [63.13363917871414]
Large Language Models (LLMs) have revolutionized the field of natural language processing, achieving unprecedented performance across a variety of applications.
This paper focuses on the long-context scenario, addressing the inefficiencies in KV cache memory consumption during inference.
We propose ThinK, a novel query-dependent KV cache pruning method designed to minimize attention weight loss while selectively pruning the least significant channels.
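A rough numpy sketch of query-driven key-channel pruning in this spirit; the channel score used here is an assumed stand-in for ThinK's criterion of minimizing attention-weight loss.
```python
# Rough sketch of query-driven pruning of key-cache channels (numpy).
# The score (|Q| * |K| channel energy) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 64))    # recent queries, head_dim = 64
k = rng.standard_normal((512, 64))   # cached keys for 512 tokens

def prune_key_channels(q, k, keep_ratio=0.5):
    # Energy each channel contributes to q @ k.T, summed over queries and keys.
    score = np.abs(q).sum(axis=0) * np.abs(k).sum(axis=0)
    keep = np.argsort(score)[-int(k.shape[1] * keep_ratio):]
    return k[:, keep], keep

k_pruned, kept_channels = prune_key_channels(q, k)
# At attention time, slice the query with the same channel index:
attn_logits = q[:, kept_channels] @ k_pruned.T
```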
arXiv Detail & Related papers (2024-07-30T17:59:08Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant cache entries.
For instruction encoding, the importance of cached entries is evaluated by their frequency.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
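A toy numpy sketch of importance-driven cache merging; the frequency-based score and the fold-into-nearest-neighbor merge rule are illustrative assumptions, not Elastic Cache's exact formulation.
```python
# Toy sketch of importance-driven KV cache merging (numpy).
import numpy as np

rng = np.random.default_rng(0)
keys = rng.standard_normal((64, 32))          # 64 cached tokens
values = rng.standard_normal((64, 32))
freq = rng.random(64)                          # how often each entry was attended

def merge_cache(keys, values, freq, keep=32):
    order = np.argsort(freq)[::-1]
    kept = np.sort(order[:keep])               # indices of important entries
    new_k, new_v = keys[kept].copy(), values[kept].copy()
    for idx in order[keep:]:                   # fold each pruned entry into the
        nearest = np.abs(kept - idx).argmin()  # closest kept neighbor
        new_k[nearest] = (new_k[nearest] + keys[idx]) / 2
        new_v[nearest] = (new_v[nearest] + values[idx]) / 2
    return new_k, new_v

k_merged, v_merged = merge_cache(keys, values, freq)
```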
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models [15.346491299728463]
MoNDE reduces the volume of MoE parameter movement by transferring only the $\textit{hot}$ experts to the GPU.
MoNDE enables far more communication-efficient MoE inference, thereby resulting in substantial speedups.
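A schematic sketch of the hot/cold split this describes; the activation-count threshold and the host-side compute path are illustrative assumptions.
```python
# Schematic hot/cold expert split (numpy). Only frequently activated experts
# are copied to the GPU; cold ones are computed where their weights live.
import numpy as np

rng = np.random.default_rng(0)
experts = {e: rng.standard_normal((64, 64)).astype(np.float32) for e in range(16)}
activation_counts = {e: int(rng.integers(0, 100)) for e in experts}

HOT_THRESHOLD = 50   # assumed cutoff for "hot"
gpu_experts = {e for e, c in activation_counts.items() if c >= HOT_THRESHOLD}

def run_expert(expert_id, x):
    w = experts[expert_id]
    if expert_id in gpu_experts:
        return x @ w   # weights already resident on the GPU
    # Cold expert: compute near the data (host side) and ship back only the
    # small activation tensor instead of moving the large weight matrix.
    return x @ w       # same math, different placement in practice

x = rng.standard_normal((1, 64)).astype(np.float32)
y = run_expert(expert_id=3, x=x)
```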
arXiv Detail & Related papers (2024-05-29T07:23:29Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
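A sketch of a recency-aware eviction rule in this spirit; the "retain what recent queries attend to" scoring and the window size are assumptions, not CORM's exact policy.
```python
# Sketch of KV cache eviction driven by recent queries (numpy).
import numpy as np

rng = np.random.default_rng(0)
keys = rng.standard_normal((256, 64))
values = rng.standard_normal((256, 64))
recent_q = rng.standard_normal((8, 64))        # small window of recent queries

def evict(keys, values, recent_q, budget=128):
    logits = recent_q @ keys.T                             # (8, 256)
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    score = probs.max(axis=0)                              # strongest recent attention
    keep = np.sort(np.argsort(score)[-budget:])            # retain top entries, in order
    return keys[keep], values[keep]

k_kept, v_kept = evict(keys, values, recent_q)
```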
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - MoE-Infinity: Offloading-Efficient MoE Model Serving [15.826989637041907]
MoE-Infinity is an offloading-efficient serving system for sparse mixture-of-experts (MoE) models.
To optimize offloading, MoE-Infinity performs novel request-level tracing of expert activation.
MoE-Infinity exhibits superior latency performance, providing 2-20X improvements when serving various MoE models.
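A sketch of what request-level activation tracing could look like; the trace structure and the "keep the most-activated experts" policy are illustrative, not MoE-Infinity's actual design.
```python
# Sketch of request-level tracing of expert activations (pure Python).
from collections import Counter, defaultdict

class ActivationTracer:
    def __init__(self):
        self.per_request = defaultdict(lambda: defaultdict(list))
        self.global_counts = defaultdict(Counter)

    def record(self, request_id, layer, expert_ids):
        self.per_request[request_id][layer].append(expert_ids)
        self.global_counts[layer].update(expert_ids)

    def experts_to_keep(self, layer, budget=4):
        """Experts worth keeping on the GPU for this layer, by activation count."""
        return [e for e, _ in self.global_counts[layer].most_common(budget)]

tracer = ActivationTracer()
tracer.record(request_id="req-1", layer=0, expert_ids=[3, 7])
tracer.record(request_id="req-2", layer=0, expert_ids=[3, 1])
print(tracer.experts_to_keep(layer=0))   # e.g. [3, 7, 1]
```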
arXiv Detail & Related papers (2024-01-25T18:07:50Z) - Efficient Memory Management for Large Language Model Serving with PagedAttention [44.70922552274376]
High-throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time.
Existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically.
We propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems.
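A compact sketch of the paging idea: KV cache memory is carved into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so the cache can grow and shrink without large contiguous allocations. Block size and the allocator below are simplified assumptions.
```python
# Compact sketch of paged KV cache bookkeeping (pure Python).
BLOCK_SIZE = 16

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class SequenceKVCache:
    """Per-sequence block table: logical token index -> (block, offset)."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []     # i-th entry holds tokens [i*16, (i+1)*16)
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:      # current block is full
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def locate(self, token_idx):
        return self.block_table[token_idx // BLOCK_SIZE], token_idx % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=64)
seq = SequenceKVCache(alloc)
for _ in range(40):
    seq.append_token()
print(seq.locate(33))   # (block_id, offset) for the 34th token
```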
arXiv Detail & Related papers (2023-09-12T12:50:04Z) - Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [23.207326766883405]
Mixture-of-Experts (MoE) is able to scale its model size without proportionally scaling up its computational requirements.
Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation.
We demonstrate that Pre-gated MoE improves performance and reduces GPU memory consumption while maintaining the same level of model quality.
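A small numpy sketch of the pre-gating idea: the gate evaluated during layer i already decides layer i+1's experts, so their weights can be fetched while layer i is still computing. Gate shapes and names here are illustrative.
```python
# Sketch of pre-gating (numpy): the gate of layer i routes layer i+1.
import numpy as np

rng = np.random.default_rng(0)
hidden = 64
pre_gates = [rng.standard_normal((hidden, 8)) for _ in range(4)]  # gate of layer i
                                                                  # routes layer i+1

def pre_gate(layer, x, top_k=2):
    logits = x @ pre_gates[layer]                 # computed alongside layer `layer`
    return np.argsort(logits, axis=-1)[:, -top_k:]

x = rng.standard_normal((1, hidden))
next_experts = pre_gate(layer=0, x=x)             # known before layer 1 starts
# A prefetcher can now load these experts for layer 1 off the critical path.
```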
arXiv Detail & Related papers (2023-08-23T11:25:37Z) - Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards [1.7733623930581417]
Personalized recommendation models (RecSys) are one of the most popular machine learning workloads serviced by hyperscalers.
A critical challenge of training RecSys is its high memory capacity requirements, reaching hundreds of GBs to TBs of model size.
In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers.
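A sketch of such a hybrid layout; numpy stands in for both memories, and the hot-row promotion policy is an illustrative assumption.
```python
# Sketch of a hybrid CPU-GPU embedding layout: the full table stays in host
# memory, and only hot rows are mirrored in a small GPU-side cache.
from collections import Counter

import numpy as np

rng = np.random.default_rng(0)
host_table = rng.standard_normal((100_000, 64)).astype(np.float32)  # CPU memory
access_counts = Counter()
gpu_cache = {}                 # row_id -> vector, capped to a small budget
GPU_BUDGET = 10_000

def lookup(row_ids):
    out = np.empty((len(row_ids), host_table.shape[1]), dtype=np.float32)
    for i, r in enumerate(row_ids):
        access_counts[r] += 1
        if r in gpu_cache:
            out[i] = gpu_cache[r]                  # fast path: GPU-resident row
        else:
            out[i] = host_table[r]                 # CPU fetch
            if len(gpu_cache) < GPU_BUDGET:
                gpu_cache[r] = host_table[r]       # promote hot row
    return out

emb = lookup([42, 7, 42, 99])
```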
arXiv Detail & Related papers (2022-05-10T07:05:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.