ProMoE: Fast MoE-based LLM Serving using Proactive Caching
- URL: http://arxiv.org/abs/2410.22134v1
- Date: Tue, 29 Oct 2024 15:31:27 GMT
- Title: ProMoE: Fast MoE-based LLM Serving using Proactive Caching
- Authors: Xiaoniu Song, Zihang Zhong, Rong Chen,
- Abstract summary: Mixture-of-Experts (MoE) models help mitigate this issue by activating only a subset of the model's parameters during computation.
We propose ProMoE, a novel proactive caching system that leverages intermediate model results to predict subsequent parameter usage.
Our evaluations demonstrate that ProMoE achieves an average speedup of 2.13x and 2.84x in the prefill and decode stages respectively.
- Score: 2.041412657843408
- Abstract: The promising applications of large language models are often constrained by the limited GPU memory capacity available on edge devices. Mixture-of-Experts (MoE) models help mitigate this issue by activating only a subset of the model's parameters during computation, allowing the unused parameters to be offloaded to host memory and reducing overall GPU memory demand. However, existing cache-based offloading solutions handle cache misses reactively and significantly impact system performance. In this paper, we propose ProMoE, a novel proactive caching system that leverages intermediate model results to predict subsequent parameter usage. By proactively fetching experts in advance, ProMoE removes the loading time from the critical path and diminishes the performance overhead of offloading. Our evaluations demonstrate that ProMoE achieves an average speedup of 2.13x and 2.84x in the prefill and decode stages respectively, compared to existing offloading solutions.
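The abstract's core mechanism, predicting upcoming expert usage from intermediate results and fetching those experts off the critical path, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the `ExpertCache` class, its `prefetch`/`get` methods, and the FIFO eviction are hypothetical and do not reflect ProMoE's actual implementation or interface.

```python
# Hypothetical sketch of proactive expert prefetching, loosely following the idea
# in the abstract: predicted experts are copied host->GPU on a side stream so the
# transfer overlaps with computation; only misses block the critical path.
# All names are illustrative, not ProMoE's API. A real system would also need
# locking around the cache and a smarter eviction/prediction policy.
import queue
import threading

import torch


class ExpertCache:
    """GPU-side cache of MoE expert weights, backed by host (CPU) memory."""

    def __init__(self, cpu_experts, capacity):
        self.cpu_experts = cpu_experts          # {(layer, expert_id): {name: cpu tensor}}
        self.capacity = capacity                # max experts resident on GPU
        self.gpu_cache = {}                     # {(layer, expert_id): {name: gpu tensor}}
        self.prefetch_queue = queue.Queue()
        self.copy_stream = torch.cuda.Stream()  # side stream so copies overlap compute
        threading.Thread(target=self._prefetch_worker, daemon=True).start()

    def prefetch(self, layer, expert_ids):
        """Called with *predicted* experts for an upcoming layer (non-blocking)."""
        for e in expert_ids:
            if (layer, e) not in self.gpu_cache:
                self.prefetch_queue.put((layer, e))

    def get(self, layer, expert_id):
        """Called on the critical path; hits are free, misses load synchronously."""
        key = (layer, expert_id)
        if key not in self.gpu_cache:           # reactive fallback on a cache miss
            self._load(key, non_blocking=False)
        return self.gpu_cache[key]

    def _prefetch_worker(self):
        while True:
            key = self.prefetch_queue.get()
            with torch.cuda.stream(self.copy_stream):
                self._load(key, non_blocking=True)

    def _load(self, key, non_blocking):
        if len(self.gpu_cache) >= self.capacity:
            # Naive FIFO eviction, kept short for the sketch.
            self.gpu_cache.pop(next(iter(self.gpu_cache)))
        self.gpu_cache[key] = {
            name: w.to("cuda", non_blocking=non_blocking)
            for name, w in self.cpu_experts[key].items()
        }
```

In a serving loop, a lightweight predictor (for example, the next layer's gate evaluated on the current hidden state) would call `cache.prefetch(layer + 1, predicted_ids)` while the current layer computes, so the host-to-GPU copies overlap with computation and `get()` mostly hits on the critical path.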
Related papers
- fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving [9.956997242640728]
fMoE is a fine-grained expert offloading system for MoE serving.
We show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
arXiv Detail & Related papers (2025-02-07T22:51:17Z) - Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing [66.66090399385304]
Ca2-VDM is an efficient autoregressive VDM with Causal generation and Cache sharing.
For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps.
For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost.
arXiv Detail & Related papers (2024-11-25T13:33:41Z) - InstCache: A Predictive Cache for LLM Serving [9.878166964839512]
We propose to predict user instructions with an instruction-aligned LLM and store them in a predictive cache, termed InstCache.
Experimental results show that InstCache can achieve up to 51.34% hit rate on LMSys dataset, which corresponds to a 2x speedup, at a memory cost of only 4.5GB.
arXiv Detail & Related papers (2024-11-21T03:52:41Z) - HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low-precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision Auto-Regressive LINear kernels (MARLIN).
It shows that batch sizes up to 16-32 can be supported with close to the maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundant caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - MoNDE: Mixture of Near-Data Experts for Large-Scale Sparse Models [15.346491299728463]
MoNDE reduces the volume of MoE parameter movement by transferring only the $\textit{hot}$ experts to the GPU.
MoNDE enables far more communication-efficient MoE inference, thereby resulting in substantial speedups.
arXiv Detail & Related papers (2024-05-29T07:23:29Z) - Pre-gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference [23.207326766883405]
Mixture-of-Experts (MoE) is able to scale its model size without proportionally scaling up its computational requirements.
Pre-gated MoE employs our novel pre-gating function which alleviates the dynamic nature of sparse expert activation.
We demonstrate that Pre-gated MoE is able to improve performance, reduce GPU memory consumption, while also maintaining the same level of model quality.
arXiv Detail & Related papers (2023-08-23T11:25:37Z) - Training Personalized Recommendation Systems from (GPU) Scratch: Look Forward not Backwards [1.7733623930581417]
Personalized recommendation models (RecSys) are one of the most popular machine learning workloads serviced by hyperscalers.
A critical challenge of training RecSys is its high memory capacity requirements, reaching hundreds of GBs to TBs of model size.
In RecSys, the so-called embedding layers account for the majority of memory usage, so current systems employ a hybrid CPU-GPU design in which the large CPU memory stores the memory-hungry embedding layers.
arXiv Detail & Related papers (2022-05-10T07:05:20Z) - Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
arXiv Detail & Related papers (2022-05-08T02:24:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.