Fugu-MT paper translation (abstract): LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

Paper abstract: LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

arxiv url: http://arxiv.org/abs/2510.09665v1
Date: Wed, 08 Oct 2025 00:15:04 GMT
Status: translation complete
System update date: 2025-10-14 18:06:29.536382
Title: LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
Title (reference translation): LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference
Authors: Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, Junchen Jiang
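Abstract summary: LMCache is presented as the most efficient open-source KV caching solution so far. It extracts and stores the KV caches generated by modern LLM engines and shares them across engines and queries. Combining LMCache with vLLM is shown to improve throughput by up to 15x across workloads.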
Reference score (independently computed attention score): 27.24239725255976
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Today's LLM inference systems treat individual engines and queries independently for simplicity, but this causes significant resource inefficiencies. While there are proposals to avoid redundant computation by reusing KV caches across queries and to increase GPU utilization by disaggregating a single query to different engines, their promises cannot be realized without efficiently offloading and communicating KV cache across LLM inference engines and queries. We present LMCache, the first and so far the most efficient open-source KV caching solution, which extracts and stores KV caches generated by modern LLM engines (vLLM and SGLang) and shares the KV caches across engines and queries. LMCache exposes KV caches in the LLM engine interface, effectively transforming LLM engines from individual token processors to a collection of engines with KV cache as the storage and communication medium. In particular, it supports both cache offloading (prefix reuse across queries) and prefill-decode disaggregation (cross-engine cache transfer). LMCache's high performance and wide adoption stem from the following contributions: highly optimized KV cache data movement with performance optimizations including batched data movement operations, compute and I/O pipelining; a modular KV cache connector component, decoupling LMCache from the rapid evolution of inference engines; a first-class control API, such as pinning, lookup, cleanup, movement, and compression, for flexible cache orchestration across GPU, CPU, storage, and network layers. Evaluation shows that combining LMCache with vLLM achieves up to 15x improvement in throughput across diverse workloads. With a growing community, LMCache has seen dramatic growth in adoption by enterprise inference systems, which provides valuable lessons for future KV caching solutions. The source code of LMCache is at: https://github.com/LMCache/LMCache.
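Abstract (reference translation): Today's LLM inference systems treat individual engines and queries independently for simplicity, but this causes significant resource inefficiencies. Proposals exist to avoid redundant computation by reusing KV caches across queries and to raise GPU utilization by disaggregating a single query across different engines, but their promises cannot be realized without efficient offloading and communication of KV caches across LLM inference engines and queries. We present LMCache, an open-source KV caching solution that extracts and stores the KV caches generated by modern LLM engines (vLLM and SGLang) and shares them across engines and queries. LMCache exposes KV caches in the LLM engine interface, effectively transforming LLM engines from individual token processors into a collection of engines that use the KV cache as their storage and communication medium. In particular, it supports both cache offloading (prefix reuse across queries) and prefill-decode disaggregation (cross-engine cache transfer). Its contributions are: highly optimized KV cache data movement, with performance optimizations including batched data movement operations and compute and I/O pipelining; a modular KV cache connector component that decouples LMCache from the rapid evolution of inference engines; and a first-class control API (pinning, lookup, cleanup, movement, and compression) for flexible cache orchestration across GPU, CPU, storage, and network layers. Combining LMCache with vLLM improves throughput by up to 15x across diverse workloads. As its community grows, LMCache has seen dramatic growth in adoption by enterprise inference systems, which provides valuable lessons for future KV caching solutions. The source code of LMCache is available at the repository linked in the abstract above.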
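Illustrative sketch (not from the paper): the abstract mentions prefix reuse across queries and control operations such as lookup and pinning. The toy Python below shows one way such a cache could be keyed, assuming fixed-size token chunks whose keys hash the whole prefix up to that chunk. Every name here (`ToyKVStore`, `chunk_keys`, `CHUNK`) is hypothetical and is not the LMCache API; the stored payloads are placeholder strings rather than real KV tensors, and the LRU eviction policy is an assumption made for the demo.

```python
import hashlib
from collections import OrderedDict

CHUNK = 4  # tokens per chunk (hypothetical value, kept small for the demo)


def chunk_keys(tokens, chunk=CHUNK):
    """Return one key per full chunk; each key hashes the entire prefix so far."""
    h = hashlib.sha256()
    keys = []
    for end in range(chunk, len(tokens) + 1, chunk):
        h.update(repr(tokens[end - chunk:end]).encode())
        keys.append(h.copy().hexdigest())
    return keys


class ToyKVStore:
    """Toy prefix cache: LRU eviction that skips pinned entries."""

    def __init__(self, capacity):
        self.capacity = capacity       # maximum number of cached chunks
        self.entries = OrderedDict()   # key -> placeholder KV payload
        self.pinned = set()            # keys exempt from eviction

    def store(self, tokens):
        """Cache a placeholder payload for every full chunk of `tokens`."""
        for i, key in enumerate(chunk_keys(tokens)):
            self.entries[key] = f"kv-chunk-{i}"
            self.entries.move_to_end(key)
        self._evict()

    def lookup(self, tokens):
        """Return how many leading tokens already have cached KV."""
        hit = 0
        for key in chunk_keys(tokens):
            if key not in self.entries:
                break
            self.entries.move_to_end(key)  # mark as recently used
            hit += CHUNK
        return hit

    def pin(self, tokens):
        """Protect the cached chunks of `tokens` from eviction."""
        self.pinned.update(k for k in chunk_keys(tokens) if k in self.entries)

    def _evict(self):
        """Drop least recently used unpinned chunks until within capacity."""
        for key in list(self.entries):
            if len(self.entries) <= self.capacity:
                break
            if key not in self.pinned:
                del self.entries[key]


store = ToyKVStore(capacity=8)
system_prompt = list(range(8))                     # shared 8-token prefix
store.store(system_prompt + [100, 101, 102, 103])  # first query
reused = store.lookup(system_prompt + [200, 201, 202, 203])  # second query
print(reused)  # 8: only the shared prefix is reusable
```

Because each key covers the full prefix, a chunk only matches when everything before it matches too, which is what makes the hit count a valid reusable prefix length. A real system would also have to move the actual KV tensors between GPU, CPU, storage, and network tiers, which this sketch leaves out.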

Related papers