Fugu-MT Paper Translation (Abstract): Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

Paper abstract: Toward Robust and Efficient ML-Based GPU Caching for Modern Inference

arxiv url: http://arxiv.org/abs/2509.20979v1
Date: Thu, 25 Sep 2025 10:23:50 GMT
Status: Translation complete
Last updated in system: 2025-09-26 20:58:12.839525
Title: Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
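Title (reference translation): Toward Robust and Efficient ML-Based GPU Caching for Modern Inference
Authors: Peng Chen, Jiaji Zhang, Hailiang Zhao, Yirong Zhang, Jiahong Yu, Xueyan Tang, Yixuan Wang, Hao Li, Jianping Zou, Gang Xiong, Kingsum Chow, Shuibing He, Shuiguang Deng
Abstract summary: The paper proposes LCR, a framework for learning-based GPU caching. Its core algorithm, LARU, enhances LRU with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. In experiments, it improves throughput by up to 24.2% and reduces P99 TTFT by up to 28.3%, outperforming widely used inference systems.
Reference score (independently computed attention score): 28.13206649836587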
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In modern GPU inference, cache efficiency remains a major bottleneck. In recommendation models, embedding hit rates largely determine throughput, while in large language models, KV-cache misses substantially increase time-to-first-token (TTFT). Heuristic policies such as LRU often struggle under structured access patterns. Learning-based approaches are promising, but in practice face two major limitations: they degrade sharply when predictions are inaccurate, or they gain little even with accurate predictions due to conservative designs. Some also incur high overhead, further limiting practicality. We present LCR, a practical framework for learning-based GPU caching that delivers performance gains while ensuring robustness and efficiency. Its core algorithm, LARU, enhances LRU with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. When predictions are accurate, LARU achieves near-optimal performance. With inaccurate predictions, it degrades gracefully to near-LRU performance. With LCR, we bridge the gap between empirical progress and theoretical advances in learning-based caching. Experiments show that LCR delivers consistent gains under realistic conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2% and reduces P99 TTFT by up to 28.3%, outperforming widely used inference systems. Even under poor predictions, its performance remains stable, demonstrating practical robustness.
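Abstract (reference translation): In recent GPU inference, cache efficiency remains a major bottleneck. In recommendation models, embedding hit rates largely determine throughput, whereas in large language models, KV-cache misses substantially increase time-to-first-token (TTFT). Heuristic policies such as LRU often struggle under structured access patterns. Learning-based approaches are promising, but in practice they face two major limitations: they degrade sharply when predictions are inaccurate, or they gain little even with accurate predictions because of conservative designs. Some also incur high overhead, which further limits practicality. We present LCR, a practical framework for learning-based GPU caching that delivers performance gains while ensuring robustness and efficiency. Its core algorithm, LARU, enhances LRU with machine-learned predictions and dynamically adapts to prediction accuracy through online error estimation. When predictions are accurate, LARU achieves near-optimal performance. With inaccurate predictions, it degrades gracefully to near-LRU performance. With LCR, we bridge the gap between empirical progress and theoretical advances in learning-based caching. Experiments show that LCR delivers consistent gains under realistic conditions. In DLRM and LLM scenarios, it improves throughput by up to 24.2% and reduces P99 TTFT by up to 28.3%, outperforming widely used inference systems. Even under poor predictions, its performance remains stable, demonstrating practical robustness.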
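
The abstract describes the idea only at a high level: augment LRU with predictions, estimate prediction error online, and fall back toward LRU when predictions are poor. The following is a minimal sketch of that general idea, not the LARU algorithm from the paper. The class name `PredictionAugmentedLRU`, the helpers `oracle_next_use` and `count_hits`, the exponential-moving-average error estimate, and the parameters `error_threshold`, `decay`, and `tolerance` are all assumptions made for illustration.

```python
from collections import OrderedDict
import math


class PredictionAugmentedLRU:
    """Toy cache: evicts by predicted next use while predictions look reliable,
    and falls back to plain LRU when the running error estimate is high.
    Illustrative only; this is not the LARU algorithm from the paper."""

    def __init__(self, capacity, error_threshold=0.5, decay=0.9, tolerance=0):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.error_threshold = error_threshold  # above this, stop trusting predictions
        self.decay = decay                      # smoothing factor for the error estimate
        self.tolerance = tolerance              # allowed gap between predicted and actual time
        self.error_rate = 0.0                   # online estimate of the misprediction rate
        self.entries = OrderedDict()            # cached keys, least recently used first
        self.last_prediction = {}               # key -> latest predicted next-use time (unbounded in this toy)
        self.clock = 0
        self.hits = 0

    def access(self, key, predicted_next_use):
        self.clock += 1
        # Score the previous prediction for this key against the actual access time.
        if key in self.last_prediction:
            wrong = abs(self.last_prediction[key] - self.clock) > self.tolerance
            self.error_rate = self.decay * self.error_rate + (1 - self.decay) * float(wrong)
        self.last_prediction[key] = predicted_next_use

        if key in self.entries:
            self.entries.move_to_end(key)  # refresh recency on a hit
            self.hits += 1
            return True
        if len(self.entries) >= self.capacity:
            self._evict()
        self.entries[key] = True
        return False

    def _evict(self):
        if self.error_rate <= self.error_threshold:
            # Predictions look reliable: evict the key predicted to be reused last.
            victim = max(self.entries, key=lambda k: self.last_prediction[k])
        else:
            # Predictions look unreliable: evict the least recently used key.
            victim = next(iter(self.entries))
        del self.entries[victim]


def oracle_next_use(trace):
    """Perfect predictions: the 1-based time of each key's next access (inf if none)."""
    next_seen, result = {}, [math.inf] * len(trace)
    for i in range(len(trace) - 1, -1, -1):
        result[i] = next_seen.get(trace[i], math.inf)
        next_seen[trace[i]] = i + 1  # access i happens at clock value i + 1
    return result


def count_hits(trace, predictions, **cache_kwargs):
    """Replay a trace and return (hit count, final error estimate)."""
    cache = PredictionAugmentedLRU(**cache_kwargs)
    for key, predicted in zip(trace, predictions):
        cache.access(key, predicted)
    return cache.hits, cache.error_rate
```

On a cyclic scan such as `list("abcabcabc")` with `capacity=2`, forcing the LRU path (`error_threshold=-1`) yields no hits, while `oracle_next_use` predictions yield 3 hits out of 9 accesses. With consistently wrong predictions the error estimate climbs past the threshold and the sketch switches to LRU eviction. Unlike the approach described in the abstract, this toy carries no robustness guarantee.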

Related papers