Fugu-MT Paper Translation (Abstract): TinyServe: Query-Aware Cache Selection for Efficient LLM Serving

Paper summary: TinyServe: Query-Aware Cache Selection for Efficient LLM Serving

arxiv url: http://arxiv.org/abs/2509.12211v1
Date: Thu, 28 Aug 2025 16:17:18 GMT
Status: Translation complete
Last updated in system: 2025-09-21 06:05:45.801547
Title: TinyServe: Query-Aware Cache Selection for Efficient LLM Serving
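Title (reference translation): TinyServe: Query-Aware Cache Selection for Efficient LLM Execution
Authors: Dong Liu, Yanxuan Yu
Abstract summary: This paper proposes TinyServe for serving large language models (LLMs) efficiently. TinyServe executes real-time decoding with sparsity strategies and fine-grained instrumentation. In the experiments, TinyServe achieves up to 3.4x speedup and over 2x memory savings with negligible accuracy drop.
Reference score (independently computed attention metric): 5.216774377033164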
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Serving large language models (LLMs) efficiently remains challenging due to the high memory and latency overhead of key-value (KV) cache access during autoregressive decoding. We present TinyServe, a lightweight and extensible serving system for deploying tiny LLMs (e.g., TinyLLaMA, GPT2-345M) with support for structured KV sparsity, plugin-based token selection, and hardware-efficient attention kernels. Unlike prior simulation frameworks, TinyServe executes real-time decoding with configurable sparsity strategies and fine-grained instrumentation. To reduce decoding cost, we introduce a query-aware page selection mechanism that leverages bounding-box metadata to estimate attention relevance between the query and KV cache blocks. This enables selective KV loading with minimal overhead and no model modifications. Our fused CUDA kernel integrates page scoring, sparse memory access, and masked attention in a single pass. Experiments show that TinyServe achieves up to 3.4x speedup and over 2x memory savings with negligible accuracy drop. Additional analysis of cache reuse, page hit rate, and multi-GPU scaling confirms its practicality as an efficient system-level design for LLM training and inference research on resource-constrained hardware.
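Abstract (reference translation): Serving large language models (LLMs) efficiently remains difficult because of the high memory and latency overhead of key-value (KV) cache access during autoregressive decoding. We describe TinyServe, a lightweight and extensible serving system for deploying small LLMs (e.g., TinyLLaMA, GPT2-345M). Unlike prior simulation frameworks, TinyServe executes real-time decoding with configurable sparsity strategies and fine-grained instrumentation. To reduce decoding cost, we introduce a query-aware page selection mechanism that uses bounding-box metadata to estimate attention relevance between the query and KV cache blocks. This enables selective KV loading with minimal overhead and no model modifications. Our fused CUDA kernel integrates page scoring, sparse memory access, and masked attention in a single pass. Experiments show that TinyServe achieves a 3.4x speedup and 2x memory savings with negligible accuracy drop. Further analysis of cache reuse, page hit rate, and multi-GPU scaling confirms its practicality as an efficient system-level design for LLM training and inference research on resource-constrained hardware.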
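
Illustrative sketch (not from the paper): the abstract does not spell out how bounding-box metadata is turned into a relevance score, so the following pure-Python example assumes one plausible reading. Each KV page stores the per-channel minimum and maximum of its keys, and a page's score is an upper bound on the dot product between the query and any key inside that box. All function names, the page layout, and the top-k selection rule are hypothetical and are not TinyServe's actual API or CUDA kernel.

```python
import math

# Hypothetical sketch: names, page layout, and scoring rule are assumptions.

def page_bounds(keys, page_size):
    """Split keys (a list of d-dim vectors) into pages and record each
    page's per-channel (min, max), i.e. its bounding box."""
    bounds = []
    for start in range(0, len(keys), page_size):
        page = keys[start:start + page_size]
        lo = [min(col) for col in zip(*page)]
        hi = [max(col) for col in zip(*page)]
        bounds.append((lo, hi))
    return bounds


def page_score(query, lo, hi):
    """Upper bound on q . k for any key k with lo <= k <= hi per channel."""
    return sum(max(q * l, q * h) for q, l, h in zip(query, lo, hi))


def select_pages(query, bounds, top_k):
    """Return the indices of the top_k highest-scoring pages, in order."""
    scores = [page_score(query, lo, hi) for lo, hi in bounds]
    ranked = sorted(range(len(bounds)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


def sparse_attention(query, keys, values, page_size, top_k):
    """Softmax attention restricted to the keys of the selected pages."""
    pages = select_pages(query, page_bounds(keys, page_size), top_k)
    idx = [i for p in pages
           for i in range(p * page_size, min((p + 1) * page_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [scale * sum(q * k for q, k in zip(query, keys[i])) for i in idx]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    return [sum(w * values[i][c] for w, i in zip(weights, idx)) / total
            for c in range(len(values[0]))]
```

Because only the small per-page metadata is read during scoring, the full keys and values of unselected pages never need to be loaded. When `top_k` equals the number of pages, the sketch reduces to ordinary dense attention.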

Related papers