Fugu-MT Paper Translation (Abstract): FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving

Paper overview: FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving

arxiv url: http://arxiv.org/abs/2509.06261v1
Date: Mon, 08 Sep 2025 00:57:50 GMT
Status: Translation complete
Last updated in system: 2025-09-09 14:07:03.92536
Title: FineServe: Precision-Aware KV Slab and Two-Level Scheduling for Heterogeneous Precision LLM Serving
Title (reference translation): FineServe: Precision-Aware KV Slab and Two-Level Scheduling
Authors: Kyungmin Bin, Seungbeom Choi, Jimyoung Son, Jieun Choi, Daseul Bae, Daehyeon Baek, Kihyo Moon, Minsung Jang, Hyojung Lee
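Abstract summary: FineServe is an inference serving framework for mixed-precision large language models. It achieves up to 2.2x higher SLO attainment and 1.8x higher token generation throughput than state-of-the-art GPU sharing systems.
Reference score (independently computed attention score): 2.141726730716452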
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Recent advances in Post-Training Quantization (PTQ) techniques have significantly increased demand for serving quantized large language models (LLMs), enabling higher throughput and substantially reduced memory usage with minimal accuracy loss. Quantized models address memory constraints in LLMs and enhance GPU resource utilization through efficient GPU sharing. However, quantized models have smaller KV block sizes than non-quantized models, causing limited memory efficiency due to memory fragmentation. Also, distinct resource usage patterns between quantized and non-quantized models require efficient scheduling to maximize throughput. To address these challenges, we propose FineServe, an inference serving framework for mixed-precision LLMs. FineServe's key contributions include: (1) KV Slab, a precision-aware adaptive memory management technique dynamically allocating KV cache based on model quantization characteristics, significantly reducing GPU memory fragmentation, and (2) a two-level scheduling framework comprising a global scheduler that places models to GPUs based on request rates, latency SLOs, and memory constraints and efficiency, and a local scheduler that adaptively adjusts batch sizes according to real-time request fluctuations. Experimental results demonstrate that FineServe achieves up to 2.2x higher SLO attainment and 1.8x higher token generation throughput compared to the state-of-the-art GPU sharing systems.
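Abstract (reference translation): Recent advances in Post-Training Quantization (PTQ) techniques have significantly increased demand for serving quantized large language models (LLMs), enabling higher throughput and reduced memory usage. Quantized models address memory constraints in LLMs and improve GPU resource utilization through efficient GPU sharing. However, quantized models have smaller KV block sizes than non-quantized models, and the resulting memory fragmentation limits memory efficiency. In addition, the distinct resource usage patterns of quantized and non-quantized models call for efficient scheduling to maximize throughput. To address these challenges, we propose FineServe, an inference serving framework for mixed-precision LLMs. FineServe's main contributions are (1) KV Slab, a precision-aware adaptive memory management technique that dynamically allocates KV cache based on model quantization characteristics, and (2) a two-level scheduler consisting of a global scheduler that places models on GPUs based on request rates, latency SLOs, and memory constraints and efficiency, and a local scheduler that adaptively adjusts batch sizes in response to real-time request fluctuations. Experiments show that FineServe achieves up to 2.2x higher SLO attainment and 1.8x higher token generation throughput than state-of-the-art GPU sharing systems.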
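
The fragmentation issue described in the abstract can be illustrated with some back-of-the-envelope arithmetic. The sketch below is not FineServe's KV Slab implementation. It assumes a paged KV cache in which each block holds keys and values for a fixed number of tokens across all layers, and a shared pool carved into fixed-size pages. The function names, model shapes, and the 2 MiB page size are all hypothetical.

```python
def kv_block_bytes(num_layers, num_kv_heads, head_dim, block_tokens, bits_per_elem):
    """Size in bytes of one KV block (assumed layout, not taken from the paper)."""
    # Factor 2: one key tensor and one value tensor per layer.
    elems = 2 * num_layers * num_kv_heads * head_dim * block_tokens
    return elems * bits_per_elem // 8


def page_waste_bytes(page_bytes, block_bytes):
    """Bytes left unused when one fixed-size page is carved into equal blocks."""
    if block_bytes > page_bytes:
        raise ValueError("block does not fit in a single page")
    return page_bytes % block_bytes


PAGE = 2 * 1024 * 1024  # hypothetical 2 MiB page in a shared GPU memory pool

# Hypothetical model shapes; only the KV element width differs within each pair.
shapes = {
    "model_a_16bit": (32, 8, 128, 16, 16),
    "model_a_8bit": (32, 8, 128, 16, 8),
    "model_b_16bit": (28, 4, 128, 16, 16),
    "model_b_8bit": (28, 4, 128, 16, 8),
}

for name, shape in shapes.items():
    block = kv_block_bytes(*shape)
    waste = page_waste_bytes(PAGE, block)
    print(f"{name}: block={block} B, blocks/page={PAGE // block}, waste={waste} B")
```

Under these assumptions, halving the element width halves the block size, and any block size that does not divide the page size evenly leaves unusable space in every page. This is the kind of mismatch that a precision-aware allocator would need to account for when models of different precision share one GPU.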


Related papers