Fugu-MT paper translation (abstract): Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference

Paper abstract: Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference

arxiv url: http://arxiv.org/abs/2605.06046v1
Date: Thu, 07 May 2026 11:34:10 GMT
Status: translation complete
Last updated in system: 2026-05-08 22:27:11.726281
Title: Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference
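Title (reference translation): Requests of a Feather Must Flock Together: Batch Size vs. Prefix Homogeneity in LLM Inference
Authors: Saksham Rathi, Preeti, Mythili Vutukuru
Abstract summary: Auto-regressive token generation in large language models requires "attending to" the key and value tensors (KV cache) of all previous tokens. Prior work has aimed to improve the efficiency of this decode process by batching multiple requests together. The paper introduces Chunked Hash Tree (CHT), a lightweight data structure that enables fast prefix detection and efficient request selection.
Reference score (independently computed attention score): 2.752817022620644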
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Auto-regressive token generation in large language models is memory-bound because it requires "attending to" key and value tensors (KV cache) of all previous tokens. Prior work aims to improve the efficiency of this decode process by batching multiple requests together, and maximizing batch size subject to GPU memory constraints. The key observation of our work is that with prefix-sharing workloads, smaller, prefix-homogeneous batches -- where all requests share a common prefix -- can achieve higher decode throughput than larger, heterogeneous batches, due to better spatial and temporal locality during KV cache accesses. However, prefix-aware schedulers in state-of-the-art inference engines maximize prefix reuse within a batch only to reduce KV cache memory footprint, but do not stop batch formation at smaller homogeneous batches that could have performed better. Further, we show that shared prefix detection in existing schedulers relies on radix-tree traversals, incurring substantial CPU overhead that is often comparable to GPU execution time. This paper presents Feather, a prefix-aware scheduler that uses reinforcement learning (RL) to learn the optimal tradeoff between batch size and prefix homogeneity. We also introduce Chunked Hash Tree (CHT), a lightweight data structure that enables fast prefix detection and efficient request selection for the RL scheduler, avoiding expensive tree traversals. We integrate Feather into vLLM and SGLang, and our evaluation shows that Feather achieves 2--10$\times$ higher end-to-end throughput as compared to existing schedulers, while doing no worse than the status quo when the workload does not have enough prefix sharing. Feather achieves these gains by reducing the total number of KV cache accesses, surpassing the performance of prefix-aware attention kernels that have the same goal.
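Abstract (reference translation): Auto-regressive token generation in large language models is memory-bound because it requires attending to the key and value tensors (KV cache) of all previous tokens. Prior work has aimed to improve the efficiency of this decode process by batching multiple requests together and maximizing batch size subject to GPU memory constraints. The key observation of this work is that, for prefix-sharing workloads, smaller prefix-homogeneous batches, in which all requests share a common prefix, can achieve higher decode throughput than larger heterogeneous batches, thanks to better spatial and temporal locality during KV cache accesses. However, prefix-aware schedulers in state-of-the-art inference engines maximize prefix reuse within a batch only to reduce the KV cache memory footprint, and do not stop batch formation at smaller homogeneous batches that could have performed better. The authors further show that shared prefix detection in existing schedulers relies on radix-tree traversals, incurring substantial CPU overhead that is often comparable to GPU execution time. The paper presents Feather, a prefix-aware scheduler that uses reinforcement learning (RL) to learn the optimal tradeoff between batch size and prefix homogeneity. It also introduces Chunked Hash Tree (CHT), a lightweight data structure that enables fast prefix detection and efficient request selection for the RL scheduler while avoiding expensive tree traversals. Feather is integrated into vLLM and SGLang, and the evaluation shows that it achieves 2--10$\times$ higher end-to-end throughput than existing schedulers, while doing no worse than the status quo when the workload does not have enough prefix sharing. Feather achieves these gains by reducing the total number of KV cache accesses, surpassing the performance of prefix-aware attention kernels that have the same goal.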
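
The abstract does not describe the internals of CHT, so the following is only a rough illustration of the general idea it points at: hashing fixed-size token chunks so that requests sharing a prefix can be found by comparing hashes instead of walking a radix tree token by token. The chunk size, the function names (`chunk_hashes`, `group_by_prefix`), and the choice of SHA-256 are all assumptions made for this sketch and are not taken from the paper.

```python
import hashlib
from collections import defaultdict

CHUNK = 4  # hypothetical chunk size in tokens (assumption, not from the paper)


def chunk_hashes(tokens, chunk=CHUNK):
    """Return one cumulative hash per full chunk.

    The i-th hash covers tokens[0:(i + 1) * chunk], so two requests have the
    same i-th hash only if their first (i + 1) chunks are identical
    (ignoring hash collisions). Trailing tokens that do not fill a chunk
    are not hashed.
    """
    hashes = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % chunk
    for start in range(0, full, chunk):
        running.update(repr(tuple(tokens[start:start + chunk])).encode())
        hashes.append(running.hexdigest())
    return hashes


def group_by_prefix(requests, depth, chunk=CHUNK):
    """Group request ids that share their first `depth` chunks.

    `requests` maps a request id to its list of token ids. Requests with
    fewer than `depth` full chunks are placed in singleton groups.
    """
    groups = defaultdict(list)
    for rid, tokens in requests.items():
        hashes = chunk_hashes(tokens, chunk)
        if len(hashes) >= depth:
            key = hashes[depth - 1]
        else:
            key = ("short", rid)  # too short to share a prefix at this depth
        groups[key].append(rid)
    return list(groups.values())


if __name__ == "__main__":
    requests = {
        "a": [1, 2, 3, 4, 5, 6, 7, 8, 9],
        "b": [1, 2, 3, 4, 5, 6, 7, 8, 10, 11],
        "c": [1, 2, 3, 4, 9, 9, 9, 9],
        "d": [7, 7, 7, 7, 1, 2, 3, 4],
    }
    # Candidate prefix-homogeneous batches at two prefix depths.
    print(group_by_prefix(requests, depth=1))
    print(group_by_prefix(requests, depth=2))
```

Each group returned here is a candidate prefix-homogeneous batch; a scheduler would still have to decide how deep a shared prefix to require and how large a batch to form, which is the tradeoff the abstract says Feather learns with RL.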

Related papers