Fugu-MT Paper Translation (Abstract): 98$\times$ Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Paper overview: 98$\times$ Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

arxiv url: http://arxiv.org/abs/2603.12646v1
Date: Fri, 13 Mar 2026 04:33:53 GMT
Status: Translation complete
System update date: 2026-03-16 17:38:11.905254
Title: 98$\times$ Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router
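Title (reference translation): 98$\times$ Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router
Authors: Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen
Abstract summary: This paper describes three staged optimizations for the vLLM Semantic Router. A custom Flash Attention operator for ONNX on ROCm reduces attention memory from $O(n^2)$ to $O(n)$. Near-streaming body processing with adaptive chunking eliminates serialization overhead.
Reference score (independently computed attention score): 9.457255218406333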
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight: they should add minimal latency to every request, yet not require a dedicated GPU -- an expensive resource better used for LLM inference itself. When the router co-locates on the same GPU as vLLM serving instances, standard attention's $O(n^2)$ memory makes long-context classification (8K--32K tokens) impossible: at 8K tokens, three concurrent classifiers need ${\sim}$4.5\,GB for attention masks alone, far exceeding the memory left by vLLM. We present three staged optimizations for the vLLM Semantic Router, benchmarked on AMD Instinct MI300X, that solve both the latency and the memory problem. \emph{Stage~1}: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from $O(n^2)$ to $O(n)$ and end-to-end (E2E) latency from 4{,}918\,ms to 127\,ms (\textbf{38.7$\times$}), enabling 8K--32K tokens where SDPA OOMs. \emph{Stage~2}: classical NLP prompt compression (TextRank, position weighting, TF-IDF, and novelty scoring) reduces all inputs to ${\sim}$512 tokens without neural inference, capping both latency and GPU memory at a constant regardless of original prompt length (E2E 127$\to$62\,ms, \textbf{2.0$\times$}). \emph{Stage~3}: near-streaming body processing with adaptive chunking and zero-copy JSON eliminates serialization overhead (E2E 62$\to$50\,ms, \textbf{1.2$\times$}). Cumulatively: \textbf{98$\times$} improvement (4{,}918\,ms to 50\,ms), 16K-token routing in 108\,ms, and a total router GPU footprint under 800\,MB -- small enough to share a GPU with LLM serving and removing the need for a dedicated accelerator. Stage~1 targets AMD ROCm (NVIDIA GPUs already have FlashAttention via cuDNN); Stages~2 and~3 are hardware-agnostic.
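The figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below recomputes the per-stage and cumulative speedups from the stated latencies, and shows how an $O(n^2)$ attention buffer grows with sequence length. The names `stage_speedups` and `quadratic_attention_bytes` are hypothetical helpers, and the head count (12) and element width (2 bytes) are assumptions chosen for illustration only; the abstract does not state the classifier architecture.

```python
# Latencies in milliseconds, taken from the abstract.
BASELINE_MS = 4918
STAGE_MS = {"Stage 1": 127, "Stage 2": 62, "Stage 3": 50}


def stage_speedups(baseline, stages):
    """Return per-stage speedup ratios and the cumulative ratio."""
    ratios = {}
    previous = baseline
    for name, latency in stages.items():
        ratios[name] = previous / latency
        previous = latency
    return ratios, baseline / previous


def quadratic_attention_bytes(n_tokens, heads, bytes_per_elem, classifiers):
    """Size of an n x n buffer per head, per classifier (O(n^2) in n_tokens).

    heads and bytes_per_elem are illustrative assumptions, not values
    reported in the abstract.
    """
    return n_tokens * n_tokens * heads * bytes_per_elem * classifiers


ratios, total = stage_speedups(BASELINE_MS, STAGE_MS)
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x")
print(f"Cumulative: {total:.0f}x")

# Assumed: 12 heads, 2-byte elements, 3 concurrent classifiers.
gib = quadratic_attention_bytes(8192, 12, 2, 3) / 2**30
print(f"8K tokens, assumed configuration: {gib:.1f} GiB")
```

Under these assumptions the quadratic buffer at 8K tokens lands in the same range as the ${\sim}$4.5\,GB quoted above, and doubling the sequence length quadruples it, which is why a linear-memory attention kernel matters for 16K and 32K inputs.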
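Abstract (reference translation): System-level routers that intercept LLM requests for safety classification, domain routing, and PII detection must be both fast and operationally lightweight. When the router is co-located on the same GPU as vLLM serving instances, the $O(n^2)$ memory of standard attention makes long-context classification (8K--32K tokens) impossible. We present three staged optimizations for the vLLM Semantic Router, benchmarked on AMD Instinct MI300X, that solve both the latency and the memory problem. \emph{Stage~1}: a custom CK Flash Attention operator for ONNX Runtime on ROCm reduces attention memory from $O(n^2)$ to $O(n)$ and end-to-end (E2E) latency from 4{,}918\,ms to 127\,ms (\textbf{38.7$\times$}). \emph{Stage~2}: classical NLP prompt compression (TextRank, position weighting, TF-IDF, and novelty scoring) reduces all inputs to ${\sim}$512 tokens without neural inference, capping both latency and GPU memory at a constant regardless of original prompt length (E2E 127$\to$62\,ms, \textbf{2.0$\times$}). \emph{Stage~3}: near-streaming body processing with adaptive chunking and zero-copy JSON eliminates serialization overhead (E2E 62$\to$50\,ms, \textbf{1.2$\times$}). Cumulatively: \textbf{98$\times$} improvement (4{,}918\,ms to 50\,ms), 16K-token routing in 108\,ms, and a total router GPU footprint under 800\,MB -- small enough to share a GPU with LLM serving, removing the need for a dedicated accelerator. Stage~1 targets AMD ROCm (NVIDIA GPUs already have FlashAttention via cuDNN); Stages~2 and~3 are hardware-agnostic.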
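To make the Stage 2 idea concrete, here is a minimal sketch of extractive prompt compression without neural inference. It is not the paper's implementation: it combines only a TF-IDF term, a position weight, and a novelty term, omits TextRank entirely, and counts regex word tokens rather than model tokens. The function names `tokenize` and `compress`, the weights, and the scoring formulas are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a stand-in for a real model tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def compress(prompt, budget=512, w_tfidf=1.0, w_pos=0.5, w_novel=0.5):
    """Greedily keep the highest-scoring sentences within a token budget."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", prompt) if s.strip()]
    n = len(sentences)
    if n == 0:
        return ""
    tokens = [tokenize(s) for s in sentences]
    # Document frequency: number of sentences containing each term.
    df = Counter(t for toks in tokens for t in set(toks))

    def tfidf(toks):
        if not toks:
            return 0.0
        tf = Counter(toks)
        return sum(c * math.log(1 + n / df[t]) for t, c in tf.items()) / len(toks)

    def position(i):
        # Assumed weighting: favor sentences near the start and the end.
        return 1.0 if n == 1 else abs(2 * i / (n - 1) - 1)

    base = [w_tfidf * tfidf(tokens[i]) + w_pos * position(i) for i in range(n)]
    chosen, seen, used = [], set(), 0
    remaining = set(range(n))
    while remaining:
        def score(i):
            toks = set(tokens[i])
            # Novelty: share of terms not yet covered by chosen sentences.
            novelty = len(toks - seen) / len(toks) if toks else 0.0
            return base[i] + w_novel * novelty

        best = max(sorted(remaining), key=score)
        remaining.discard(best)
        cost = len(tokens[best])
        if used + cost > budget:
            continue  # Too long for the remaining budget; try the others.
        chosen.append(best)
        seen.update(tokens[best])
        used += cost
    # Emit the kept sentences in their original order.
    return " ".join(sentences[i] for i in sorted(chosen))


sample = "Alpha beta gamma. Alpha beta gamma. Delta epsilon zeta eta. Final note here."
print(compress(sample, budget=7))
```

Because the output never exceeds `budget` tokens, the downstream classifier sees a bounded input regardless of the original prompt length, which is the property the abstract credits for capping latency and GPU memory.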
