Fugu-MT Paper Translation (Summary): QUOKA: Query-Oriented KV Selection For Efficient LLM Prefill

Paper summary: QUOKA: Query-Oriented KV Selection For Efficient LLM Prefill

arxiv url: http://arxiv.org/abs/2602.08722v1
Date: Mon, 09 Feb 2026 14:32:26 GMT
Status: Translation complete
Last updated in system: 2026-02-10 20:26:25.290985
Title: QUOKA: Query-Oriented KV Selection For Efficient LLM Prefill
Title (reference translation): QUOKA: Query-Oriented KV Selection for Efficient LLM Prefill
Authors: Dalton Jones, Junyoung Park, Matthew Morse, Mingu Lee, Chris Lott, Harper Langston
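Abstract summary: The paper presents QUOKA, a query-oriented KV selection method for efficient attention. The results show that QUOKA achieves near-baseline accuracy while using 88% fewer key-value pairs per attention evaluation.
Reference score (attention score computed independently by this site): 5.014026212750645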
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present QUOKA: Query-oriented KV selection for efficient attention, a training-free and hardware-agnostic sparse attention algorithm for accelerating transformer inference under chunked prefill. While many queries focus on a smaller group of keys in the attention operator, we observe that queries with low cosine similarity with respect to the mean query interact more strongly with more keys and have the greatest contribution to final attention logits. By prioritizing these low cosine similarity queries, the behavior of full attention during the prefill stage can be closely approximated. QUOKA leverages this observation, accelerating attention by (1) first retaining a small set of representative queries and (2) then subselecting the keys most aligned with those queries. Through experiments on Needle-In-A-Haystack, LongBench, RULER, and Math500, we show that, while realizing a 3x reduction in time-to-first-token, a 5x speedup in attention on Nvidia GPUs, and up to nearly a 7x speedup on Intel Xeon CPUs, QUOKA achieves near-baseline accuracy, utilizing 88% fewer key-value pairs per attention evaluation.
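Abstract (reference translation): We propose QUOKA, query-oriented KV selection for efficient attention: a training-free, hardware-agnostic sparse attention algorithm that speeds up transformer inference under chunked prefill. Many queries focus on a small group of keys in the attention operator, but queries with low cosine similarity to the mean query interact more strongly with more keys and contribute the most to the final attention logits. Prioritizing these low-cosine-similarity queries makes it possible to approximate the behavior of full attention during the prefill stage. QUOKA uses this observation to accelerate attention by (1) first retaining a small number of representative queries and (2) then subselecting the keys best aligned with those queries. Experiments on Needle-In-A-Haystack, LongBench, RULER, and Math500 show a 3x reduction in time-to-first-token, a 5x attention speedup on Nvidia GPUs, and up to nearly a 7x speedup on Intel Xeon CPUs, while QUOKA achieves near-baseline accuracy with 88% fewer key-value pairs per attention evaluation.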
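
The two steps named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the max-over-queries rule for scoring keys, the fixed budgets `num_queries` and `num_keys`, and the omission of causal masking and chunking are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot(a, b) / (na * nb)

def select_queries(queries, num_queries):
    """Step 1: keep the queries with the LOWEST cosine similarity to the mean query."""
    dim = len(queries[0])
    mean_q = [sum(q[d] for q in queries) / len(queries) for d in range(dim)]
    order = sorted(range(len(queries)), key=lambda i: cosine(queries[i], mean_q))
    return sorted(order[:num_queries])

def select_keys(queries, keys, query_idx, num_keys):
    """Step 2: keep the keys best aligned with the retained queries.

    Assumption: a key's score is its maximum dot product over the retained
    queries; the paper may aggregate differently.
    """
    scores = [max(dot(queries[i], k) for i in query_idx) for k in keys]
    order = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:num_keys])

def sparse_attention(queries, keys, values, key_idx):
    """Softmax attention for every query, restricted to the selected keys."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        logits = [dot(q, keys[j]) * scale for j in key_idx]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - peak) for l in logits]
        total = sum(weights)
        out.append([
            sum(w * values[j][d] for w, j in zip(weights, key_idx)) / total
            for d in range(len(values[0]))
        ])
    return out

# Toy data: three queries point roughly the same way, one is an outlier.
queries = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [-0.2, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.1, 0.9]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]

kept_queries = select_queries(queries, num_queries=1)              # [3]
kept_keys = select_keys(queries, keys, kept_queries, num_keys=2)   # [1, 3]
outputs = sparse_attention(queries, keys, values, kept_keys)
```

In this toy case the outlier query (index 3) is the one retained, and the two keys it aligns with most are kept, so each query attends to half of the keys. The budgets here are arbitrary; the 88% reduction reported in the abstract would correspond to keeping roughly 12% of the key-value pairs.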

Related papers