Fugu-MT Paper Translation (Abstract): How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

Paper abstract: How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

arxiv url: http://arxiv.org/abs/2604.17935v1
Date: Mon, 20 Apr 2026 08:15:17 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.759651
Title: How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers
Title (reference translation): How Much Cache Is Needed? Depth-Cache Tradeoffs in KV-Compressed Transformers
Authors: Xiao Wang
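Abstract summary: The key-value (KV) cache is the dominant memory bottleneck during Transformer inference. This paper examines how aggressively the cache can be compressed before multi-step reasoning degrades.
Reference score (independently computed measure of attention): 5.705685936981751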
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The key-value (KV) cache is the dominant memory bottleneck during Transformer inference, yet little is known theoretically about how aggressively it can be compressed before multi-step reasoning degrades. We study this through $k$-hop pointer chasing on $n$ tokens under a shared KV cache of size $s$, attention dimension $m$, $H$ heads, $p$-bit precision, and a locality-respecting cache controller (satisfied by all standard KV-compression methods). We give three results. (1) Product depth lower bound (conjectured). We conjecture that any such Transformer ($n \geq 4k$, $s \leq \sqrt{n}/4$) requires depth $L = Ω(\lceil k/s \rceil \cdot \lceil \log_2 n/(Hmp) \rceil)$, and isolate the sole remaining gap as a probabilistic step on the joint distribution of cache trace and pointer chain. Unconditionally, we prove a matching upper bound $L = O(\min(k, \lceil k/s \rceil \log s) \cdot \log n/(mp))$ via windowed pointer doubling, and a max-bound $L = Ω(\max(\lceil k/s \rceil, \log n/(Hmp)))$. Closing the conjecture amounts to upgrading max to product. (2) Bandwidth barrier. The product bound binds only when $Hmp \lesssim \log n$. Any lower bound provable via per-window distinguishability counting -- including reachability, bandwidth, and combinations -- cannot exceed $\lceil k/s \rceil$ once $Hmp \geq \log_2 n$. Breaking this requires lifting unconditional communication-complexity bounds for pointer chasing to Cache-Transformer depth. (3) Adaptive vs oblivious error scaling. Under random cache over $T = \lceil \log_2 k \rceil$ doubling stages, oblivious caches give $\Pr[\mathcal{E}] \leq (s/(n-T))^T + 2T^3/n$ (exponential in $T$), while adaptive locality-respecting caches achieve $\Pr[\mathcal{E}] = s/n$ exactly, independent of $T$. The $Ω((n/s)^{T-1})$ separation explains why heavy-hitter eviction empirically dominates random eviction for multi-hop reasoning.
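The abstract's task, $k$-hop pointer chasing, and the pointer-doubling idea behind its upper bound can be illustrated with a minimal Python sketch. This is plain, uncached pointer doubling on an array, not the paper's windowed construction or a Transformer implementation; the function names `chase_naive` and `chase_doubling` and the list representation of the pointer map are assumptions made for illustration.

```python
def chase_naive(f, start, k):
    """Follow the pointer map f for k hops, one hop per step."""
    x = start
    for _ in range(k):
        x = f[x]
    return x


def chase_doubling(f, start, k):
    """Compute the k-hop target by repeated squaring of the pointer map.

    Returns (target, squarings), where squarings counts how many times
    the map was composed with itself. Illustrative only: no cache limit
    is modeled here.
    """
    power = list(f)  # invariant: power is f composed with itself 2^j times
    x = start
    squarings = 0
    while k > 0:
        if k & 1:
            x = power[x]  # apply the current power of f for this set bit
        k >>= 1
        if k:
            # Double the hop length: new_power[i] = power[power[i]]
            power = [power[power[i]] for i in range(len(power))]
            squarings += 1
    return x, squarings


if __name__ == "__main__":
    f = [3, 0, 4, 1, 2]  # hypothetical pointer map on n = 5 positions
    print(chase_naive(f, 0, 5), chase_doubling(f, 0, 5))
```

The naive version needs $k$ sequential steps, while the doubling version needs one squaring per binary digit of $k$ after the first, which is at most $\lceil \log_2 k \rceil$. Each squaring touches the whole map, which hints at why a cache of size $s < n$ makes doubling harder.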
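Abstract (reference translation): The key-value (KV) cache is the dominant memory bottleneck in Transformer inference. We study $k$-hop pointer chasing on $n$ tokens under a shared KV cache of size $s$, attention dimension $m$, $H$ heads, $p$-bit precision, and a locality-respecting cache controller (a condition satisfied by all standard KV-compression methods). We obtain three results. (1) Product depth lower bound (conjectured). Any such Transformer ($n \geq 4k$, $s \leq \sqrt{n}/4$) is conjectured to require depth $L = Ω(\lceil k/s \rceil \cdot \lceil \log_2 n/(Hmp) \rceil)$, and the sole remaining gap is isolated as a probabilistic step on the joint distribution of the cache trace and the pointer chain. Unconditionally, we prove a matching upper bound $L = O(\min(k, \lceil k/s \rceil \log s) \cdot \log n/(mp))$ via windowed pointer doubling, and a max-bound $L = Ω(\max(\lceil k/s \rceil, \log n/(Hmp)))$. Closing the conjecture amounts to upgrading the max to a product. (2) Bandwidth barrier. The product bound binds only when $Hmp \lesssim \log n$. Any lower bound provable via per-window distinguishability counting (including reachability, bandwidth, and combinations) cannot exceed $\lceil k/s \rceil$ once $Hmp \geq \log_2 n$. Breaking this barrier requires lifting unconditional communication-complexity bounds for pointer chasing to Cache-Transformer depth. (3) Adaptive vs. oblivious error scaling. Under random cache over $T = \lceil \log_2 k \rceil$ doubling stages, oblivious caches give $\Pr[\mathcal{E}] \leq (s/(n-T))^T + 2T^3/n$ (exponential in $T$), while adaptive locality-respecting caches achieve $\Pr[\mathcal{E}] = s/n$ exactly, independent of $T$. The $Ω((n/s)^{T-1})$ separation explains why heavy-hitter eviction empirically dominates random eviction for multi-hop reasoning.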
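To get a feel for the quantities in result (3), the sketch below evaluates the stated expressions numerically. It only plugs numbers into the formulas quoted in the abstract; the event $\mathcal{E}$ is not defined there, so no interpretation of it is assumed. The function names and the sample values of $n$, $s$, and $k$ are hypothetical and chosen only to satisfy $n \geq 4k$ and $s \leq \sqrt{n}/4$.

```python
from math import ceil, log2


def doubling_stages(k):
    """T = ceil(log2 k), the number of doubling stages in the abstract."""
    return ceil(log2(k))


def oblivious_bound(n, s, k):
    """Upper bound quoted for oblivious caches: (s/(n-T))^T + 2T^3/n."""
    T = doubling_stages(k)
    return (s / (n - T)) ** T + 2 * T**3 / n


def adaptive_value(n, s):
    """Exact value quoted for adaptive locality-respecting caches: s/n."""
    return s / n


def separation(n, s, k):
    """The (n/s)^(T-1) factor named in the abstract as the separation."""
    T = doubling_stages(k)
    return (n / s) ** (T - 1)


if __name__ == "__main__":
    n, s, k = 2**20, 2**8, 16  # hypothetical sizes; s equals sqrt(n)/4 here
    print("T           =", doubling_stages(k))
    print("oblivious   <=", oblivious_bound(n, s, k))
    print("adaptive    =", adaptive_value(n, s))
    print("(n/s)^(T-1) =", separation(n, s, k))
```

Varying `k` shows how the first term of the oblivious bound shrinks geometrically with $T$ while the adaptive value stays fixed at $s/n$.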

Related papers