Fugu-MT Paper Translation (Abstract): Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

Paper overview: Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

arxiv url: http://arxiv.org/abs/2604.22782v1
Date: Fri, 03 Apr 2026 14:56:17 GMT
Status: Translation complete
Last updated in system: 2026-05-04 02:32:14.17705
Title: Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
Title (reference translation): Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
Authors: Anastasiia Filippova, David Grangier, Marco Cuturi, João Monteiro
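Abstract summary: Running transformer language models at high throughput requires caching key-values (KVs) to avoid redundant computation. The memory footprint of the KV cache is significant and heavily impacts serving costs. This paper proposes random cross-layer attention.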
Reference score (independently computed attention level): 29.913403615975174
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Serving transformer language models with high throughput requires caching Key-Values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of KV caching is significant and heavily impacts serving costs. This work proposes to lessen these memory requirements. While recent work has largely addressed KV cache reduction via compression and eviction along the temporal axis, we argue that the depth dimension offers an orthogonal and robust avenue for optimization. Although prior research suggests that a full cache for every layer is redundant, implementing cross-layer cache sharing remains a practical challenge; existing methods typically suffer from reduced throughput or increased time-to-first-token. In this paper, we demonstrate that dropping a layer's cache offers efficient optimization without information loss. We propose a simple training approach: random cross-layer attention. During training, layers randomly choose to attend either to their own KV states or those of a preceding layer. This stochastic process adapts the model to be robust to various depth-wise cache sharing strategies, ensuring flexibility for unknown hardware constraints at deployment time. Our evaluations show that applying this scheme during pre-training or fine-tuning enables depth-wise cache sharing for various model families. Furthermore, for larger models in data-constrained settings, this approach is suggestive of a regularization-like effect, frequently preserving or improving performance while significantly reducing the cache's memory footprint.
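Abstract (reference translation): Running transformer language models at high throughput requires caching key-values (KVs) to avoid redundant computation during autoregressive generation. The memory footprint of the KV cache is significant and heavily impacts serving costs. This work proposes to reduce these memory requirements. While recent work has largely addressed KV cache reduction through compression and eviction along the temporal axis, we argue that the depth dimension offers an orthogonal and robust avenue for optimization. Although prior research suggests that a full cache for every layer is redundant, implementing cross-layer cache sharing remains a practical challenge. In this paper, we show that dropping a layer's cache allows efficient optimization without information loss. We propose a simple training approach called random cross-layer attention. During training, each layer randomly chooses to attend either to its own KV states or to those of a preceding layer. This stochastic process makes the model robust to various depth-wise cache sharing strategies, ensuring flexibility under unknown hardware constraints at deployment time. Our evaluations show that applying this scheme during pre-training or fine-tuning enables depth-wise cache sharing for various model families. Furthermore, for larger models in data-constrained settings, the approach suggests a regularization-like effect, frequently preserving or improving performance while significantly reducing the cache's memory footprint.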
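
The random cross-layer attention step described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the preceding layer is chosen or how often sharing is sampled, so the uniform choice, the `share_prob` parameter, and the function names `sample_kv_routing` and `cache_fraction` are all assumptions made for illustration, not the authors' implementation.

```python
import random


def sample_kv_routing(num_layers, share_prob, rng):
    """For each layer, pick the layer whose KV states it attends to.

    Hypothetical sketch: with probability share_prob a layer reuses the KV
    states of a uniformly chosen preceding layer; otherwise it uses its own.
    """
    routing = []
    for layer in range(num_layers):
        if layer > 0 and rng.random() < share_prob:
            # Assumption: any earlier layer is an equally likely source.
            routing.append(rng.randrange(layer))
        else:
            # Layer 0 has no predecessor, so it always uses its own KV states.
            routing.append(layer)
    return routing


def cache_fraction(routing):
    """Fraction of per-layer KV caches that must be kept for a routing.

    Only layers that appear as a source need their KV states stored.
    """
    return len(set(routing)) / len(routing)


if __name__ == "__main__":
    rng = random.Random(0)
    routing = sample_kv_routing(num_layers=8, share_prob=0.5, rng=rng)
    print("KV source per layer:", routing)
    print("fraction of caches kept:", cache_fraction(routing))
```

In this sketch a new routing would be drawn at each training step, while at deployment a fixed routing would be chosen to fit the available memory; `cache_fraction` gives a rough measure of the resulting saving.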
