Fugu-MT Paper Translation (Abstract): Tensor Cache: Eviction-conditioned Associative Memory for Transformers

Paper overview: Tensor Cache: Eviction-conditioned Associative Memory for Transformers

arxiv url: http://arxiv.org/abs/2605.22884v1
Date: Thu, 21 May 2026 00:21:51 GMT
Status: Translation complete
Last updated in system: 2026-05-25 17:29:20.021402
Title: Tensor Cache: Eviction-conditioned Associative Memory for Transformers
Title (reference translation): Tensor Cache: Eviction-conditioned Associative Memory for Transformers
Authors: Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba
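Abstract summary: Tensor Cache pairs sliding-window softmax attention as a first-level cache (L1) with a fixed-size outer-product fast-weight memory as a second-level cache (L2) fed by the KV pairs evicted from the window. Evicted pairs are compressed into a per-layer matrix $A$ and read by future queries through a single matrix multiplication.
Reference score (independently computed attention metric): 20.67103891489219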
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive Transformer KV caches grow linearly with context length; sliding-window caching bounds memory but discards evicted tokens entirely, so relevant evidence outside the window becomes inaccessible. We introduce \emph{Tensor Cache}, a two-level cache that pairs sliding-window softmax attention as a first-level cache (L1) with a fixed-size outer-product fast-weight memory as a second-level cache (L2) fed by KV pairs evicted from the window. Recent tokens remain in exact local attention; evicted pairs are compressed into a per-layer matrix $A$ and read by future queries through a single matrix multiplication, exploiting the linear-attention identity $q_t(k_i \otimes v_i)=\langle q_t,k_i\rangle v_i$. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end. The outer-product memory and the read identity are well-known; our contribution is their use as an L2 cache fed exclusively by sliding-window evictions, plus identifying that the common chunked-mean training shortcut $A\!\leftarrow\!λA\!+\!η(\bar k\!\otimes\!\bar v)$ silently introduces $C^2{-}C$ spurious cross-token outer products per chunk, and closing the gap with a parallel weighted-sum scan equivalent to per-token writes within float32 epsilon. Across systems scaling, controlled associative recall, long-context language modeling, and memory-capacity diagnostics, Tensor Cache improves the memory--quality frontier over bounded-state baselines.
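Abstract (reference translation): The KV cache of an autoregressive Transformer grows linearly with context length. Sliding-window caching bounds memory but discards evicted tokens completely, so relevant evidence outside the window can no longer be accessed. The paper introduces Tensor Cache, a two-level cache that pairs sliding-window softmax attention as the first-level cache (L1) with a fixed-size outer-product fast-weight memory as the second-level cache (L2), which is fed by the KV pairs evicted from the window. Recent tokens stay in exact local attention; evicted pairs are compressed into a per-layer matrix $A$ and read by future queries through a single matrix multiplication, using the linear-attention identity $q_t(k_i \otimes v_i)=\langle q_t,k_i\rangle v_i$. A learned scalar gate fuses the L1 and L2 outputs, and the per-head decay and write-rate parameters are trained end-to-end. The outer-product memory and the read identity are well known. The authors' contribution is using them as an L2 cache fed exclusively by sliding-window evictions, identifying that the common chunked-mean training shortcut $A\!\leftarrow\!λA\!+\!η(\bar k\!\otimes\!\bar v)$ silently introduces $C^2{-}C$ spurious cross-token outer products per chunk, and closing that gap with a parallel weighted-sum scan that is equivalent to per-token writes within float32 epsilon. Across systems scaling, controlled associative recall, long-context language modeling, and memory-capacity diagnostics, Tensor Cache improves the memory-quality frontier over bounded-state baselines.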
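The write and read rules named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the scalar `lam` and `eta` (the paper trains them per head), and the closed-form `weighted_sum_write` are assumptions made for illustration. The closed form comes from unrolling the per-token rule $A\leftarrow λA+η(k_i\otimes v_i)$ over a chunk of $C$ evicted pairs, which gives $λ^C A + η\sum_i λ^{C-1-i}(k_i\otimes v_i)$; the paper's parallel scan may be organized differently. The sliding window, the L1 attention, and the learned gate are left out.

```python
import numpy as np

def per_token_writes(A, K, V, lam, eta):
    """Sequential reference: one decayed outer-product write per evicted pair."""
    for k, v in zip(K, V):
        A = lam * A + eta * np.outer(k, v)
    return A

def chunked_mean_write(A, K, V, lam, eta):
    """Chunked-mean shortcut: a single write of mean(k) outer mean(v) per chunk."""
    return lam * A + eta * np.outer(K.mean(axis=0), V.mean(axis=0))

def weighted_sum_write(A, K, V, lam, eta):
    """Assumed closed form of C per-token writes (hypothetical helper).

    Token i (0-based) in a chunk of C tokens is decayed C-1-i more times
    after it is written, so it carries weight lam ** (C - 1 - i).
    """
    C = K.shape[0]
    w = lam ** np.arange(C - 1, -1, -1)
    return lam ** C * A + eta * (K * w[:, None]).T @ V

def cross_term_count(C):
    """mean(k) outer mean(v) expands into C*C pairs (k_i, v_j); only C have i == j."""
    return C * C - C

rng = np.random.default_rng(0)
C, d_k, d_v = 8, 16, 16
K = rng.standard_normal((C, d_k))   # evicted keys, one row per token
V = rng.standard_normal((C, d_v))   # evicted values, one row per token
A0 = np.zeros((d_k, d_v))
lam, eta = 0.95, 0.5

A_seq = per_token_writes(A0, K, V, lam, eta)
A_sum = weighted_sum_write(A0, K, V, lam, eta)
A_mean = chunked_mean_write(A0, K, V, lam, eta)

# Read path: a query reads the memory with one matrix multiplication.
q = rng.standard_normal(d_k)
read = q @ A_seq

print("weighted sum matches per-token writes:", np.allclose(A_seq, A_sum))
print("chunked mean matches per-token writes:", np.allclose(A_seq, A_mean))
print("spurious cross terms in this chunk:", cross_term_count(C))
```

The sketch runs in NumPy's default float64, so the agreement between `per_token_writes` and `weighted_sum_write` is tighter than the float32 epsilon the abstract refers to. The chunked-mean result differs because it mixes every key in the chunk with every value, not only with its own.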

Related papers