Fugu-MT Paper Translation (Abstract): KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

Paper abstract: KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

arxiv url: http://arxiv.org/abs/2605.12471v1
Date: Tue, 12 May 2026 17:53:47 GMT
Status: Translation complete
System update date: 2026-05-13 21:48:57.070925
Title: KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
Title (reference translation): KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
Authors: Alireza Nadali, Patrick Cooper, Ashutosh Trivedi, Alvaro Velasquez
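Abstract summary: KV-Fold is a training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward. On Llama-3.1-8B, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens, while remaining within the memory limits of a single 40GB GPU.
Reference score (attention metric computed independently by the site): 9.84177443010824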
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce KV-Fold, a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward; the same one-step update is applied repeatedly, analogous to foldl in functional programming. Building on the KV cache concatenation primitive introduced for latent multi-agent communication, we repurpose it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. Overall, our results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.
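Abstract (reference translation): KV-Fold is a simple, training-free long-context inference protocol that treats the key-value (KV) cache as the accumulator in a left fold over sequence chunks. At each step, the model processes the next chunk conditioned on the accumulated cache, appends the newly produced keys and values, and passes the enlarged cache forward. It builds on the KV cache concatenation primitive introduced for latent multi-agent communication and repurposes it as a chunk-to-chunk recurrence for long-context inference. When processing chunk t, the model attends to the KV cache carried from earlier chunks as a prefix, reusing its internal state across segments without modifying or retraining the model. Despite its simplicity, the induced recurrence is stable: per-step drift rises briefly and then saturates into a flat plateau that persists across deep chains. This plateau is insensitive to a 10,000x change in numerical precision, robust across chunk sizes, and consistent across model families. At the task level, KV-Fold preserves exact information over long distances. On a needle-in-a-haystack benchmark, it achieves 100% exact-match retrieval across 152 trials spanning contexts from 16K to 128K tokens and chain depths up to 511 on Llama-3.1-8B, while remaining within the memory limits of a single 40GB GPU. Compared to streaming methods, which trade fidelity for bounded memory, KV-Fold maintains long-range retrieval while operating as a sequence of tractable forward passes. These results show that frozen pretrained transformers already support a stable form of KV-cache recurrence, providing a practical route to long-context inference without architectural changes or training.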
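
The left-fold structure described in the abstract can be sketched in a few lines of Python. This is a toy illustration only, not the authors' implementation: the names `chunked`, `fold_step`, and `kv_fold` are hypothetical, the "cache" is a plain list of (position, value) pairs rather than real attention keys and values, and the arithmetic inside `fold_step` is an arbitrary stand-in for a model forward pass. It does not model attention, numerical drift, or GPU memory.

```python
from functools import reduce
from typing import List, Sequence, Tuple

# Hypothetical stand-in for a KV cache: one (position, value) pair per token.
KVCache = List[Tuple[int, int]]


def chunked(tokens: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split the token sequence into consecutive chunks of at most chunk_size."""
    return [list(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]


def fold_step(cache: KVCache, chunk: Sequence[int]) -> KVCache:
    """One-step update: process a chunk conditioned on the cache, append new entries.

    Toy assumption: each new value depends on the token and on everything
    already cached, mimicking "attend to the carried cache as a prefix".
    """
    new_cache = list(cache)  # copy so the incoming accumulator is not mutated
    for tok in chunk:
        context = sum(value for _, value in new_cache)  # toy "attention" over the prefix
        new_cache.append((len(new_cache), (tok + context) % 97))
    return new_cache


def kv_fold(tokens: Sequence[int], chunk_size: int) -> KVCache:
    """Left fold over chunks: the cache is the accumulator, fold_step is the update."""
    return reduce(fold_step, chunked(tokens, chunk_size), [])


if __name__ == "__main__":
    cache = kv_fold(list(range(10)), chunk_size=4)
    print(len(cache), "entries cached after folding 3 chunks")
```

`functools.reduce` plays the role of `foldl` here: the empty list is the initial accumulator, and each chunk is folded in with the same one-step update. In a real setting the step would be a forward pass of a frozen transformer that receives the carried cache as its prefix and returns the enlarged cache.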

Related papers