Fugu-MT Paper Translation (Abstract): Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

Paper overview: Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

arxiv url: http://arxiv.org/abs/2605.16579v2
Date: Wed, 20 May 2026 19:35:18 GMT
Status: Translation complete
Last updated in system: 2026-05-22 16:35:41.866409
Title: Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
Authors: Kunyang Li, Mubarak Shah, Yuzhang Shang
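Abstract summary: ARL2 is a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. Self-attention is decomposed into two branches: an intra-frame softmax branch and an inter-frame recurrent linear branch that maintains a fixed-size state for streaming context. With 75% of layers replaced by hybrid linear attention, the model achieves up to a 2.26× wall-clock speedup and a 54% memory reduction.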
Reference score (independently computed attention metric): 61.57938553036056
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to compute that is quadratic in sequence length and to memory usage that grows with the key-value cache, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame gated recurrent linear branch that maintains a fixed-size state for streaming context. Our key insight is that softmax attention captures fine-grained local interactions, while a recurrent state provides controllable long-range memory. This design achieves linear-time scaling with constant memory while improving temporal consistency over the full-softmax model. To prevent noisy intermediate states from corrupting memory, we update the recurrent state only after the denoised pass. To avoid within-frame information asymmetry, all tokens in a frame read the same pre-update state instead of a sequentially updated one. To the best of our knowledge, this is the first work to convert a pretrained AR video diffusion model into a hybrid linear attention architecture, through an efficient two-stage training scheme for AR video. With 75% of layers replaced by hybrid linear attention, the model achieves up to a 2.26× wall-clock speedup and a 54% memory reduction, while maintaining comparable quality with improved temporal consistency.
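The abstract describes the ARL2 mechanism only in words, so the following is a minimal toy sketch of the idea, not the authors' implementation. It shows an intra-frame softmax branch, an inter-frame linear-attention memory of fixed size, a state that is written only once per finished frame, and a read in which every token of a frame sees the same pre-update state. All names (`feature_map`, `read_memory`, `update_memory`, `hybrid_attention`, `generate`), the `elu(x) + 1` feature map, the scalar `gate`, and the fixed `mix` weight used to blend the branches are assumptions made for illustration; the abstract does not specify them. The toy loop also reuses the same queries, keys, and values across denoising steps, whereas a real model would recompute them from the progressively denoised latent.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def feature_map(x):
    # elu(x) + 1: a common positive feature map for linear attention (assumed here).
    return [t + 1.0 if t > 0 else math.exp(t) for t in x]

def intra_frame_softmax(q, k, v):
    # Softmax attention over the tokens of the current frame only.
    scale = 1.0 / math.sqrt(len(q[0]))
    d_v = len(v[0])
    out = []
    for qi in q:
        w = softmax([dot(qi, kj) * scale for kj in k])
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(d_v)])
    return out

def read_memory(q, S, z):
    # Every token reads the same pre-update state S (d_k x d_v) and normalizer z.
    d_v = len(S[0])
    out = []
    for qi in q:
        phi = feature_map(qi)
        denom = dot(phi, z) + 1e-6
        out.append([sum(p * row[c] for p, row in zip(phi, S)) / denom for c in range(d_v)])
    return out

def update_memory(S, z, k, v, gate):
    # Gated write: S <- gate * S + sum_j phi(k_j) v_j^T (scalar gate is an assumption).
    new_S = [[gate * s for s in row] for row in S]
    new_z = [gate * s for s in z]
    for kj, vj in zip(k, v):
        for a, pa in enumerate(feature_map(kj)):
            new_z[a] += pa
            for c, vc in enumerate(vj):
                new_S[a][c] += pa * vc
    return new_S, new_z

def hybrid_attention(q, k, v, S, z, mix=0.5):
    # Fixed-weight blend of the two branches (the blend rule is an assumption).
    local = intra_frame_softmax(q, k, v)
    memory = read_memory(q, S, z)
    return [[(1 - mix) * l + mix * m for l, m in zip(lr, mr)]
            for lr, mr in zip(local, memory)]

def generate(frames, d_k, d_v, gate=0.9, denoise_steps=3):
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    outputs = []
    for q, k, v in frames:
        for _ in range(denoise_steps):
            out = hybrid_attention(q, k, v, S, z)  # noisy passes only read
        S, z = update_memory(S, z, k, v, gate)     # one write per finished frame
        outputs.append(out)
    return outputs, S, z

rng = random.Random(0)

def rand_tokens(n, d):
    return [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

# Five toy frames of four tokens each, with d_k = 3 and d_v = 2.
frames = [(rand_tokens(4, 3), rand_tokens(4, 3), rand_tokens(4, 2)) for _ in range(5)]
outputs, S, z = generate(frames, d_k=3, d_v=2)
print(len(outputs), len(S), len(S[0]))  # prints "5 3 2"
```

The state `S` stays at `d_k x d_v` entries no matter how many frames are generated, which is the constant-memory property the abstract contrasts with a key-value cache that grows with every frame. Per-frame cost is also independent of history length, so total cost grows linearly with the number of frames.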
