Fugu-MT Paper Translation (Abstract): VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Paper summary: VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

arxiv url: http://arxiv.org/abs/2605.30351v1
Date: Thu, 28 May 2026 17:59:57 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:56.763484
Title: VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
Title (reference translation): VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion
Authors: Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag
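Abstract summary: This paper presents the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.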
Reference score (independently computed level of interest): 18.312530927511606
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
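The memory claim in the abstract comes down to simple per-token arithmetic: a per-head cache stores one key and one value vector for every head, while the MLA-style cache stores one shared content latent plus one shared positional key. The sketch below illustrates that arithmetic only. The function names and all four dimensions are hypothetical, chosen for illustration; the abstract does not state the model's head count, head size, latent size, or positional-key size, so the printed figure is not expected to equal 92.7%.

```python
def kv_size_per_head_layout(num_heads: int, head_dim: int) -> int:
    """Cached values per token per layer when every head keeps its own key and value."""
    return 2 * num_heads * head_dim  # one key vector and one value vector per head


def kv_size_latent_layout(latent_dim: int, rope_dim: int) -> int:
    """Cached values per token per layer with one shared latent and one shared positional key."""
    return latent_dim + rope_dim


def memory_reduction(num_heads: int, head_dim: int, latent_dim: int, rope_dim: int) -> float:
    """Fraction of per-token KV memory saved by switching layouts."""
    baseline = kv_size_per_head_layout(num_heads, head_dim)
    compressed = kv_size_latent_layout(latent_dim, rope_dim)
    return 1.0 - compressed / baseline


# Hypothetical dimensions (NOT taken from the paper).
reduction = memory_reduction(num_heads=16, head_dim=64, latent_dim=128, rope_dim=32)
print(f"per-token KV memory reduction: {reduction:.1%}")
```

Because the saving depends only on the ratio of the two layouts' sizes, it is the same at every cached layer and for every cached token, which is consistent with the abstract quoting a single per-token percentage.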
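Abstract (reference translation): Long-rollout causal video diffusion has settled on a fixed-size sliding-window KV cache, and recent work has innovated within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a major contributor to streaming memory and latency, has remained mostly unchanged. This paper presents the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. The paper also examines why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, and its 99%-energy effective rank is far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. Both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.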
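The "99%-energy effective rank" mentioned in the abstract can be illustrated with a small sketch. It assumes the common convention that energy means the sum of squared singular values and that the effective rank is the smallest number of leading singular values whose energy reaches the threshold; the abstract does not spell out its exact definition, and the function name and sample spectra here are hypothetical.

```python
def effective_rank(singular_values, energy=0.99):
    """Smallest k such that the top-k squared singular values hold `energy` of the total.

    Assumed convention: energy = sum of squared singular values.
    """
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    if total == 0:
        return 0
    running = 0.0
    for k, value in enumerate(squared, start=1):
        running += value
        if running >= energy * total:
            return k
    return len(squared)


# Made-up spectra for illustration only.
fast_decay = [10.0, 0.1, 0.1, 0.1]  # one dominant direction
flat = [1.0] * 7                    # energy spread evenly across directions

print(effective_rank(fast_decay))  # 1: the first direction alone holds over 99%
print(effective_rank(flat))        # 7: every direction is needed
```

A flat spectrum like the second one is the situation the abstract describes for pretrained video attention: truncating to a small rank would discard a large share of the energy, which is why direct spectral approximation predicts large reconstruction error.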

Related papers