Fugu-MT Paper Translation (Abstract): FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

Paper overview: FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation

arxiv url: http://arxiv.org/abs/2603.09721v1
Date: Tue, 10 Mar 2026 14:28:32 GMT
Status: Translation complete
System update date: 2026-03-11 15:25:24.394105
Title: FrameDiT: Diffusion Transformer with Frame-Level Matrix Attention for Efficient Video Generation
Authors: Minh Khoa Le, Kien Do, Duc Thanh Nguyen, Truyen Tran
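Abstract summary: Matrix Attention is a frame-level temporal attention mechanism that processes an entire frame as a matrix. The authors build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion.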
Reference score (attention level, computed independently by the site): 24.0898579088124
License: http://creativecommons.org/licenses/by/4.0/
Abstract: High-fidelity video generation remains challenging for diffusion models due to the difficulty of modeling complex spatio-temporal dynamics efficiently. Recent video diffusion methods typically represent a video as a sequence of spatio-temporal tokens which can be modeled using Diffusion Transformers (DiTs). However, this approach faces a trade-off between the strong but expensive Full 3D Attention and the efficient but temporally limited Local Factorized Attention. To resolve this trade-off, we propose Matrix Attention, a frame-level temporal attention mechanism that processes an entire frame as a matrix and generates query, key, and value matrices via matrix-native operations. By attending across frames rather than tokens, Matrix Attention effectively preserves global spatio-temporal structure and adapts to significant motion. We build FrameDiT-G, a DiT architecture based on Matrix Attention, and further introduce FrameDiT-H, which integrates Matrix Attention with Local Factorized Attention to capture both large and small motion. Extensive experiments show that FrameDiT-H achieves state-of-the-art results across multiple video generation benchmarks, offering improved temporal coherence and video quality while maintaining efficiency comparable to Local Factorized Attention.
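The abstract describes attending across whole frames rather than individual tokens. The NumPy sketch below illustrates that idea only and is not the authors' implementation. The abstract does not specify the "matrix-native operations", so plain channel projections and a Frobenius inner product between frame matrices stand in for them as assumptions. The function names, tensor shapes, and the score-counting helper are hypothetical. The helper also assumes that Local Factorized Attention means per-frame spatial attention plus per-location temporal attention.

```python
import numpy as np


def frame_level_attention(x, w_q, w_k, w_v):
    """Attend across frames, treating each frame as one (N, D) matrix.

    x:   (T, N, D) video tokens: T frames, N spatial tokens, D channels.
    w_*: (D, D) channel projections. ASSUMPTION: a simple stand-in for the
         paper's matrix-native Q/K/V operations, which the abstract omits.
    """
    t, n, d = x.shape
    q = x @ w_q  # (T, N, D): one query matrix per frame
    k = x @ w_k  # (T, N, D): one key matrix per frame
    v = x @ w_v  # (T, N, D): one value matrix per frame

    # ASSUMPTION: score frame pairs with a scaled Frobenius inner product,
    # giving a (T, T) score matrix instead of a (T*N, T*N) one.
    scores = np.einsum("tnd,und->tu", q, k) / np.sqrt(n * d)

    # Row-wise softmax over frames, shifted for numerical stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)

    # Each output frame is a weighted sum of whole value frames.
    out = np.einsum("tu,und->tnd", weights, v)
    return out, weights


def attention_score_counts(t, n):
    """Count pairwise attention scores only (projection costs are ignored)."""
    return {
        "full_3d": (t * n) ** 2,                  # every token vs. every token
        "local_factorized": t * n**2 + n * t**2,  # spatial + temporal passes
        "frame_level": t**2,                      # every frame vs. every frame
    }


rng = np.random.default_rng(0)
T, N, D = 8, 16, 4
x = rng.standard_normal((T, N, D))
w_q, w_k, w_v = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(3))

out, weights = frame_level_attention(x, w_q, w_k, w_v)
counts = attention_score_counts(T, N)
print(out.shape, weights.shape)  # (8, 16, 4) (8, 8)
print(counts)
```

In this sketch the attention map has one entry per frame pair, which is why its score count stays far below that of token-level Full 3D Attention. The count ignores the cost of producing the query, key, and value matrices, so it should not be read as a full efficiency comparison.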
