Fugu-MT Paper Translation (Abstract): FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers

Paper summary: FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers

arxiv url: http://arxiv.org/abs/2506.04213v1
Date: Wed, 04 Jun 2025 17:57:09 GMT
Status: Translation complete
Last updated in system: 2025-06-05 21:20:14.511145
Title: FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers
Title (reference translation): FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers
Authors: Xuanhua He, Quande Liu, Zixuan Ye, Wecai Ye, Qiulin Wang, Xintao Wang, Qifeng Chen, Pengfei Wan, Di Zhang, Kun Gai
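Abstract summary: FullDiT2 is an efficient in-context conditioning framework for general controllability in both video generation and editing. It achieves a significant reduction in computation and a 2-3x speedup in average time cost per diffusion step.
Reference score (attention score computed by this site): 63.788600404496115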
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Fine-grained and efficient controllability on video diffusion transformers has raised increasing desires for the applicability. Recently, In-context Conditioning emerged as a powerful paradigm for unified conditional video generation, which enables diverse controls by concatenating varying context conditioning signals with noisy video latents into a long unified token sequence and jointly processing them via full-attention, e.g., FullDiT. Despite their effectiveness, these methods face quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in original in-context conditioning video generation framework. We begin with systematic analysis to identify two key sources of the computation inefficiencies: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates from two key perspectives. Firstly, to address the token redundancy, FullDiT2 leverages a dynamic token selection mechanism to adaptively identify important context tokens, reducing the sequence length for unified full-attention. Additionally, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and 2-3 times speedup in averaged time cost per diffusion step, with minimal degradation or even higher performance in video generation quality. The project page is at https://fulldit2.github.io/.
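Abstract (reference translation): Demand is growing for fine-grained and efficient controllability in video diffusion transformers. In-context Conditioning has recently emerged as a powerful paradigm for unified conditional video generation: it enables diverse controls by concatenating various context conditioning signals with noisy video latents into one long unified token sequence and processing them jointly with full attention (e.g., FullDiT). Despite their effectiveness, these methods incur quadratic computation overhead as task complexity grows, which hinders practical deployment. This paper studies the efficiency bottleneck that the original in-context conditioning video generation framework neglects. A systematic analysis first identifies two key sources of computational inefficiency: the inherent redundancy within the context condition tokens, and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these findings, the authors propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks. First, to address token redundancy, FullDiT2 uses a dynamic token selection mechanism that adaptively identifies important context tokens, reducing the sequence length for unified full attention. Second, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents. Extensive experiments on six diverse conditional video editing and generation tasks show that FullDiT2 achieves a significant reduction in computation and a 2-3x speedup in average time cost per diffusion step, with minimal degradation or even higher performance in video generation quality. The project page is at https://fulldit2.github.io/.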
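Illustrative note: the abstract attributes the cost to full attention over the concatenated sequence of context tokens and video latents, which grows quadratically with sequence length. The Python sketch below only illustrates that arithmetic and the general idea of keeping a subset of context tokens. The token counts, the importance scores, and the top-k rule (`keep_top_k`) are hypothetical and are not taken from the paper, whose actual selection and caching mechanisms are not described here.

```python
def attention_pairs(latent_tokens: int, context_tokens: int) -> int:
    """Query-key pairs evaluated by full self-attention over the joint sequence."""
    n = latent_tokens + context_tokens
    return n * n


def keep_top_k(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring context tokens, in original order.

    Hypothetical selection rule used for illustration only.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical sizes: 1000 latent tokens, 1000 context tokens, half of them kept.
full_cost = attention_pairs(1000, 1000)    # 2000 * 2000 pairs
reduced_cost = attention_pairs(1000, 500)  # 1500 * 1500 pairs
saving = 1 - reduced_cost / full_cost      # fraction of pairs avoided

# Toy importance scores for four context tokens; keep the best two.
selected = keep_top_k([0.1, 0.9, 0.5, 0.7], k=2)
```

With these made-up numbers, dropping half of the context tokens removes 43.75% of the attention pairs, since the cost depends on the square of the total sequence length rather than on the context length alone.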

List of related papers