FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers
- URL: http://arxiv.org/abs/2506.04213v2
- Date: Thu, 05 Jun 2025 03:35:21 GMT
- Title: FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers
- Authors: Xuanhua He, Quande Liu, Zixuan Ye, Weicai Ye, Qiulin Wang, Xintao Wang, Qifeng Chen, Pengfei Wan, Di Zhang, Kun Gai
- Abstract summary: FullDiT2 is an efficient in-context conditioning framework for general controllability in both video generation and editing tasks. It achieves significant computation reduction and a 2-3x speedup in average time cost per diffusion step.
- Score: 63.788600404496115
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Fine-grained and efficient controllability of video diffusion transformers is increasingly desired for practical applications. Recently, in-context conditioning has emerged as a powerful paradigm for unified conditional video generation, enabling diverse controls by concatenating varying context conditioning signals with noisy video latents into a long unified token sequence and jointly processing them via full attention, e.g., FullDiT. Despite their effectiveness, these methods face quadratic computation overhead as task complexity increases, hindering practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning video generation framework. We begin with a systematic analysis that identifies two key sources of computational inefficiency: the inherent redundancy within context condition tokens and the computational redundancy in context-latent interactions throughout the diffusion process. Based on these insights, we propose FullDiT2, an efficient in-context conditioning framework for general controllability in both video generation and editing tasks, which innovates in two key respects. First, to address token redundancy, FullDiT2 leverages a dynamic token selection mechanism to adaptively identify important context tokens, reducing the sequence length for unified full attention. Second, a selective context caching mechanism is devised to minimize redundant interactions between condition tokens and video latents. Extensive experiments on six diverse conditional video editing and generation tasks demonstrate that FullDiT2 achieves significant computation reduction and a 2-3x speedup in average time cost per diffusion step, with minimal degradation, and in some cases improvement, in video generation quality. The project page is at https://fulldit2.github.io/.
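To make the two ideas in the abstract concrete, below is a minimal PyTorch-style sketch of (a) dynamic token selection over context-condition tokens and (b) selective caching of the selected context across diffusion steps. The similarity-based scoring rule, the keep ratio, and the fixed refresh interval are illustrative assumptions for this sketch, not the exact FullDiT2 design.

```python
# Minimal sketch of the two mechanisms described in the abstract:
# (1) dynamic token selection over context-condition tokens, and
# (2) selective caching of the selected context across diffusion steps.
# Scoring rule, keep ratio, and refresh interval are illustrative assumptions.
import torch
import torch.nn.functional as F


def select_context_tokens(context, latent, keep_ratio=0.25):
    """Keep only the context tokens most relevant to the current video latents.

    context: (B, Nc, D) condition tokens; latent: (B, Nl, D) noisy video tokens.
    Relevance here is the mean cosine similarity to the latent tokens (an assumption).
    """
    sim = torch.einsum("bcd,bld->bc",
                       F.normalize(context, dim=-1),
                       F.normalize(latent, dim=-1)) / latent.shape[1]
    k = max(1, int(keep_ratio * context.shape[1]))
    idx = sim.topk(k, dim=1).indices                                   # (B, k)
    return torch.gather(context, 1, idx.unsqueeze(-1).expand(-1, -1, context.shape[-1]))


class SelectiveContextCache:
    """Recompute the selected context only every `refresh_every` diffusion steps
    and reuse the cached tokens in between (an illustrative caching policy)."""

    def __init__(self, refresh_every=4):
        self.refresh_every = refresh_every
        self.cached = None

    def get(self, step, context, latent):
        if self.cached is None or step % self.refresh_every == 0:
            self.cached = select_context_tokens(context, latent)
        return self.cached


def denoising_step(block, cache, step, latent, context):
    """Apply one full-attention transformer block to the shortened unified sequence."""
    ctx = cache.get(step, context, latent)              # (B, k, D), k << Nc
    tokens = torch.cat([ctx, latent], dim=1)            # unified token sequence, now shorter
    tokens = block(tokens)                              # full attention over fewer tokens
    return tokens[:, ctx.shape[1]:]                     # return updated video latents only
```

Because full attention scales quadratically in sequence length, shortening the context portion and skipping its recomputation on most steps is where the reported per-step savings would come from under these assumptions.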
Related papers
- LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z)
- FullDiT: Multi-Task Video Generative Foundation Model with Full Attention [37.776430879317765]
FullDiT is a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full attention in complex multi-task video generation.
arXiv Detail & Related papers (2025-03-25T17:59:06Z)
- OminiControl2: Efficient Conditioning for Diffusion Transformers [68.3243031301164]
We present OminiControl2, an efficient framework for image-conditional image generation. OminiControl2 introduces two key innovations: (1) a dynamic compression strategy that streamlines conditional inputs by preserving only the most semantically relevant tokens during generation, and (2) a conditional feature reuse mechanism that computes condition token features only once and reuses them across denoising steps.
arXiv Detail & Related papers (2025-03-11T10:50:14Z)
- Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion [28.38307253613529]
We propose a framework that integrates temporal-spatial and semantic consistency with Baliteral DDIM inversion. Our method significantly improves perceptual quality, text-image alignment, and temporal coherence, as demonstrated on the MSR-VTT dataset.
arXiv Detail & Related papers (2025-01-08T16:41:31Z)
- Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding [11.211803499867639]
We propose DYTO, a novel dynamic token merging framework for zero-shot video understanding. DYTO integrates hierarchical frame selection and a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences. Experiments demonstrate the effectiveness of DYTO, achieving superior performance compared to both fine-tuned and training-free methods.
arXiv Detail & Related papers (2024-11-21T18:30:11Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video. We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing. COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, increasing the complexity of text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)