FullDiT: Multi-Task Video Generative Foundation Model with Full Attention
- URL: http://arxiv.org/abs/2503.19907v1
- Date: Tue, 25 Mar 2025 17:59:06 GMT
- Title: FullDiT: Multi-Task Video Generative Foundation Model with Full Attention
- Authors: Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qiang Xu
- Abstract summary: FullDiT is a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full-attention in complex multi-task video generation.
- Score: 37.776430879317765
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and shows scalability and emergent ability. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full-attention in complex multi-task video generation.
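The mechanism the abstract describes, tokenizing each condition, concatenating the condition tokens with the noisy video tokens into one sequence, and letting full self-attention model their interactions, can be illustrated with a minimal PyTorch sketch. Everything below (the FullAttentionBlock module, the fuse_conditions helper, and the token counts for camera/identity/depth conditions) is an illustrative assumption based on the abstract, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): unified full-attention conditioning.
# Assumption: every condition is first tokenized/projected to the model width,
# then concatenated with the noisy video tokens into one long sequence.
import torch
import torch.nn as nn

class FullAttentionBlock(nn.Module):
    """One transformer block applying full self-attention over the
    concatenated [conditions | video] token sequence (hypothetical layout)."""
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # every token attends to every token
        return x + self.mlp(self.norm2(x))

def fuse_conditions(video_tokens: torch.Tensor, *condition_tokens: torch.Tensor) -> torch.Tensor:
    """Concatenate condition token sequences with video tokens along the
    sequence axis, so full attention can capture condition dynamics."""
    return torch.cat([*condition_tokens, video_tokens], dim=1)

if __name__ == "__main__":
    B, D = 2, 512
    video = torch.randn(B, 1024, D)    # noisy video latent tokens (assumed count)
    camera = torch.randn(B, 16, D)     # e.g. camera-trajectory tokens (assumed)
    identity = torch.randn(B, 64, D)   # e.g. identity/reference tokens (assumed)
    depth = torch.randn(B, 256, D)     # e.g. depth-condition tokens (assumed)
    seq = fuse_conditions(video, camera, identity, depth)
    out = FullAttentionBlock(D)(seq)   # one unified sequence, one attention
    print(out.shape)                   # torch.Size([2, 1360, 512])
```

In this sketch the conditions share all backbone parameters with the video tokens, which is consistent with the reduced parameter overhead and avoided branch conflicts claimed above.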
Related papers
- DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos.
These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z)
- UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer [24.159791066104358]
We introduce a DiT-based multi-conditional controllable generative framework capable of handling any combination of conditions. Specifically, we introduce a novel MMDiT Attention mechanism and incorporate a trainable LoRA module. We also propose a new pipeline to construct SubjectSpatial200K, the first dataset designed for multi-conditional generative tasks.
arXiv Detail & Related papers (2025-03-12T11:22:47Z)
- DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is the first training-free multi-prompt video generation method built on MM-DiT architectures. We analyze MM-DiT's attention mechanism, finding that its 3D full attention behaves similarly to the cross/self-attention blocks in UNet-like diffusion models. Based on this careful design, the videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts.
arXiv Detail & Related papers (2024-12-24T18:51:19Z)
- Beyond Generation: Unlocking Universal Editing via Self-Supervised Fine-Tuning [45.64777118760738]
UES (Unlocking Universal Editing via Self-Supervision) is a lightweight self-supervised fine-tuning strategy that transforms generation models into unified generation-editing systems. Our approach establishes a dual-conditioning mechanism where original video-text pairs jointly provide visual and textual semantics. To enable systematic evaluation, we introduce OmniBench-99, a comprehensive benchmark spanning 99 videos across humans/animals, environments, and objects.
arXiv Detail & Related papers (2024-12-03T03:10:19Z)
- DiVE: DiT-based Video Generation with Enhanced Control [23.63288169762629]
We propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos.
Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency (a minimal sketch of this idea follows this entry).
arXiv Detail & Related papers (2024-09-03T04:29:59Z)
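The "parameter-free spatial view-inflated attention" mentioned above appears to be a reshaping trick: tokens from all camera views are folded into a single sequence so that an existing self-attention layer attends across views without any new weights. The sketch below is a hedged illustration of that idea under assumed shapes (the view_inflated_attention helper and the six-view layout are hypothetical), not DiVE's actual code.

```python
# Hedged sketch of view-inflated attention: reuse an existing self-attention
# module unchanged, only reshaping so tokens from all views share one sequence.
import torch
import torch.nn as nn

def view_inflated_attention(attn: nn.MultiheadAttention, x: torch.Tensor) -> torch.Tensor:
    """x: (batch, views, tokens, dim). Folds the view axis into the token axis,
    runs the unchanged attention module, then restores the original shape.
    No new parameters are introduced -- hence 'parameter-free'."""
    b, v, n, d = x.shape
    flat = x.reshape(b, v * n, d)                 # all views in one sequence
    out, _ = attn(flat, flat, flat, need_weights=False)
    return out.reshape(b, v, n, d)

if __name__ == "__main__":
    attn = nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
    x = torch.randn(2, 6, 128, 256)               # 6 camera views, 128 tokens each (assumed)
    y = view_inflated_attention(attn, x)
    print(y.shape)                                # torch.Size([2, 6, 128, 256])
```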
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
- OmniControlNet: Dual-stage Integration for Conditional Image Generation [61.1432268643639]
We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method.
Our proposed OmniControlNet consolidates 1) the condition generation by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance.
arXiv Detail & Related papers (2024-06-09T18:03:47Z)
- Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality [26.55645677311152]
Video paragraph captioning (VPC) involves generating detailed narratives for long videos.
Existing models are constrained by the assumption of constant availability of a single auxiliary modality.
We propose a Missing-Resistant framework that harnesses all available auxiliary inputs and maintains resilience even in the absence of certain modalities.
arXiv Detail & Related papers (2024-03-28T08:35:46Z)
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning. We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy. We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module; a sketch of this cascaded selection appears after this entry.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
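The decomposition MIST describes, replacing dense spatial-temporal self-attention with cascaded segment selection followed by region selection, can be approximated by the short sketch below. The dot-product scoring against a question embedding, the top-k sizes, and the cascaded_select helper are illustrative assumptions rather than MIST's published implementation; the reduced token set would then feed a standard attention module, which is where the efficiency gain comes from.

```python
# Hedged sketch of cascaded selection: score video segments against the question,
# keep the top-k segments, then keep the top-k regions inside them, and only
# attend over this reduced token set instead of all frames x all patches.
import torch

def cascaded_select(video: torch.Tensor, question: torch.Tensor,
                    k_seg: int = 4, k_reg: int = 16) -> torch.Tensor:
    """video: (segments, regions, dim); question: (dim,).
    Returns a (k_seg * k_reg, dim) token set for a downstream attention module."""
    seg_scores = video.mean(dim=1) @ question               # (segments,) coarse relevance
    top_seg = seg_scores.topk(k_seg).indices
    selected = video[top_seg]                                # (k_seg, regions, dim)
    reg_scores = selected @ question                         # (k_seg, regions) fine relevance
    top_reg = reg_scores.topk(k_reg, dim=1).indices          # (k_seg, k_reg)
    picked = torch.gather(selected, 1,
                          top_reg.unsqueeze(-1).expand(-1, -1, video.size(-1)))
    return picked.reshape(-1, video.size(-1))                # tokens kept for attention

if __name__ == "__main__":
    video = torch.randn(32, 196, 512)                        # 32 segments, 196 patches each (assumed)
    question = torch.randn(512)
    tokens = cascaded_select(video, question)
    print(tokens.shape)                                      # torch.Size([64, 512]) vs 6272 dense tokens
```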
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.