Related papers: FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

FullDiT: Multi-Task Video Generative Foundation Model with Full Attention

URL: http://arxiv.org/abs/2503.19907v1
Date: Tue, 25 Mar 2025 17:59:06 GMT
Title: FullDiT: Multi-Task Video Generative Foundation Model with Full Attention
Authors: Xuan Ju, Weicai Ye, Quande Liu, Qiulin Wang, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Qiang Xu,
Abstract summary: FullDiT is a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms.<n> Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full-attention in complex multi-task video generation.
Score: 37.776430879317765
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids conditions conflict, and shows scalability and emergent ability. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full-attention in complex multi-task video generation.

Related papers

DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z)
UniCombine: Unified Multi-Conditional Combination with Diffusion Transformer [24.159791066104358]
We introduce a DiT-based multi-conditional controllable generative framework capable of handling any combination of conditions.<n>Specifically, we introduce a novel MMDiT Attention mechanism and incorporate a trainable LoRA module.<n>We also propose a new pipeline to construct SubjectSpatial200K, the first dataset designed for multi-conditional generative tasks.
arXiv Detail & Related papers (2025-03-12T11:22:47Z)
DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation [54.30327187663316]
DiTCtrl is a training-free multi-prompt video generation method under MM-DiT architectures for the first time.<n>We analyze MM-DiT's attention mechanism, finding that the 3D full attention behaves similarly to that of the cross/self-attention blocks in the UNet-like diffusion models.<n>Based on our careful design, the video generated by DiTCtrl achieves smooth transitions and consistent object motion given multiple sequential prompts.
arXiv Detail & Related papers (2024-12-24T18:51:19Z)
Beyond Generation: Unlocking Universal Editing via Self-Supervised Fine-Tuning [45.64777118760738]
UES (Unlocking Universal Editing via Self-Supervision) is a lightweight self-supervised fine-tuning strategy that transforms generation models into unified generation-editing systems.<n>Our approach establishes a dual-conditioning mechanism where original video-text pairs jointly provide visual and textual semantics.<n>To enable systematic evaluation, we introduce OmniBench-99, a comprehensive benchmark spanning 99 videos across humans/animals, environments, and objects.
arXiv Detail & Related papers (2024-12-03T03:10:19Z)
DiVE: DiT-based Video Generation with Enhanced Control [23.63288169762629]
We propose first DiT-based framework specifically designed for generating temporally and multi-view consistent videos. Specifically, the proposed framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee the cross-view consistency.
arXiv Detail & Related papers (2024-09-03T04:29:59Z)
VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining. We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts. We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z)
OmniControlNet: Dual-stage Integration for Conditional Image Generation [61.1432268643639]
We provide a two-way integration for the widely adopted ControlNet by integrating external condition generation algorithms into a single dense prediction method. Our proposed OmniControlNet consolidates 1) the condition generation by a single multi-tasking dense prediction algorithm under the task embedding guidance and 2) the image generation process for different conditioning types under the textual embedding guidance.
arXiv Detail & Related papers (2024-06-09T18:03:47Z)
Towards Multimodal Video Paragraph Captioning Models Robust to Missing Modality [26.55645677311152]
Video paragraph captioning (VPC) involves generating detailed narratives for long videos. Existing models are constrained by the assumption of constant availability of a single auxiliary modality. We propose a Missing-Resistant framework that harnesses all available auxiliary inputs and maintains resilience even in the absence of certain modalities.
arXiv Detail & Related papers (2024-03-28T08:35:46Z)
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.<n>We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.<n>We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z)
MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules. Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.