DragNUWA: Fine-grained Control in Video Generation by Integrating Text,
Image, and Trajectory
- URL: http://arxiv.org/abs/2308.08089v1
- Date: Wed, 16 Aug 2023 01:43:41 GMT
- Title: DragNUWA: Fine-grained Control in Video Generation by Integrating Text,
Image, and Trajectory
- Authors: Shengming Yin, Chenfei Wu, Jian Liang, Jie Shi, Houqiang Li, Gong
Ming, Nan Duan
- Abstract summary: DragNUWA is an open-domain diffusion-based video generation model.
It provides fine-grained control over video content from semantic, spatial, and temporal perspectives.
Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation.
- Score: 126.4597063554213
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Controllable video generation has gained significant attention in recent
years. However, two main limitations persist: Firstly, most existing works
focus on text-, image-, or trajectory-based control alone, and thus cannot
achieve fine-grained control in videos. Secondly, trajectory
control research is still in its early stages, with most experiments being
conducted on simple datasets like Human3.6M. This constraint limits the models'
capability to process open-domain images and effectively handle complex curved
trajectories. In this paper, we propose DragNUWA, an open-domain
diffusion-based video generation model. To tackle the issue of insufficient
control granularity in existing works, we simultaneously introduce text, image,
and trajectory information to provide fine-grained control over video content
from semantic, spatial, and temporal perspectives. To resolve the problem of
limited open-domain trajectory control in current research, we propose
trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable
open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to
control trajectories at different granularities, and an Adaptive Training (AT)
strategy to generate consistent videos following trajectories. Our experiments
validate the effectiveness of DragNUWA, demonstrating its superior performance
in fine-grained control in video generation. The homepage link is
\url{https://www.microsoft.com/en-us/research/project/dragnuwa/}
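No code accompanies this listing; as a rough illustration of the kind of trajectory conditioning the abstract describes (taking an arbitrary user-drawn trajectory and fusing it at several granularities), the sketch below rasterizes a sparse point trajectory into dense per-frame displacement maps and pools them to multiple scales. Every function name, tensor shape, and parameter here is an assumption for illustration, not DragNUWA's actual implementation.
```python
# Hypothetical sketch: turn a sparse trajectory (one dragged point over T frames)
# into dense per-frame displacement maps at several resolutions, the kind of
# spatio-temporal control signal a trajectory-conditioned video diffusion model
# could consume alongside text and image inputs. Not DragNUWA's actual code.
import numpy as np

def trajectory_to_maps(points, height, width, sigma=5.0):
    """points: (T, 2) array of (x, y) positions of a dragged point per frame.
    Returns a (T-1, 2, height, width) array of (dx, dy) displacement fields,
    spread around the trajectory with a Gaussian so nearby pixels move too."""
    points = np.asarray(points, dtype=np.float32)
    T = points.shape[0]
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float32)
    maps = np.zeros((T - 1, 2, height, width), dtype=np.float32)
    for t in range(T - 1):
        x, y = points[t]
        dx, dy = points[t + 1] - points[t]
        weight = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        maps[t, 0] = dx * weight
        maps[t, 1] = dy * weight
    return maps

def multiscale(maps, scales=(1, 2, 4)):
    """Average-pool the displacement maps to coarser resolutions (a stand-in
    for fusing trajectory control at different granularities)."""
    T, C, H, W = maps.shape
    out = []
    for s in scales:
        h, w = H // s, W // s
        pooled = maps[:, :, :h * s, :w * s].reshape(T, C, h, s, w, s).mean(axis=(3, 5))
        out.append(pooled)
    return out

# Example: a point dragged along a curve over 8 frames of a 64x64 image.
trajectory = [(10 + 5 * t, 32 + 8 * np.sin(t / 2.0)) for t in range(8)]
maps = trajectory_to_maps(trajectory, height=64, width=64)
pyramid = multiscale(maps)
print([m.shape for m in pyramid])  # [(7, 2, 64, 64), (7, 2, 32, 32), (7, 2, 16, 16)]
```
In a real model these maps would be fed, together with text and image embeddings, as an extra conditioning input to the diffusion backbone; the sketch only covers this preprocessing step.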
Related papers
- DiVE: DiT-based Video Generation with Enhanced Control [23.63288169762629]
We propose the first DiT-based framework specifically designed for generating temporally and multi-view consistent videos.
Specifically, the framework leverages a parameter-free spatial view-inflated attention mechanism to guarantee cross-view consistency.
arXiv Detail & Related papers (2024-09-03T04:29:59Z)
- TrackGo: A Flexible and Efficient Method for Controllable Video Generation [33.62804888664707]
We introduce TrackGo, a novel approach that leverages free-form masks and arrows for conditional video generation.
We also propose the TrackAdapter, an efficient and lightweight adapter for control implementation, designed to be seamlessly integrated into the temporal self-attention layers.
Our experimental results demonstrate that our new approach, enhanced by the TrackAdapter, achieves state-of-the-art performance on key metrics such as FVD, FID, and MC scores.
arXiv Detail & Related papers (2024-08-21T09:42:04Z)
- TraDiffusion: Trajectory-Based Training-Free Image Generation [85.39724878576584]
We propose a training-free, trajectory-based controllable T2I approach, termed TraDiffusion.
This novel method allows users to effortlessly guide image generation via mouse trajectories.
arXiv Detail & Related papers (2024-08-19T07:01:43Z)
- VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism based on Plucker coordinates.
Our work is the first to enable camera control for transformer-based video diffusion models.
arXiv Detail & Related papers (2024-07-17T17:59:05Z)
- PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models [55.080748327139176]
We introduce PerLDiff, a method for effective street view image generation that fully leverages perspective 3D geometric information.
Our results show that PerLDiff markedly enhances the precision of generation on the NuScenes and KITTI datasets.
arXiv Detail & Related papers (2024-07-08T16:46:47Z)
- FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models [41.006754386910686]
We argue that the diffusion model itself allows decent control over the generated content without requiring any training.
We introduce a tuning-free framework to achieve trajectory-controllable video generation, by imposing guidance on both noise construction and attention computation.
arXiv Detail & Related papers (2024-06-24T17:59:56Z)
- Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion [8.068194154084967]
This paper tackles the challenge of exerting precise control over object motion for realistic video synthesis.
To accomplish this, we control object movements using bounding boxes and extend this control to the renderings of 2D or 3D boxes in pixel space (see the generic box-to-mask sketch after this list).
Our method, Ctrl-V, leverages modified and fine-tuned Stable Video Diffusion (SVD) models to solve both trajectory and video generation.
arXiv Detail & Related papers (2024-06-09T03:44:35Z)
- Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model [62.51232333352754]
Ctrl-Adapter adds diverse controls to any image/video diffusion model through the adaptation of pretrained ControlNets.
With six diverse U-Net/DiT-based image/video diffusion models, Ctrl-Adapter matches the performance of pretrained ControlNets on COCO.
arXiv Detail & Related papers (2024-04-15T17:45:36Z)
- TrailBlazer: Trajectory Control for Diffusion-Based Video Generation [11.655256653219604]
Controllability in text-to-video (T2V) generation is often a challenge.
We introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts.
Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.
arXiv Detail & Related papers (2023-12-31T10:51:52Z)
- Fine-grained Controllable Video Generation via Object Appearance and Context [74.23066823064575]
We propose fine-grained controllable video generation (FACTOR) to achieve detailed control.
FACTOR aims to control objects' appearances and context, including their location and category.
Our method achieves controllability of object appearance without finetuning, reducing the per-subject optimization effort for users.
arXiv Detail & Related papers (2023-12-05T17:47:33Z)
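Several of the papers above (e.g. Ctrl-V and TrailBlazer) condition video diffusion on per-frame bounding boxes rendered in pixel space. The sketch below shows one generic way such a control signal could be prepared: boxes given at a few keyframes are interpolated to every frame and rasterized into a binary mask video. The box format, shapes, and helper names are assumptions for illustration and are not taken from either paper.
```python
# Hypothetical sketch: render a per-frame 2D bounding-box track into a binary
# mask video, the kind of pixel-space control signal box-conditioned video
# diffusion methods describe. Formats and shapes are illustrative assumptions.
import numpy as np

def boxes_to_masks(boxes, height, width):
    """boxes: (T, 4) array of per-frame (x0, y0, x1, y1) corners in pixels.
    Returns a (T, 1, height, width) float array: 1 inside the box, 0 outside."""
    boxes = np.asarray(boxes, dtype=np.float32)
    masks = np.zeros((boxes.shape[0], 1, height, width), dtype=np.float32)
    for t, (x0, y0, x1, y1) in enumerate(boxes):
        x0, x1 = int(max(0, x0)), int(min(width, x1))
        y0, y1 = int(max(0, y0)), int(min(height, y1))
        masks[t, 0, y0:y1, x0:x1] = 1.0
    return masks

def interpolate_keyframes(key_boxes, key_frames, num_frames):
    """Linearly interpolate boxes given at a few keyframes to every frame,
    mirroring the idea of specifying motion sparsely via keyframes."""
    key_boxes = np.asarray(key_boxes, dtype=np.float32)
    frames = np.arange(num_frames)
    return np.stack(
        [np.interp(frames, key_frames, key_boxes[:, i]) for i in range(4)], axis=1
    )

# Example: a box that moves right and grows over 16 frames of a 64x64 clip.
boxes = interpolate_keyframes([(4, 24, 20, 40), (36, 16, 60, 48)], [0, 15], 16)
masks = boxes_to_masks(boxes, height=64, width=64)
print(masks.shape, masks[0].sum(), masks[-1].sum())  # (16, 1, 64, 64) and mask areas
```
The resulting mask video would typically be concatenated with (or injected alongside) the model's latent inputs as an additional conditioning channel; how that injection is done differs per method.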