TrackGo: A Flexible and Efficient Method for Controllable Video Generation
- URL: http://arxiv.org/abs/2408.11475v1
- Date: Wed, 21 Aug 2024 09:42:04 GMT
- Title: TrackGo: A Flexible and Efficient Method for Controllable Video Generation
- Authors: Haitao Zhou, Chuang Wang, Rui Nie, Jinxiao Lin, Dongdong Yu, Qian Yu, Changhu Wang
- Abstract summary: We introduce TrackGo, a novel approach for conditional video generation.
TrackGo offers users a flexible and precise mechanism for manipulating video content.
We also propose the TrackAdapter, an efficient and lightweight adapter for implementing this control.
- Score: 32.906496577618924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have seen substantial progress in diffusion-based controllable video generation. However, achieving precise control in complex scenarios, including fine-grained object parts, sophisticated motion trajectories, and coherent background movement, remains a challenge. In this paper, we introduce TrackGo, a novel approach that leverages free-form masks and arrows for conditional video generation. This method offers users a flexible and precise mechanism for manipulating video content. We also propose the TrackAdapter for control implementation, an efficient and lightweight adapter designed to be seamlessly integrated into the temporal self-attention layers of a pretrained video generation model. This design leverages our observation that the attention map of these layers can accurately activate regions corresponding to motion in videos. Our experimental results demonstrate that our new approach, enhanced by the TrackAdapter, achieves state-of-the-art performance on key metrics such as FVD, FID, and ObjMC scores. The project page of TrackGo can be found at: https://zhtjtcz.github.io/TrackGo-Page/
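The abstract describes the TrackAdapter only at a high level: a lightweight module inserted into the temporal self-attention layers of a frozen, pretrained video generation model. As a rough illustration of that kind of design (not the authors' implementation), the PyTorch sketch below adds a zero-initialized adapter branch alongside a frozen temporal attention layer; the module and argument names such as `TemporalAttnAdapter` and `motion_mask` are hypothetical.

```python
import torch
import torch.nn as nn


class TemporalAttnAdapter(nn.Module):
    """Hypothetical lightweight adapter attached to a temporal self-attention layer.

    The frozen base attention runs unchanged; a small parallel branch attends over
    the same frame tokens, optionally restricted by a user-derived motion mask, and
    is added back through a zero-initialized projection so training starts from the
    unmodified pretrained behaviour.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # Stands in for the pretrained temporal self-attention layer (kept frozen).
        self.base_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        for p in self.base_attn.parameters():
            p.requires_grad = False
        # Trainable adapter branch.
        self.adapter_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.out_proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.out_proj.weight)  # adapter starts as a no-op
        nn.init.zeros_(self.out_proj.bias)

    def forward(self, x: torch.Tensor, motion_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) tokens along the temporal axis.
        # motion_mask: (batch, frames) boolean; True marks frames the adapter
        # branch should ignore (purely illustrative use of the control signal).
        base, _ = self.base_attn(x, x, x)
        extra, _ = self.adapter_attn(x, x, x, key_padding_mask=motion_mask)
        return base + self.out_proj(extra)


if __name__ == "__main__":
    layer = TemporalAttnAdapter(dim=64)
    video_tokens = torch.randn(2, 16, 64)         # 2 clips, 16 frames, 64-dim tokens
    mask = torch.zeros(2, 16, dtype=torch.bool)   # attend to all frames
    print(layer(video_tokens, mask).shape)        # torch.Size([2, 16, 64])
```

Zero-initializing the output projection keeps the pretrained model's behaviour intact at the start of training, a common choice for adapters added to frozen backbones; the actual TrackAdapter design may differ.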
Related papers
- MotionBooth: Motion-Aware Customized Text-to-Video Generation [44.41894050494623]
MotionBooth is a framework designed for animating customized subjects with precise control over both object and camera movements.
We efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately.
Our approach introduces a subject region loss and a video preservation loss to enhance learning of the subject.
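The two losses are only named in the summary; a generic way such terms might be combined during fine-tuning is sketched below. The function names, weights, and mask handling are illustrative assumptions, not MotionBooth's actual formulation.

```python
import torch
import torch.nn.functional as F


def combined_loss(pred_noise: torch.Tensor,
                  target_noise: torch.Tensor,
                  subject_mask: torch.Tensor,
                  preserved_pred: torch.Tensor,
                  preserved_ref: torch.Tensor,
                  w_subject: float = 1.0,
                  w_preserve: float = 1.0) -> torch.Tensor:
    """Illustrative combination of a standard diffusion loss with a
    subject-region term and a video-preservation term."""
    # Standard denoising objective over the whole frame.
    diffusion = F.mse_loss(pred_noise, target_noise)
    # Extra weight on the latents covered by the subject mask.
    subject = F.mse_loss(pred_noise * subject_mask, target_noise * subject_mask)
    # Keep predictions on unrelated video data close to a frozen reference model.
    preserve = F.mse_loss(preserved_pred, preserved_ref)
    return diffusion + w_subject * subject + w_preserve * preserve


if __name__ == "__main__":
    b, c, h, w = 2, 4, 8, 8
    loss = combined_loss(torch.randn(b, c, h, w), torch.randn(b, c, h, w),
                         (torch.rand(b, 1, h, w) > 0.5).float(),
                         torch.randn(b, c, h, w), torch.randn(b, c, h, w))
    print(loss.item())
```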
arXiv Detail & Related papers (2024-06-25T17:42:25Z)
- Image Conductor: Precision Control for Interactive Video Synthesis [90.2353794019393]
Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements.
Image Conductor is a method for precise control of camera transitions and object movements to generate video assets from a single image.
arXiv Detail & Related papers (2024-06-21T17:55:05Z)
- MOFA-Video: Controllable Image Animation via Generative Motion Field Adaptions in Frozen Image-to-Video Diffusion Model [78.11258752076046]
MOFA-Video is an advanced controllable image animation method that generates videos from a given image using various additional control signals.
We design several domain-aware motion field adapters to control the generated motions in the video generation pipeline.
After training, the MOFA-Adapters in different domains can also work together for more controllable video generation.
arXiv Detail & Related papers (2024-05-30T16:22:22Z)
- TrailBlazer: Trajectory Control for Diffusion-Based Video Generation [11.655256653219604]
Controllability in text-to-video (T2V) generation is often a challenge.
We introduce the concept of keyframing, allowing the subject trajectory and overall appearance to be guided by both a moving bounding box and corresponding prompts.
Despite the simplicity of the bounding box guidance, the resulting motion is surprisingly natural, with emergent effects including perspective and movement toward the virtual camera as the box size increases.
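Keyframed bounding-box guidance implies interpolating the box between user-specified keyframes for every generated frame. The sketch below shows a minimal linear interpolation of that kind; it is an assumption about the preprocessing step, not TrailBlazer's code.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def interpolate_boxes(keyframes: Dict[int, Box], num_frames: int) -> List[Box]:
    """Linearly interpolate bounding boxes between keyframes.

    `keyframes` maps frame indices to boxes; frames before the first or after
    the last keyframe simply repeat the nearest keyframed box.
    """
    frames = sorted(keyframes)
    boxes: List[Box] = []
    for t in range(num_frames):
        if t <= frames[0]:
            boxes.append(keyframes[frames[0]])
        elif t >= frames[-1]:
            boxes.append(keyframes[frames[-1]])
        else:
            # Find the surrounding keyframes and blend linearly between them.
            prv = max(f for f in frames if f <= t)
            nxt = min(f for f in frames if f >= t)
            if prv == nxt:
                boxes.append(keyframes[prv])
                continue
            a = (t - prv) / (nxt - prv)
            b0, b1 = keyframes[prv], keyframes[nxt]
            boxes.append(tuple((1 - a) * u + a * v for u, v in zip(b0, b1)))
    return boxes


if __name__ == "__main__":
    # Subject starts small at the top-left and grows toward the camera.
    traj = interpolate_boxes({0: (0.0, 0.0, 0.2, 0.2), 15: (0.3, 0.3, 0.9, 0.9)}, 16)
    print(traj[0], traj[8], traj[15])
```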
arXiv Detail & Related papers (2023-12-31T10:51:52Z)
- TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
The video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
- Propagate And Calibrate: Real-time Passive Non-line-of-sight Tracking [84.38335117043907]
We propose a purely passive method to track a person walking in an invisible room by only observing a relay wall.
To extract imperceptible changes in videos of the relay wall, we introduce difference frames as an essential carrier of temporal-local motion cues.
To evaluate the proposed method, we build and publish the first dynamic passive NLOS tracking dataset, NLOS-Track.
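A difference frame is simply the change between consecutive frames of the relay-wall video. The minimal sketch below computes them for a (T, H, W) grayscale clip; the tensor layout is an assumption, and this is not the authors' exact preprocessing.

```python
import torch


def difference_frames(video: torch.Tensor) -> torch.Tensor:
    """Compute frame-to-frame difference images for a (T, H, W) grayscale video.

    The result has T-1 frames; each captures the subtle temporal changes on the
    relay wall that a single static frame would not reveal.
    """
    return video[1:] - video[:-1]


if __name__ == "__main__":
    clip = torch.rand(32, 64, 64)      # 32 frames of a 64x64 relay-wall video
    print(difference_frames(clip).shape)  # torch.Size([31, 64, 64])
```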
arXiv Detail & Related papers (2023-03-21T12:18:57Z)
- Video Relation Detection via Tracklet based Visual Transformer [12.31184296559801]
Video Visual Relation Detection (VidVRD) has received significant attention from the research community in recent years.
We apply the state-of-the-art video object detector MEGA together with the deepSORT tracker to generate tracklet proposals.
Then we perform VidVRD in a tracklet-based manner without any pre-cutting operations.
arXiv Detail & Related papers (2021-08-19T13:13:23Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects using a Transformer.
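The second step can be illustrated generically: score the local tokens, keep a few likely foreground ones, and run a small Transformer over them. Everything in the sketch below (the scoring head, top-k selection, and Transformer configuration) is illustrative rather than EAN's actual architecture.

```python
import torch
import torch.nn as nn


class SparseForegroundTransformer(nn.Module):
    """Illustrative module: score spatial tokens, keep the top-k 'foreground'
    ones, and let a small Transformer model their interactions."""

    def __init__(self, dim: int = 64, k: int = 8, heads: int = 4, layers: int = 2):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)
        encoder_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) local spatio-temporal features.
        scores = self.score(tokens).squeeze(-1)                    # (batch, num_tokens)
        topk = scores.topk(self.k, dim=1).indices                  # likely foreground tokens
        idx = topk.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        selected = tokens.gather(1, idx)                           # (batch, k, dim)
        return self.encoder(selected).mean(dim=1)                  # pooled global representation


if __name__ == "__main__":
    model = SparseForegroundTransformer()
    feats = torch.randn(2, 196, 64)   # e.g. 14x14 spatial tokens per clip
    print(model(feats).shape)         # torch.Size([2, 64])
```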
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object Segmentation (VOS).
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance in both speed and accuracy on the DAVIS benchmark without complicated bells and whistles, running at 0.14 seconds per frame (about 7 frames per second) with a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.