Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free
Videos
- URL: http://arxiv.org/abs/2304.01186v2
- Date: Wed, 3 Jan 2024 09:10:12 GMT
- Title: Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free
Videos
- Authors: Yue Ma, Yingqing He, Xiaodong Cun, Xintao Wang, Siran Chen, Ying Shan,
Xiu Li, Qifeng Chen
- Abstract summary: In this work, we utilize easily obtained datasets (i.e., image-pose pairs and pose-free videos) and a pre-trained text-to-image (T2I) model to obtain pose-controllable character videos.
Specifically, in the first stage, only keypoint-image pairs are used for controllable text-to-image generation.
In the second stage, we finetune the motion modeling of the above network on a pose-free video dataset by adding learnable temporal self-attention and reformed cross-frame self-attention blocks.
- Score: 107.65147103102662
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating text-editable and pose-controllable character videos is in
strong demand for creating various digital humans. Nevertheless, this task has
been restricted by the absence of a comprehensive dataset with paired
video-pose captions and of generative prior models for videos. In this work,
we design a novel two-stage training scheme that can utilize easily obtained
datasets (i.e., image-pose pairs and pose-free videos) and a pre-trained
text-to-image (T2I) model to obtain pose-controllable character videos.
Specifically, in the first stage, only keypoint-image pairs are used for
controllable text-to-image generation. We learn a zero-initialized
convolutional encoder to encode the pose information. In the second stage, we
finetune the motion modeling of the above network on a pose-free video dataset
by adding learnable temporal self-attention and reformed cross-frame
self-attention blocks. Powered by our new designs, our method successfully
generates continuously pose-controllable character videos while keeping the
editing and concept-composition abilities of the pre-trained T2I model. The
code and models will be made publicly available.
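Below is a minimal PyTorch sketch of the two components the abstract describes: a zero-initialized convolutional encoder that injects pose features into the frozen T2I backbone (stage 1), and a learnable temporal self-attention block added for video finetuning (stage 2). The module names, channel sizes, pose-map format, and the way the pose features are merged with the UNet features are illustrative assumptions, not the authors' released architecture.

```python
# Sketch of the two-stage design described in the abstract (assumptions noted above).
import torch
import torch.nn as nn


def zero_module(m: nn.Module) -> nn.Module:
    """Zero-init so the new branch is a no-op at the start of training."""
    for p in m.parameters():
        nn.init.zeros_(p)
    return m


class PoseEncoder(nn.Module):
    """Stage 1: encode a keypoint/pose map into residual features."""

    def __init__(self, pose_channels: int = 3, feat_channels: int = 320):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(pose_channels, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, feat_channels, 3, stride=2, padding=1), nn.SiLU(),
            # Final zero-initialized conv: pose control fades in gradually.
            zero_module(nn.Conv2d(feat_channels, feat_channels, 1)),
        )

    def forward(self, pose_map: torch.Tensor) -> torch.Tensor:
        return self.net(pose_map)  # added to the T2I UNet features


class TemporalSelfAttention(nn.Module):
    """Stage 2: attention over the frame axis, with zero-initialized output."""

    def __init__(self, channels: int = 320, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj = zero_module(nn.Linear(channels, channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height*width, channels) video feature map
        b, f, hw, c = x.shape
        tokens = x.permute(0, 2, 1, 3).reshape(b * hw, f, c)  # attend over frames
        h = self.norm(tokens)
        h, _ = self.attn(h, h, h)
        tokens = tokens + self.proj(h)  # residual; starts as identity
        return tokens.reshape(b, hw, f, c).permute(0, 2, 1, 3)


if __name__ == "__main__":
    pose = torch.randn(2, 3, 64, 64)
    feats = PoseEncoder()(pose)              # (2, 320, 16, 16)
    video = torch.randn(2, 8, 16 * 16, 320)  # 8 frames of UNet-sized features
    out = TemporalSelfAttention()(video)     # same shape as the input
    print(feats.shape, out.shape)
```

The zero initialization is the design choice the abstract highlights: the new pose and temporal branches start out as no-ops, so training begins from the intact pre-trained T2I model and the control signals are learned gradually.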
Related papers
- PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control [22.253448372833617]
PoseCrafter is a one-shot method for personalized video synthesis following flexible pose control.
Built upon Stable Diffusion and ControlNet, we carefully design an inference process to produce high-quality videos.
arXiv Detail & Related papers (2024-05-23T13:53:50Z)
- Investigating the Effectiveness of Cross-Attention to Unlock Zero-Shot Editing of Text-to-Video Diffusion Models [52.28245595257831]
We show that, despite the limitations of current T2V models, cross-attention guidance can be a promising approach for editing videos.
arXiv Detail & Related papers (2024-04-08T13:40:01Z) - AnimateZero: Video Diffusion Models are Zero-Shot Image Animators [63.938509879469024]
We propose AnimateZero to unveil the pre-trained text-to-video diffusion model, i.e., AnimateDiff.
For appearance control, we borrow intermediate latents and their features from the text-to-image (T2I) generation.
For temporal control, we replace the global temporal attention of the original T2V model with our proposed positional-corrected window attention.
arXiv Detail & Related papers (2023-12-06T13:39:35Z) - LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [44.220329202024494]
We present a few-shot-based tuning framework, LAMP, which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8~16 videos on a single GPU.
Specifically, we design a first-frame-conditioned pipeline that uses an off-the-shelf text-to-image model for content generation.
To capture features along the temporal dimension, we expand the pretrained 2D convolution layers of the T2I model into our novel temporal-spatial motion learning layers (a minimal sketch of this 2D-to-temporal inflation appears after this list).
arXiv Detail & Related papers (2023-10-16T19:03:19Z)
- Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators [70.17041424896507]
Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets.
We propose a new task of zero-shot text-to-video generation using existing text-to-image synthesis methods.
Our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.
arXiv Detail & Related papers (2023-03-23T17:01:59Z)
- Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [31.882356164068753]
To reproduce the success of text-to-image (T2I) generation, recent works in text-to-video (T2V) generation employ massive video datasets.
We propose Tune-A-Video, which is capable of producing temporally coherent videos across various applications.
arXiv Detail & Related papers (2022-12-22T09:43:36Z)
- Make-A-Video: Text-to-Video Generation without Text-Video Data [69.20996352229422]
Make-A-Video is an approach for translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V).
We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules.
In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation.
arXiv Detail & Related papers (2022-09-29T13:59:46Z)
- Self-Supervised Equivariant Scene Synthesis from Video [84.15595573718925]
We propose a framework to learn scene representations from video that are automatically delineated into background, characters, and animations.
After training, we can manipulate image encodings in real time to create unseen combinations of the delineated components.
We demonstrate results on three datasets: Moving MNIST with backgrounds, 2D video game sprites, and Fashion Modeling.
arXiv Detail & Related papers (2021-02-01T14:17:31Z)
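Referenced from the LAMP and Make-A-Video entries above, here is a minimal sketch of the "inflate a pretrained 2D convolution into a spatial-temporal layer" idea: a factorized pseudo-3D convolution that reuses the pretrained spatial kernel and adds an identity-initialized temporal 1D convolution. The class name, kernel size, and initialization are illustrative assumptions, not the exact layers from either paper.

```python
# Pseudo-3D inflation sketch (assumptions noted above).
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    def __init__(self, spatial_conv: nn.Conv2d):
        super().__init__()
        self.spatial = spatial_conv  # pretrained 2D kernel, reused as-is
        c = spatial_conv.out_channels
        # Temporal conv over the frame axis; identity-initialized so the
        # inflated network reproduces the original T2I model at step 0.
        self.temporal = nn.Conv1d(c, c, kernel_size=3, padding=1)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, channels, height, width)
        b, f, c, h, w = x.shape
        y = self.spatial(x.reshape(b * f, c, h, w))             # per-frame 2D conv
        _, c2, h2, w2 = y.shape
        y = y.reshape(b, f, c2, h2, w2).permute(0, 3, 4, 2, 1)  # (b, h, w, c, f)
        y = self.temporal(y.reshape(b * h2 * w2, c2, f))        # mix frames
        return y.reshape(b, h2, w2, c2, f).permute(0, 4, 3, 1, 2)


if __name__ == "__main__":
    conv2d = nn.Conv2d(4, 8, 3, padding=1)    # stands in for a pretrained layer
    video = torch.randn(2, 6, 4, 32, 32)      # 6-frame latent video
    print(Pseudo3DConv(conv2d)(video).shape)  # torch.Size([2, 6, 8, 32, 32])
```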
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.